Design a Missing Value Imputation Strategy for Your Dataset

Get a tailored missing value imputation strategy with diagnosis, method selection, Python code, and validation for your dataset.

๐Ÿ“ The Prompt

Act as a data science consultant specializing in data preprocessing. I have a [DATASET_TYPE] dataset with [NUMBER_OF_ROWS] records used for [ANALYSIS_GOAL]. The following columns have missing values: - [COLUMN_1]: [MISSING_PERCENTAGE_1]% missing, data type: [DATA_TYPE_1] - [COLUMN_2]: [MISSING_PERCENTAGE_2]% missing, data type: [DATA_TYPE_2] - [COLUMN_3]: [MISSING_PERCENTAGE_3]% missing, data type: [DATA_TYPE_3] Please provide a comprehensive imputation strategy by addressing: 1. **Missingness Diagnosis**: Explain how to determine whether each column's missing data is MCAR, MAR, or MNAR. Provide specific statistical tests or visualizations to confirm. 2. **Strategy Selection**: For each column, recommend the most appropriate imputation method (e.g., mean/median, mode, KNN imputation, MICE, regression-based, forward/backward fill, or domain-specific rules). Justify each choice based on the data type, missingness pattern, and percentage. 3. **Implementation**: Provide Python code using pandas and scikit-learn for each recommended method. 4. **Validation**: Describe how to evaluate imputation quality โ€” including distribution comparison before and after imputation, impact on downstream model performance, and sensitivity analysis. 5. **When to Drop**: Define clear thresholds or criteria for when it is better to drop rows or columns entirely rather than impute. Include a comparison table summarizing the pros, cons, and ideal use cases for at least 5 imputation techniques.

๐Ÿ’ก Tips for Better Results

Clearly state the percentage of missing data and column types โ€” imputation strategies differ drastically between 5% and 50% missingness. Always validate imputation by comparing distributions before and after to ensure you haven't introduced bias. Consider creating a missingness indicator column as an additional feature for modeling.

๐ŸŽฏ Use Cases

Data scientists and analysts use this when preparing datasets with incomplete records for machine learning models or statistical analysis.

๐Ÿ”— Related Prompts

๐Ÿ“Š Data & Analytics intermediate

Write Complex SQL Queries

Generate optimized SQL queries for complex analysis with CTEs, JOINs, and performance tips.

๐Ÿ“Š Data & Analytics intermediate

Python Data Analysis Script

Generate a complete Python data analysis pipeline with cleaning, visualization, and insights.

๐Ÿ“Š Data & Analytics intermediate

Create a Comprehensive Data Quality Checklist for Your Dataset

Generate a tailored data quality checklist with SQL validation queries, severity levels, and a scoring framework for any dataset.

๐Ÿ“Š Data & Analytics intermediate

Build an RFM Customer Segmentation Model for Targeted Marketing

Create a complete RFM customer segmentation model with scoring logic, code implementation, and marketing strategies.

๐Ÿ“Š Data & Analytics intermediate

Build an RFM Customer Segmentation Model

Create a complete RFM customer segmentation model with scoring logic, segment definitions, marketing actions, and code.

๐Ÿ“Š Data & Analytics advanced

Create an Outlier Detection and Handling Framework for Your Data

Build a complete outlier detection and handling framework using statistical, visual, and ML methods with Python code.