Design a Missing Value Imputation Strategy for Your Dataset

Get a tailored missing value imputation strategy with diagnosis, method selection, Python code, and validation for your dataset.

📝 The Prompt

Act as a data science consultant specializing in data preprocessing. I have a [DATASET_TYPE] dataset with [NUMBER_OF_ROWS] records used for [ANALYSIS_GOAL]. The following columns have missing values:

- [COLUMN_1]: [MISSING_PERCENTAGE_1]% missing, data type: [DATA_TYPE_1]
- [COLUMN_2]: [MISSING_PERCENTAGE_2]% missing, data type: [DATA_TYPE_2]
- [COLUMN_3]: [MISSING_PERCENTAGE_3]% missing, data type: [DATA_TYPE_3]

Please provide a comprehensive imputation strategy by addressing:

1. **Missingness Diagnosis**: Explain how to determine whether each column's missing data is MCAR, MAR, or MNAR. Provide specific statistical tests or visualizations to confirm.
2. **Strategy Selection**: For each column, recommend the most appropriate imputation method (e.g., mean/median, mode, KNN imputation, MICE, regression-based, forward/backward fill, or domain-specific rules). Justify each choice based on the data type, missingness pattern, and percentage.
3. **Implementation**: Provide Python code using pandas and scikit-learn for each recommended method.
4. **Validation**: Describe how to evaluate imputation quality, including distribution comparison before and after imputation, impact on downstream model performance, and sensitivity analysis.
5. **When to Drop**: Define clear thresholds or criteria for when it is better to drop rows or columns entirely rather than impute.

Include a comparison table summarizing the pros, cons, and ideal use cases for at least 5 imputation techniques.
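To fill in the bracketed placeholders, you need each column's missing percentage and data type. The snippet below is a minimal sketch of one way to get them with pandas, assuming your data is already loaded into a DataFrame. The helper name `summarize_missingness` and the toy columns (`age`, `city`, `income`) are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def summarize_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Return percent missing and dtype for every column that has missing values."""
    summary = pd.DataFrame({
        "missing_pct": df.isna().mean().mul(100).round(1),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only the columns that actually contain at least one missing value
    has_missing = df.isna().any()
    return summary.loc[has_missing].sort_values("missing_pct", ascending=False)


# Hypothetical toy data standing in for your own dataset
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, np.nan],
    "city": ["Oslo", "Lima", None, "Pune", "Lima"],
    "income": [52000.0, 61000.0, 58000.0, 47000.0, 73000.0],
})

summary = summarize_missingness(df)
print(summary)
```

Each row of the printed summary maps to one `[COLUMN_n]` line in the prompt. Columns with no missing values are left out.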

💡 Tips for Better Results

- State the missing percentage and data type for each column: the right imputation strategy differs drastically between 5% and 50% missingness.
- Always validate imputation by comparing each column's distribution before and after, to check that you haven't introduced bias.
- Consider adding a missingness indicator column as an extra feature for modeling.
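The validation and indicator-column tips can be sketched in a few lines. This is an illustrative example under stated assumptions, not a recommended pipeline: the `income` column is synthetic, roughly 20% of its values are removed at random, and median imputation is used only because it is the simplest method to demonstrate.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic, hypothetical data: a skewed numeric column with ~20% of values removed at random
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10.5, sigma=0.5, size=500)})
missing_mask = rng.random(500) < 0.2
df.loc[missing_mask, "income"] = np.nan

# Missingness indicator: 1 where the value was originally missing, 0 otherwise
df["income_was_missing"] = df["income"].isna().astype(int)

# Keep the observed values for the before/after comparison
observed = df["income"].dropna()

# Median imputation with scikit-learn
imputer = SimpleImputer(strategy="median")
df["income_imputed"] = np.asarray(imputer.fit_transform(df[["income"]])).ravel()

# Side-by-side summary statistics before and after imputation
comparison = pd.DataFrame({
    "observed": observed.describe(),
    "imputed": df["income_imputed"].describe(),
})
print(comparison.loc[["count", "mean", "std", "50%"]])
```

In the printed comparison the median is unchanged, but the standard deviation shrinks because every imputed row receives the same value. That is the kind of distortion a before-and-after check is meant to surface, and the indicator column lets a downstream model tell imputed rows from observed ones.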