Build a Comprehensive Data Cleaning Pipeline for Raw Datasets

Design a robust, end-to-end data cleaning pipeline with validation rules, logging, automation, and quality checks for any dataset.

๐Ÿ“ The Prompt

You are a senior data engineer. Design a complete, step-by-step data cleaning pipeline for a [DATASET_TYPE] dataset with approximately [NUMBER_OF_ROWS] rows and [NUMBER_OF_COLUMNS] columns. The dataset is sourced from [DATA_SOURCE] and will be used for [END_PURPOSE]. Please structure the pipeline with the following stages:

1. **Initial Assessment**: Outline how to profile the dataset, including checking data types, null percentages, duplicate counts, and basic statistical summaries.
2. **Standardization**: Define rules for standardizing [KEY_COLUMNS], including consistent formatting for dates, strings (case normalization, trimming whitespace), and numeric precision.
3. **Deduplication Strategy**: Propose a method for identifying and removing duplicate records, specifying whether to use exact matching or fuzzy matching and why.
4. **Validation Rules**: Create at least 5 domain-specific validation rules for [DOMAIN] data (e.g., valid ranges, referential integrity, regex patterns for fields like emails or phone numbers).
5. **Logging & Auditability**: Describe how to log every transformation applied, including before/after snapshots and row-level change tracking.
6. **Automation & Scheduling**: Recommend tools or frameworks (e.g., Apache Airflow, dbt, Pandas) suitable for [TECH_STACK] to automate this pipeline on a [FREQUENCY] basis.
7. **Quality Assurance**: Propose at least 3 post-cleaning quality checks with pass/fail criteria.

For each stage, provide example Python or SQL code snippets where applicable. Highlight common pitfalls specific to [DATASET_TYPE] data and how to avoid them. Finally, provide a summary checklist that a team member can follow to execute the pipeline manually if automation fails.
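The prompt asks the model for code snippets at each stage; the sketches below illustrate roughly what to expect back. For Stage 1 (Initial Assessment), a minimal pandas profiling pass; the `raw_data.csv` path is a placeholder for your own extract:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, null percentage, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": (df.isna().mean() * 100).round(2),
        "n_distinct": df.nunique(),
    })

df = pd.read_csv("raw_data.csv")  # placeholder path for the raw extract
print(f"shape: {df.shape}")
print(f"exact duplicate rows: {df.duplicated().sum()}")
print(profile(df))
print(df.describe(include="all"))  # basic statistical summaries
```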
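For Stage 2 (Standardization), a sketch covering whitespace trimming, case normalization, date parsing, and numeric precision; the `signup_date` and `amount` columns are hypothetical stand-ins for your [KEY_COLUMNS]:

```python
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Trim whitespace and normalize case for every string column.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip().str.lower()
    # Parse dates into one canonical dtype; unparseable values become
    # NaT instead of raising, so they can be reviewed later.
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    # Enforce numeric precision (e.g., currency to 2 decimal places).
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").round(2)
    return df
```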
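For Stage 3 (Deduplication Strategy), exact matching via `drop_duplicates` next to a dependency-free fuzzy comparator from the standard library; dedicated fuzzy-matching libraries exist, but this keeps the sketch self-contained:

```python
import pandas as pd
from difflib import SequenceMatcher

def dedupe_exact(df: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Exact matching: fast and deterministic once keys are standardized."""
    return df.drop_duplicates(subset=keys, keep="first")

def is_fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy matching catches near-duplicates such as 'Acme Inc' vs
    'Acme Inc.', but pairwise comparisons grow quadratically; reserve
    it for small candidate blocks (e.g., rows grouped by postcode)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```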
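For Stage 4 (Validation Rules), a sketch that evaluates each rule as a boolean column so violations can be quarantined rather than silently dropped; the column names, regex, and thresholds are illustrative:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """One boolean column per rule; False flags a violation."""
    checks = pd.DataFrame(index=df.index)
    checks["email_format"] = df["email"].fillna("").str.match(
        r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
    )
    checks["amount_non_negative"] = df["amount"] >= 0
    checks["signup_not_in_future"] = df["signup_date"] <= pd.Timestamp.now()
    return checks

# Quarantine failing rows for review instead of deleting them:
# failed = df[~validate(df).all(axis=1)]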
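For Stage 5 (Logging & Auditability), one way to wrap each transformation so row-count deltas and dropped-row snapshots are recorded; the audit path in the comment is hypothetical:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("cleaning")

def logged_step(df: pd.DataFrame, name: str, fn) -> pd.DataFrame:
    """Apply one transformation and record before/after row counts,
    plus which rows disappeared, so every change is auditable."""
    before = df
    after = fn(before)
    dropped = before.index.difference(after.index)
    log.info("step=%s rows_before=%d rows_after=%d rows_dropped=%d",
             name, len(before), len(after), len(dropped))
    # Persist snapshots if full auditability is required (path is illustrative):
    # before.loc[dropped].to_parquet(f"audit/{name}_dropped.parquet")
    return after

# Usage: df = logged_step(df, "dedupe", lambda d: d.drop_duplicates())
```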
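For Stage 6 (Automation & Scheduling), a minimal Apache Airflow DAG sketch, assuming Airflow 2.4+ (where the `schedule` argument replaced `schedule_interval`); dbt or plain cron are equally valid answers depending on [TECH_STACK]:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_cleaning() -> None:
    """Entry point chaining the stages above:
    profile -> standardize -> dedupe -> validate -> QA."""
    ...

with DAG(
    dag_id="data_cleaning_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # substitute your [FREQUENCY]
    catchup=False,
):
    PythonOperator(task_id="clean", python_callable=run_cleaning)
```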
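For Stage 7 (Quality Assurance), three pass/fail checks returned as booleans; the `id` and `email` key columns are assumptions to adapt to your schema:

```python
import pandas as pd

def qa_checks(df: pd.DataFrame) -> dict[str, bool]:
    """Post-cleaning checks with pass/fail semantics; the pipeline
    should halt or alert if any check fails."""
    return {
        "key_columns_complete": bool(df[["id", "email"]].notna().all().all()),
        "ids_unique": not df["id"].duplicated().any(),
        "row_count_plausible": len(df) > 0,
    }

# results = qa_checks(df)
# assert all(results.values()), f"QA failed: {results}"
```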

💡 Tips for Better Results

Always specify your dataset type and domain so the pipeline includes relevant validation rules. Include your tech stack to get actionable code snippets rather than generic advice. Run the pipeline on a small sample first before processing the full dataset.
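A quick way to do that dry run, assuming a pandas-based pipeline; `run_pipeline` is a hypothetical stand-in for whatever entry point you assemble from the stages above:

```python
import pandas as pd

# Dry-run on a 1% random sample before committing to the full dataset.
raw = pd.read_csv("raw_data.csv")               # placeholder path
sample = raw.sample(frac=0.01, random_state=42)
cleaned = run_pipeline(sample)                  # hypothetical pipeline entry point
```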

🎯 Use Cases

Data engineers and analysts use this when building repeatable, production-grade data cleaning workflows for new or messy data sources.

🔗 Related Prompts

📊 Data & Analytics · intermediate

Write Complex SQL Queries

Generate optimized SQL queries for complex analysis with CTEs, JOINs, and performance tips.

📊 Data & Analytics · intermediate

Python Data Analysis Script

Generate a complete Python data analysis pipeline with cleaning, visualization, and insights.

📊 Data & Analytics · intermediate

Create a Comprehensive Data Quality Checklist for Your Dataset

Generate a tailored data quality checklist with SQL validation queries, severity levels, and a scoring framework for any dataset.

📊 Data & Analytics · intermediate

Build an RFM Customer Segmentation Model for Targeted Marketing

Create a complete RFM customer segmentation model with scoring logic, code implementation, and marketing strategies.

📊 Data & Analytics · intermediate

Design a Missing Value Imputation Strategy for Your Dataset

Get a tailored missing value imputation strategy with diagnosis, method selection, Python code, and validation for your dataset.