Build a Comprehensive Data Cleaning Pipeline for Raw Datasets

Design a robust, end-to-end data cleaning pipeline with validation rules, logging, automation, and quality checks for any dataset.

๐Ÿ“ The Prompt

You are a senior data engineer. Design a complete, step-by-step data cleaning pipeline for a [DATASET_TYPE] dataset with approximately [NUMBER_OF_ROWS] rows and [NUMBER_OF_COLUMNS] columns. The dataset is sourced from [DATA_SOURCE] and will be used for [END_PURPOSE]. Please structure the pipeline with the following stages:

1. **Initial Assessment**: Outline how to profile the dataset, including checking data types, null percentages, duplicate counts, and basic statistical summaries.
2. **Standardization**: Define rules for standardizing [KEY_COLUMNS], including consistent formatting for dates, strings (case normalization, trimming whitespace), and numeric precision.
3. **Deduplication Strategy**: Propose a method for identifying and removing duplicate records, specifying whether to use exact matching or fuzzy matching and why.
4. **Validation Rules**: Create at least 5 domain-specific validation rules for [DOMAIN] data (e.g., valid ranges, referential integrity, regex patterns for fields like emails or phone numbers).
5. **Logging & Auditability**: Describe how to log every transformation applied, including before/after snapshots and row-level change tracking.
6. **Automation & Scheduling**: Recommend tools or frameworks (e.g., Apache Airflow, dbt, Pandas) suitable for [TECH_STACK] to automate this pipeline on a [FREQUENCY] basis.
7. **Quality Assurance**: Propose at least 3 post-cleaning quality checks with pass/fail criteria.

For each stage, provide example Python or SQL code snippets where applicable. Highlight common pitfalls specific to [DATASET_TYPE] data and how to avoid them. Finally, provide a summary checklist that a team member can follow to execute the pipeline manually if automation fails.
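The prompt asks the model for code snippets at each stage; the sketches below illustrate roughly what to expect back. For Stage 1 (Initial Assessment), a minimal pandas profiling pass; the `raw_data.csv` path is a placeholder for your own extract:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, null percentage, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": (df.isna().mean() * 100).round(2),
        "n_distinct": df.nunique(),
    })

df = pd.read_csv("raw_data.csv")  # placeholder path for the raw extract
print(f"shape: {df.shape}")
print(f"exact duplicate rows: {df.duplicated().sum()}")
print(profile(df))
print(df.describe(include="all"))  # basic statistical summaries
```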
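For Stage 2 (Standardization), a sketch covering whitespace trimming, case normalization, date parsing, and numeric precision; the `signup_date` and `amount` columns are hypothetical stand-ins for your [KEY_COLUMNS]:

```python
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Trim whitespace and normalize case for every string column.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip().str.lower()
    # Parse dates into one canonical dtype; unparseable values become
    # NaT instead of raising, so they can be reviewed later.
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    # Enforce numeric precision (e.g., currency to 2 decimal places).
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").round(2)
    return df
```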
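For Stage 3 (Deduplication Strategy), exact matching via `drop_duplicates` next to a dependency-free fuzzy comparator from the standard library; dedicated fuzzy-matching libraries exist, but this keeps the sketch self-contained:

```python
import pandas as pd
from difflib import SequenceMatcher

def dedupe_exact(df: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Exact matching: fast and deterministic once keys are standardized."""
    return df.drop_duplicates(subset=keys, keep="first")

def is_fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy matching catches near-duplicates such as 'Acme Inc' vs
    'Acme Inc.', but pairwise comparisons grow quadratically; reserve
    it for small candidate blocks (e.g., rows grouped by postcode)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```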
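For Stage 4 (Validation Rules), a sketch that evaluates each rule as a boolean column so violations can be quarantined rather than silently dropped; the column names, regex, and thresholds are illustrative:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """One boolean column per rule; False flags a violation."""
    checks = pd.DataFrame(index=df.index)
    checks["email_format"] = df["email"].fillna("").str.match(
        r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
    )
    checks["amount_non_negative"] = df["amount"] >= 0
    checks["signup_not_in_future"] = df["signup_date"] <= pd.Timestamp.now()
    return checks

# Quarantine failing rows for review instead of deleting them:
# failed = df[~validate(df).all(axis=1)]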
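For Stage 5 (Logging & Auditability), one way to wrap each transformation so row-count deltas and dropped-row snapshots are recorded; the audit path in the comment is hypothetical:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("cleaning")

def logged_step(df: pd.DataFrame, name: str, fn) -> pd.DataFrame:
    """Apply one transformation and record before/after row counts,
    plus which rows disappeared, so every change is auditable."""
    before = df
    after = fn(before)
    dropped = before.index.difference(after.index)
    log.info("step=%s rows_before=%d rows_after=%d rows_dropped=%d",
             name, len(before), len(after), len(dropped))
    # Persist snapshots if full auditability is required (path is illustrative):
    # before.loc[dropped].to_parquet(f"audit/{name}_dropped.parquet")
    return after

# Usage: df = logged_step(df, "dedupe", lambda d: d.drop_duplicates())
```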
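For Stage 6 (Automation & Scheduling), a minimal Apache Airflow DAG sketch, assuming Airflow 2.4+ (where the `schedule` argument replaced `schedule_interval`); dbt or plain cron are equally valid answers depending on [TECH_STACK]:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_cleaning() -> None:
    """Entry point chaining the stages above:
    profile -> standardize -> dedupe -> validate -> QA."""
    ...

with DAG(
    dag_id="data_cleaning_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # substitute your [FREQUENCY]
    catchup=False,
):
    PythonOperator(task_id="clean", python_callable=run_cleaning)
```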
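For Stage 7 (Quality Assurance), three pass/fail checks returned as booleans; the `id` and `email` key columns are assumptions to adapt to your schema:

```python
import pandas as pd

def qa_checks(df: pd.DataFrame) -> dict[str, bool]:
    """Post-cleaning checks with pass/fail semantics; the pipeline
    should halt or alert if any check fails."""
    return {
        "key_columns_complete": bool(df[["id", "email"]].notna().all().all()),
        "ids_unique": not df["id"].duplicated().any(),
        "row_count_plausible": len(df) > 0,
    }

# results = qa_checks(df)
# assert all(results.values()), f"QA failed: {results}"
```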

💡 Tips for Better Results

Always specify your dataset type and domain so the pipeline includes relevant validation rules. Include your tech stack to get actionable code snippets rather than generic advice. Run the pipeline on a small sample first before processing the full dataset.
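A quick way to do that dry run, assuming a pandas-based pipeline; `run_pipeline` is a hypothetical stand-in for whatever entry point you assemble from the stages above:

```python
import pandas as pd

# Dry-run on a 1% random sample before committing to the full dataset.
raw = pd.read_csv("raw_data.csv")               # placeholder path
sample = raw.sample(frac=0.01, random_state=42)
cleaned = run_pipeline(sample)                  # hypothetical pipeline entry point
```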

🎯 Use Cases

Data engineers and analysts use this when building repeatable, production-grade data cleaning workflows for new or messy data sources.

🔗 Related Prompts

📊 Data & Analytics · intermediate

Write Complex SQL Queries

Generate optimized SQL queries for complex analysis with CTEs, JOINs, and performance tips.

📊 Data & Analytics · intermediate

Python Data Analysis Script

Generate a complete Python data analysis pipeline with cleaning, visualization, and insights.

📊 Data & Analytics · intermediate

Create a Comprehensive Data Quality Checklist for Your Dataset

Generate a tailored data quality checklist with SQL validation queries, severity levels, and a scoring framework for any dataset.

📊 Data & Analytics · intermediate

Build an RFM Customer Segmentation Model for Targeted Marketing

Create a complete RFM customer segmentation model with scoring logic, code implementation, and marketing strategies.

📊 Data & Analytics · intermediate

Design a Missing Value Imputation Strategy for Your Dataset

Get a tailored missing value imputation strategy with diagnosis, method selection, Python code, and validation for your dataset.