Create a Comprehensive Data Quality Checklist for Your Dataset
Generate a tailored data quality checklist with SQL validation queries, severity levels, and a scoring framework for any dataset.
📋 The Prompt
You are a data quality specialist and governance expert. I need you to generate a thorough, actionable data quality checklist tailored to my specific dataset and business context.
**Dataset Details:**
- Dataset name/description: [DATASET_NAME_AND_DESCRIPTION, e.g., customer transactions table with purchase history]
- Number of records (approx.): [RECORD_COUNT, e.g., 2 million rows]
- Key columns/fields: [LIST_KEY_COLUMNS, e.g., customer_id, transaction_date, amount, product_sku, payment_method]
- Data source: [DATA_SOURCE, e.g., production MySQL database, third-party API]
- Business purpose: [HOW_DATA_IS_USED, e.g., revenue forecasting, customer segmentation]
**Generate the checklist organized into these six dimensions of data quality:**
1. **Completeness**: Identify which fields should never be null, acceptable null thresholds for optional fields, and checks for missing records or gaps in time-series data.
2. **Accuracy**: Define validation rules for each key field (e.g., range checks, format validation, regex patterns), cross-reference checks against trusted sources, and outlier detection criteria.
3. **Consistency**: Specify checks for referential integrity, cross-field logical consistency (e.g., end_date > start_date), standardization rules for categorical values, and consistency across related datasets.
4. **Timeliness**: Define expected data freshness SLAs, latency thresholds, checks for stale or delayed records, and monitoring for data arrival patterns.
5. **Uniqueness**: Specify primary key uniqueness checks, duplicate detection rules (exact and fuzzy matching criteria), and deduplication strategies.
6. **Validity**: Define domain-specific business rules, enumerated value checks, and schema conformance validations. (Illustrative example queries for each dimension follow this list.)
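To make the expected output concrete, the sketches below show one minimal example check per dimension. They assume a PostgreSQL dialect and a hypothetical `transactions` table using the placeholder columns above; adapt names and thresholds to your dataset. Starting with completeness:

```sql
-- Completeness: key identifiers should never be null (Critical).
-- `transactions` and its columns are hypothetical, taken from the
-- placeholder example (customer_id, transaction_date, amount, ...).
SELECT COUNT(*) AS null_violations
FROM transactions
WHERE customer_id IS NULL
   OR transaction_date IS NULL
   OR amount IS NULL;
-- Remediation: block the load and trace offending rows to the source.
```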
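An accuracy sketch combining a range check and a format check; the upper bound and SKU pattern are assumptions to tune for your business:

```sql
-- Accuracy: amount must be positive and below an assumed business cap;
-- product_sku must match an assumed SKU pattern (PostgreSQL regex).
SELECT COUNT(*) AS accuracy_violations
FROM transactions
WHERE amount <= 0
   OR amount > 100000
   OR product_sku !~ '^[A-Z0-9-]+$';
```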
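A consistency sketch checking referential integrity against a hypothetical `customers` dimension table:

```sql
-- Consistency: every transaction should reference an existing customer.
SELECT COUNT(*) AS orphaned_transactions
FROM transactions t
LEFT JOIN customers c ON c.customer_id = t.customer_id
WHERE c.customer_id IS NULL;
```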
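A timeliness sketch, assuming a 24-hour freshness SLA:

```sql
-- Timeliness: flag the table as stale if no record has arrived
-- within the assumed 24-hour SLA window.
SELECT MAX(transaction_date) < NOW() - INTERVAL '24 hours' AS is_stale
FROM transactions;
```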
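A uniqueness sketch for exact duplicates, assuming a hypothetical `transaction_id` primary key:

```sql
-- Uniqueness: a primary key value should appear exactly once.
SELECT transaction_id, COUNT(*) AS occurrences
FROM transactions
GROUP BY transaction_id
HAVING COUNT(*) > 1;
```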
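A validity sketch with an enumerated value check; the allowed set is an assumption:

```sql
-- Validity: payment_method must come from the assumed enumerated set.
SELECT COUNT(*) AS invalid_payment_methods
FROM transactions
WHERE payment_method NOT IN ('card', 'cash', 'bank_transfer', 'wallet');
```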
For each check item, provide the following (a worked example appears after this list):
- The specific check description
- A sample SQL query or pseudocode to implement it
- Severity level (Critical / Warning / Info)
- Recommended remediation action
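For reference, a single check item in this format might look like the following (the table name and rule are hypothetical):

**Check**: `amount` must be strictly positive.

```sql
-- Sample implementation (PostgreSQL); a violations count of zero means the check passes.
SELECT COUNT(*) AS violations
FROM transactions
WHERE amount <= 0;
```

**Severity**: Critical
**Remediation**: Quarantine the offending rows and alert the pipeline owner before downstream loads run.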
Finally, suggest a scoring framework that calculates an overall data quality score from 0 to 100 based on weighted dimensions.
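One minimal sketch of such a framework, assuming check results are logged to a hypothetical `dq_check_results` table (one row per check run, with a `dimension` label and a `pass_rate` between 0 and 1) and a `dimension_weights` table whose weights sum to 1.0:

```sql
-- Overall score = 100 * sum(dimension weight * avg pass rate in that dimension).
SELECT ROUND((SUM(w.weight * r.avg_pass_rate) * 100)::numeric, 1) AS dq_score
FROM (
  SELECT dimension, AVG(pass_rate) AS avg_pass_rate
  FROM dq_check_results
  GROUP BY dimension
) AS r
JOIN dimension_weights AS w USING (dimension);
```

Common variations include weighting Critical checks more heavily, or treating any failed Critical check as an automatic cap on the score, depending on how strict the quality gate needs to be.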
💡 Tips for Better Results
- List your most business-critical columns explicitly so the checklist prioritizes validations that matter most to downstream consumers.
- Mention any known data issues or pain points (e.g., duplicate records, timezone mismatches) so the checklist addresses your real-world problems.
- Specify your SQL dialect (e.g., PostgreSQL, BigQuery SQL, T-SQL) in the placeholder details to get copy-paste-ready queries.
🎯 Use Cases
Data analysts, analytics engineers, and data stewards should use this when establishing data quality monitoring for a new or existing dataset, or when onboarding a new data source into their warehouse.