Design a Robust ETL Pipeline Architecture for Your Data Warehouse
Design a scalable ETL pipeline architecture with extraction strategies, transformations, error handling, and orchestration plans.
The Prompt
You are a senior data engineer with 10+ years of experience designing scalable ETL (Extract, Transform, Load) pipelines. Help me design a comprehensive ETL pipeline for the following scenario:
**Data Context:**
- Source systems: [LIST_OF_SOURCE_SYSTEMS, e.g., PostgreSQL, Salesforce API, CSV files from SFTP]
- Target destination: [TARGET_DATA_WAREHOUSE, e.g., Snowflake, BigQuery, Redshift]
- Data volume: approximately [DATA_VOLUME, e.g., 5 million rows/day]
- Update frequency: [FREQUENCY, e.g., hourly, daily, real-time]
- Primary business use case: [USE_CASE, e.g., marketing analytics, financial reporting]
**Please provide the following in your design:**
1. **Architecture Overview**: Recommend specific tools and technologies for each ETL stage (extraction, transformation, loading) and justify your choices based on the data volume and frequency requirements.
2. **Data Extraction Strategy**: Define the extraction method for each source system (full load vs. incremental load), including how changes will be tracked (e.g., change data capture (CDC), timestamp columns).
3. **Transformation Layer**: Outline the key transformation steps including data cleansing rules, deduplication logic, schema mapping, and any staging table structures needed. Provide example SQL or pseudocode for the most complex transformation.
4. **Loading Strategy**: Specify the loading pattern (upsert, append, truncate-and-reload) and explain partitioning or clustering strategies for optimal query performance.
5. **Error Handling & Monitoring**: Design a robust error-handling framework including retry logic, dead-letter queues, data quality checks, alerting mechanisms, and logging standards.
6. **Orchestration & Scheduling**: Recommend an orchestration tool and provide a DAG (Directed Acyclic Graph) structure showing task dependencies.
7. **Data Quality Gates**: Define at least 5 specific data quality checks (row count validation, null checks, referential integrity, etc.) that should run at each pipeline stage.
Format the output with clear headers and text-described diagrams, and include a summary table of recommended tools with estimated complexity ratings.
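Example Sketches
The prompt above asks the model to produce a design; the sketches below illustrate, in hedged form, the kind of code a good response might include for items 2 through 7. All table names, column names, file paths, and tool choices in these sketches are illustrative assumptions, not requirements of the prompt.
For item 2, a minimal Python sketch of timestamp-based incremental extraction using a persisted watermark. The `orders` table, its `updated_at` column, and the local state file are assumed for illustration:
```python
# Watermark-based incremental extraction (illustrative; table and
# column names are assumptions, not part of the prompt).
import json
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("extract_state.json")  # assumed watermark location

def load_watermark() -> str:
    """Return the last successfully extracted timestamp (ISO 8601)."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_updated_at"]
    return "1970-01-01T00:00:00+00:00"  # first run falls back to a full load

def save_watermark(ts: str) -> None:
    STATE_FILE.write_text(json.dumps({"last_updated_at": ts}))

def build_extract_query(watermark: str) -> str:
    # Pull only rows changed since the last run; this relies on the
    # source table having a trustworthy updated_at column.
    return (
        "SELECT * FROM orders "
        f"WHERE updated_at > '{watermark}' "
        "ORDER BY updated_at"
    )

if __name__ == "__main__":
    print(build_extract_query(load_watermark()))
    # Advance the watermark only after the downstream load succeeds:
    save_watermark(datetime.now(timezone.utc).isoformat())
```
Note the ordering: advance the watermark only after a confirmed load, since advancing it on extraction alone can silently drop rows when a later stage fails.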
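For item 3's deduplication logic, one common pattern is a windowed `ROW_NUMBER()` that keeps the newest version of each business key. Here it is held in a Python string; the staging table `stg_orders` and the key `order_id` are assumptions:
```python
# Deduplication via ROW_NUMBER(): keep the most recent row per
# business key. Table and column names are illustrative.
DEDUP_SQL = """
CREATE OR REPLACE TABLE stg_orders_deduped AS
SELECT *
FROM (
    SELECT
        s.*,
        ROW_NUMBER() OVER (
            PARTITION BY order_id        -- business key
            ORDER BY updated_at DESC     -- newest version wins
        ) AS rn
    FROM stg_orders s
) ranked
WHERE rn = 1;  -- the helper rn column can be dropped in a later step
"""

if __name__ == "__main__":
    print(DEDUP_SQL)
```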
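For item 4's upsert pattern, a warehouse `MERGE` keyed on the business key that updates only when the source row is newer. The schema and column list are assumptions, and the exact `MERGE` dialect varies by warehouse:
```python
# Snowflake/BigQuery-style MERGE upsert (illustrative schema).
MERGE_SQL = """
MERGE INTO analytics.orders AS tgt
USING stg_orders_deduped AS src
    ON tgt.order_id = src.order_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN UPDATE SET
    status     = src.status,
    amount     = src.amount,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, updated_at)
    VALUES (src.order_id, src.status, src.amount, src.updated_at);
"""
```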
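For item 5, a sketch of retry-with-backoff that dead-letters a record after the final failed attempt. The local JSONL file is a stand-in for a real dead-letter queue (e.g., SQS, Kafka, or an error table):
```python
# Retry with exponential backoff; exhausted records go to a
# dead-letter sink (a local JSONL file here, as a stand-in).
import json
import time
from typing import Any, Callable

DEAD_LETTER_FILE = "dead_letters.jsonl"  # assumed sink

def run_with_retry(
    task: Callable[[dict], Any],
    record: dict,
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
) -> Any:
    """Run task(record); dead-letter the record if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(record)
        except Exception as exc:  # production code should catch narrower errors
            if attempt == max_attempts:
                with open(DEAD_LETTER_FILE, "a") as f:
                    f.write(json.dumps({"record": record, "error": str(exc)}) + "\n")
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```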
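For item 6, a sketch of the task dependencies as an Airflow 2.x DAG, assuming Airflow is the chosen orchestrator (the prompt leaves the tool open). The task callables are placeholders:
```python
# Hourly ETL DAG: parallel extracts fan in to transform -> quality -> load.
# Assumes Airflow 2.x; task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _placeholder(**_: object) -> None:
    pass  # real extract/transform/load logic goes here

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # match the FREQUENCY placeholder
    catchup=False,
) as dag:
    extract_db = PythonOperator(task_id="extract_postgres", python_callable=_placeholder)
    extract_api = PythonOperator(task_id="extract_salesforce", python_callable=_placeholder)
    transform = PythonOperator(task_id="transform_and_dedupe", python_callable=_placeholder)
    quality = PythonOperator(task_id="data_quality_gates", python_callable=_placeholder)
    load = PythonOperator(task_id="load_warehouse", python_callable=_placeholder)

    # Extractions run in parallel, then fan in to the shared stages.
    [extract_db, extract_api] >> transform >> quality >> load
```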
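For item 7, the quality gates can be expressed as SQL probes that should return no rows (the row-count gate inverts this: it returns a row only when the staging table is empty). The `run_query` callable and all table names are assumptions, and the exact SQL may need adapting per warehouse:
```python
# Data quality gates as zero-row SQL probes (illustrative names).
QUALITY_CHECKS = {
    "row_count_nonzero":
        "SELECT 1 WHERE (SELECT COUNT(*) FROM stg_orders) = 0",
    "no_null_keys":
        "SELECT order_id FROM stg_orders WHERE order_id IS NULL",
    "no_duplicate_keys":
        "SELECT order_id FROM stg_orders GROUP BY order_id HAVING COUNT(*) > 1",
    "referential_integrity":
        "SELECT s.customer_id FROM stg_orders s "
        "LEFT JOIN dim_customers c ON s.customer_id = c.customer_id "
        "WHERE c.customer_id IS NULL",
    "amount_non_negative":
        "SELECT order_id FROM stg_orders WHERE amount < 0",
}

def failed_checks(run_query) -> list[str]:
    """run_query is an assumed callable that executes SQL against the
    warehouse and returns result rows; any returned row means failure."""
    return [name for name, sql in QUALITY_CHECKS.items() if run_query(sql)]
```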
Tips for Better Results
Be as specific as possible about your source systems and data formats; the more detail you provide, the more tailored the pipeline design will be.
Include any existing infrastructure or tool constraints (e.g., 'we already use Airflow' or 'must stay within AWS ecosystem') to get realistic recommendations.
Follow up by asking the AI to generate actual code templates for the most critical pipeline components.
Use Cases
Data engineers and architects who need to design or refactor ETL pipelines for new data warehouse implementations or migrations. Ideal during the planning phase of a data infrastructure project.