Design a Robust ETL Pipeline Architecture for Your Data Platform
Design a complete ETL pipeline architecture covering extraction, transformation, and loading strategies, along with error handling and governance.
The Prompt
You are a senior data engineer with 10+ years of experience designing scalable ETL (Extract, Transform, Load) pipelines. I need you to design a comprehensive ETL pipeline architecture for the following scenario:
**Data Context:**
- Source systems: [LIST_OF_SOURCE_SYSTEMS, e.g., PostgreSQL, REST APIs, CSV files]
- Target destination: [TARGET_DATA_WAREHOUSE, e.g., Snowflake, BigQuery, Redshift]
- Data volume: [ESTIMATED_DAILY_DATA_VOLUME, e.g., 50GB/day]
- Refresh frequency: [BATCH_OR_STREAMING, e.g., hourly batch, real-time streaming]
- Primary use case: [BUSINESS_USE_CASE, e.g., customer analytics dashboard, financial reporting]
**Please provide the following in your design:**
1. **Architecture Diagram Description**: Describe the end-to-end pipeline flow, including each component and how they connect. Specify which tools or services you recommend (e.g., Apache Airflow, dbt, Kafka, Fivetran) and justify each choice.
2. **Extraction Layer**: Detail the extraction strategy for each source system, including connection methods, incremental vs. full load logic, and change data capture (CDC) considerations.
3. **Transformation Layer**: Outline the transformation steps, including data cleansing, deduplication, schema mapping, business logic application, and the dimensional modeling approach (star schema, snowflake schema, or one big table/OBT).
4. **Loading Strategy**: Specify the loading pattern (upsert, append, truncate-and-reload), partitioning strategy, and indexing recommendations for the target warehouse.
5. **Error Handling & Monitoring**: Design a robust error-handling framework including retry logic, dead-letter queues, data validation checkpoints, alerting mechanisms, and logging standards.
6. **Scalability & Performance**: Address how the pipeline handles data volume growth, parallel processing, resource optimization, and backfill scenarios.
7. **Data Governance**: Include lineage tracking, metadata management, access controls, and PII handling procedures.
Format the output with clear headers, bullet points, and include a sample DAG or workflow definition in pseudocode where applicable.
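To make the expected output concrete, the sketches below illustrate items 2, 4, and 5 above and the workflow definition requested in the output format. They are minimal, hedged examples of what the prompt should elicit, not a reference implementation; every DAG, table, column, and helper name in them is an illustrative placeholder. First, an Apache Airflow-style DAG of the shape the prompt asks for, assuming an hourly batch refresh:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    """Pull new or changed rows from the source system (incremental logic lives here)."""
    ...  # placeholder


def transform_orders(**context):
    """Cleanse, deduplicate, and map extracted data to the warehouse schema."""
    ...  # placeholder


def load_orders(**context):
    """Upsert the transformed batch into the target warehouse."""
    ...  # placeholder


with DAG(
    dag_id="orders_hourly_etl",           # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",          # assumes an hourly batch refresh frequency
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)
    load = PythonOperator(task_id="load", python_callable=load_orders)

    extract >> transform >> load
```

Keeping extract, transform, and load as separate tasks makes retries and backfills cheaper, since a failed load can be rerun without re-extracting from the source.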
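For the extraction layer (item 2), a watermark-based incremental pull is a common baseline when full change data capture is not available. The sketch below assumes a psycopg2-style DB-API connection, a reliable `updated_at` column on the source table, and a small `etl_state` table that persists the last loaded watermark; all of these are assumptions, not fixed requirements.

```python
from datetime import datetime, timezone


def extract_incremental(source_conn, state_conn, table: str):
    """Fetch only rows changed since the last successful run (watermark pattern)."""
    # Look up the last watermark we successfully loaded for this table.
    with state_conn.cursor() as cur:
        cur.execute("SELECT last_watermark FROM etl_state WHERE table_name = %s", (table,))
        row = cur.fetchone()
        last_watermark = row[0] if row else datetime(1970, 1, 1, tzinfo=timezone.utc)

    # Pull only rows updated after that watermark. The table name must come from a
    # trusted allowlist, since it is interpolated directly into the SQL.
    with source_conn.cursor() as cur:
        cur.execute(
            f"SELECT *, updated_at FROM {table} WHERE updated_at > %s ORDER BY updated_at",
            (last_watermark,),
        )
        rows = cur.fetchall()

    # updated_at is selected again as the last column so the new watermark can be
    # read from r[-1] without knowing the table's full column layout.
    new_watermark = max((r[-1] for r in rows), default=last_watermark)
    return rows, new_watermark
```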
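For the loading strategy (item 4), an upsert is typically expressed as a `MERGE` from a staging table into the target. The SQL below follows the general shape supported by warehouses such as Snowflake and BigQuery, but exact syntax and staging mechanics differ by platform, and the schema and table names are placeholders.

```python
# Target and staging table names below are illustrative placeholders.
MERGE_SQL = """
MERGE INTO analytics.dim_customer AS target
USING staging.dim_customer_batch AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
    email = source.email,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, email, updated_at)
    VALUES (source.customer_id, source.email, source.updated_at)
"""


def load_upsert(warehouse_conn):
    """Apply a staged batch to the target table with insert-or-update semantics."""
    with warehouse_conn.cursor() as cur:  # assumes a DB-API-style warehouse connection
        cur.execute(MERGE_SQL)
    warehouse_conn.commit()
```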
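For error handling (item 5), the prompt asks for retry logic, dead-letter handling, and validation checkpoints. Here is a minimal sketch of those three ideas, with hypothetical field names and a local JSONL file standing in for a real dead-letter queue:

```python
import json
import time


def validate_batch(records):
    """Split a batch into rows that pass basic checks and rows that fail them."""
    valid, rejected = [], []
    for rec in records:
        # Hypothetical checks: customer_id must be present and order_total non-negative.
        if rec.get("customer_id") and rec.get("order_total", 0) >= 0:
            valid.append(rec)
        else:
            rejected.append(rec)
    return valid, rejected


def write_dead_letter(rejected, path="rejected_records.jsonl"):
    """Persist failed records for later inspection instead of silently dropping them."""
    with open(path, "a", encoding="utf-8") as fh:
        for rec in rejected:
            fh.write(json.dumps(rec) + "\n")


def with_retries(step, attempts=3, backoff_seconds=30):
    """Retry a flaky step (e.g., a warehouse load) before surfacing the failure."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # let the orchestrator mark the task failed and fire alerts
            time.sleep(backoff_seconds)
```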
Tips for Better Results
Be as specific as possible about your source systems and data formats to get tailored extraction strategies rather than generic advice.
Include any existing infrastructure constraints (e.g., cloud provider, budget limits, team skill set) so the design is realistic and implementable.
Mention compliance requirements like GDPR or HIPAA upfront so PII handling and governance are baked into the design from the start.
Use Cases
Data engineers and architects should use this prompt when planning a new ETL pipeline or modernizing a legacy data-integration workflow; it helps ensure every critical design consideration is covered.