Design a Robust ETL Pipeline Architecture for Your Data Platform

Design a complete ETL pipeline architecture with extraction, transformation, loading strategies, error handling, and governance.

๐Ÿ“ The Prompt

You are a senior data engineer with 10+ years of experience designing scalable ETL (Extract, Transform, Load) pipelines. I need you to design a comprehensive ETL pipeline architecture for the following scenario:

**Data Context:**

- Source systems: [LIST_OF_SOURCE_SYSTEMS, e.g., PostgreSQL, REST APIs, CSV files]
- Target destination: [TARGET_DATA_WAREHOUSE, e.g., Snowflake, BigQuery, Redshift]
- Data volume: [ESTIMATED_DAILY_DATA_VOLUME, e.g., 50GB/day]
- Refresh frequency: [BATCH_OR_STREAMING, e.g., hourly batch, real-time streaming]
- Primary use case: [BUSINESS_USE_CASE, e.g., customer analytics dashboard, financial reporting]

**Please provide the following in your design:**

1. **Architecture Diagram Description**: Describe the end-to-end pipeline flow, including each component and how they connect. Specify which tools or services you recommend (e.g., Apache Airflow, dbt, Kafka, Fivetran) and justify each choice.
2. **Extraction Layer**: Detail the extraction strategy for each source system, including connection methods, incremental vs. full load logic, and change data capture (CDC) considerations.
3. **Transformation Layer**: Outline the transformation steps, including data cleansing, deduplication, schema mapping, business logic application, and dimensional modeling approach (star schema, snowflake schema, or One Big Table).
4. **Loading Strategy**: Specify the loading pattern (upsert, append, truncate-and-reload), partitioning strategy, and indexing recommendations for the target warehouse.
5. **Error Handling & Monitoring**: Design a robust error-handling framework including retry logic, dead-letter queues, data validation checkpoints, alerting mechanisms, and logging standards.
6. **Scalability & Performance**: Address how the pipeline handles data volume growth, parallel processing, resource optimization, and backfill scenarios.
7. **Data Governance**: Include lineage tracking, metadata management, access controls, and PII handling procedures.
Format the output with clear headers, bullet points, and include a sample DAG or workflow definition in pseudocode where applicable.
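As a reference for the kind of "sample DAG or workflow definition in pseudocode" the prompt requests, here is a minimal sketch in plain Python with no orchestrator dependency. All names and the in-memory source/target are hypothetical stand-ins; the sketch only illustrates three of the patterns the prompt asks the model to design: watermark-based incremental extraction, a retry wrapper, and a dead-letter queue for rows that fail validation.

```python
# Hypothetical in-memory stand-ins for a source table and a warehouse table.
SOURCE_ROWS = [
    {"id": 1, "email": "A@EXAMPLE.COM", "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "email": "b@example.com", "updated_at": "2024-01-02T00:00:00"},
    {"id": 2, "email": "b@example.com", "updated_at": "2024-01-02T00:00:00"},  # duplicate
]
WAREHOUSE = {}    # keyed by primary key, so loading is an upsert
DEAD_LETTER = []  # rows that fail validation are quarantined here
WATERMARK = {"updated_at": "2023-12-31T00:00:00"}  # high-water mark of the last load

def extract_incremental():
    """Pull only rows changed since the stored watermark (incremental load)."""
    return [r for r in SOURCE_ROWS if r["updated_at"] > WATERMARK["updated_at"]]

def transform(rows):
    """Cleanse (normalize email), deduplicate on primary key, validate."""
    clean, seen = [], set()
    for row in rows:
        if row["id"] in seen:
            continue  # deduplicate on primary key
        seen.add(row["id"])
        if "@" not in row.get("email", ""):
            DEAD_LETTER.append(row)  # quarantine bad rows instead of failing the run
            continue
        clean.append({**row, "email": row["email"].lower()})
    return clean

def load_upsert(rows):
    """Upsert into the warehouse, then advance the watermark."""
    for row in rows:
        WAREHOUSE[row["id"]] = row
    if rows:
        WATERMARK["updated_at"] = max(r["updated_at"] for r in rows)

def run_with_retry(task, *args, attempts=3):
    """Naive retry wrapper standing in for an orchestrator's retry policy."""
    for attempt in range(1, attempts + 1):
        try:
            return task(*args)
        except Exception:
            if attempt == attempts:
                raise

# The "DAG": an ordered chain of dependent tasks.
rows = run_with_retry(extract_incremental)
rows = run_with_retry(transform, rows)
run_with_retry(load_upsert, rows)

print(len(WAREHOUSE), WATERMARK["updated_at"])  # 2 2024-01-02T00:00:00
```

In a real design the same task chain would be expressed as an Airflow DAG or dbt model graph, with the watermark persisted in a metadata store rather than a module-level dict.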

💡 Tips for Better Results

- Be as specific as possible about your source systems and data formats to get tailored extraction strategies rather than generic advice.
- Include any existing infrastructure constraints (e.g., cloud provider, budget limits, team skill set) so the design is realistic and implementable.
- Mention compliance requirements like GDPR or HIPAA upfront so PII handling and governance are baked into the design from the start.

🎯 Use Cases

Data engineers and architects should use this when planning a new ETL pipeline or modernizing a legacy data integration workflow to ensure comprehensive coverage of all critical design considerations.

🔗 Related Prompts

📊 Data & Analytics · intermediate

Write Complex SQL Queries

Generate optimized SQL queries for complex analysis with CTEs, JOINs, and performance tips.

📊 Data & Analytics · intermediate

Python Data Analysis Script

Generate a complete Python data analysis pipeline with cleaning, visualization, and insights.

📊 Data & Analytics · intermediate

Create a Comprehensive Data Quality Checklist for Your Dataset

Generate a tailored data quality checklist with SQL validation queries, severity levels, and a scoring framework for any dataset.

📊 Data & Analytics · advanced

Analyze and Interpret A/B Test Results with Statistical Rigor

Get a complete A/B test analysis with statistical significance, power analysis, validity checks, and a clear ship decision.

📊 Data & Analytics · intermediate

Analyze A/B Test Results and Generate Statistical Recommendations

Get a complete A/B test analysis with statistical significance, power analysis, sanity checks, and ship/no-ship recommendations.

📊 Data & Analytics · intermediate

Build an RFM Customer Segmentation Model for Targeted Marketing

Create a complete RFM customer segmentation model with scoring logic, code implementation, and marketing strategies.