Design a Scalable ETL Pipeline Architecture

Design a robust, scalable ETL pipeline architecture covering extraction, transformation, and loading strategies, plus error handling.

๐Ÿ“ The Prompt

You are a senior data engineer specializing in ETL (Extract, Transform, Load) pipeline design. Help me design a comprehensive ETL pipeline for the following scenario:

**Data Context:**

- Source systems: [LIST_SOURCE_SYSTEMS, e.g., PostgreSQL, REST APIs, CSV files]
- Target data warehouse/lake: [TARGET_PLATFORM, e.g., Snowflake, BigQuery, Redshift]
- Estimated data volume: [DATA_VOLUME, e.g., 5 million rows/day]
- Update frequency: [FREQUENCY, e.g., hourly, daily, real-time]
- Primary business use case: [USE_CASE, e.g., customer analytics, financial reporting]

**Please provide the following in your design:**

1. **Architecture Overview:** Describe the end-to-end pipeline architecture, including the extraction strategy (full vs. incremental), a recommended orchestration tool (e.g., Airflow, Prefect, dbt), and a description of the data flow diagram.
2. **Extraction Layer:** Detail how data should be extracted from each source system, including connection methods, change data capture (CDC) strategies, and handling of API rate limits or file polling.
3. **Transformation Layer:** Outline the transformation steps, including data cleansing rules, deduplication logic, schema mapping, data type standardization, and any business-logic transformations. Specify whether transformations should happen in the target warehouse after loading (ELT) or in a staging area before loading (ETL).
4. **Loading Strategy:** Recommend the loading pattern (upsert, append, overwrite), partitioning strategy, and indexing recommendations for the target platform.
5. **Error Handling & Monitoring:** Define retry logic, dead-letter queue design, data quality checks (row counts, null checks, referential integrity), alerting mechanisms, and logging standards.
6. **Scalability & Performance:** Provide recommendations for parallelism, batching strategies, and memory optimization, and explain how the pipeline should handle 10x data growth.
7. **Sample Code Skeleton:** Provide a Python or SQL pseudo-code skeleton for the most complex transformation step.

Format the output with clear headings and bullet points, and include a summary table of the recommended tools and technologies.
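
To make the extraction-layer deliverable concrete, here is a minimal sketch of watermark-based incremental extraction. It assumes a PostgreSQL source read with psycopg2 and an `updated_at` column on the source table; both are illustrative stand-ins, and log-based CDC (e.g., logical replication) would replace this polling approach for true change capture.

```python
from datetime import datetime

import psycopg2  # assumed driver for the hypothetical PostgreSQL source


def extract_increment(conn, table: str, watermark: datetime,
                      batch_size: int = 10_000):
    """Yield batches of rows changed since the last successful run.

    A server-side (named) cursor streams results, so memory use stays
    flat even if daily volume grows 10x.
    """
    with conn.cursor(name="etl_extract") as cur:
        cur.itersize = batch_size
        # `table` must come from a trusted allow-list; only the watermark
        # value is passed as a bound query parameter.
        cur.execute(
            f"SELECT * FROM {table} WHERE updated_at > %s ORDER BY updated_at",
            (watermark,),
        )
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            yield rows


# Usage (the DSN is a placeholder):
# conn = psycopg2.connect("postgresql://etl_user@source-db/app")
# for batch in extract_increment(conn, "orders", last_watermark):
#     stage(batch)
```

The new watermark (the maximum `updated_at` seen) should be persisted only after the target load commits, so a failed run simply replays the same window on retry.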

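For the transformation and loading deliverables, here is a sketch of the dedupe-then-upsert pattern that the prompt's "Sample Code Skeleton" item tends to produce. The business key (`customer_id`), ordering column (`updated_at`), and table names are hypothetical placeholders, and MERGE syntax varies by warehouse dialect.

```python
import pandas as pd


def deduplicate_latest(df: pd.DataFrame, key: str = "customer_id",
                       order_col: str = "updated_at") -> pd.DataFrame:
    """Keep only the most recent row per business key before loading."""
    return (
        df.sort_values(order_col)
          .drop_duplicates(subset=[key], keep="last")
          .reset_index(drop=True)
    )


# Load step: stage the cleaned batch, then MERGE so reruns are idempotent.
# Column names are illustrative; the extra MATCHED guard keeps a
# late-arriving old record from overwriting a newer warehouse row.
MERGE_SQL = """
MERGE INTO dim_customers AS tgt
USING stg_customers AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN
  UPDATE SET name = src.name, email = src.email, updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, updated_at)
  VALUES (src.customer_id, src.name, src.email, src.updated_at);
"""
```
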
💡 Tips for Better Results

- Be specific about your source systems and data formats to get more tailored extraction strategies.
- Include any compliance requirements (GDPR, HIPAA), as they significantly affect pipeline design.
- Mention your team's tech stack familiarity so recommendations align with existing skills.

🎯 Use Cases

Data engineers and architects designing new data pipelines or modernizing legacy ETL processes for scalable analytics infrastructure.

🔗 Related Prompts

- **Write Complex SQL Queries** (📊 Data & Analytics · intermediate): Generate optimized SQL queries for complex analysis with CTEs, JOINs, and performance tips.
- **Python Data Analysis Script** (📊 Data & Analytics · intermediate): Generate a complete Python data analysis pipeline with cleaning, visualization, and insights.
- **Build an RFM Customer Segmentation Model for Targeted Marketing** (📊 Data & Analytics · intermediate): Create a complete RFM customer segmentation model with scoring logic, code implementation, and marketing strategies.
- **Design a Robust ETL Pipeline Architecture for Your Data Platform** (📊 Data & Analytics · advanced): Design a complete ETL pipeline architecture with extraction, transformation, loading strategies, error handling, and governance.
- **Create a Comprehensive Data Quality Checklist for Your Dataset** (📊 Data & Analytics · intermediate): Generate a tailored data quality checklist with SQL validation queries, severity levels, and a scoring framework for any dataset.
- **Analyze and Interpret A/B Test Results with Statistical Rigor** (📊 Data & Analytics · advanced): Get a complete A/B test analysis with statistical significance, power analysis, validity checks, and a clear ship decision.