04 · ETL & ELT: Pipelines, Patterns & Trade-Offs
Level: Intermediate to Advanced · Time to read: 25 min · Pre-reading: 01 · Foundations, 03 · Medallion
After reading: You'll understand the difference between ETL and ELT, patterns for idempotency and late-arriving data, and how to choose between tools.
The Fundamental Question: When to Transform?
ETL: Extract → Transform → Load
graph LR
A["Operational<br/>System"]
B["Extract<br/>(Full + Incremental)"]
C["Transform<br/>(Central Server)"]
D["Load<br/>(into Warehouse)"]
E["Warehouse<br/>(Clean Data)"]
A -->|"1. Extract"| B
B -->|"2. Transform<br/>Normalize, Aggregate,<br/>Apply Business Logic"| C
C -->|"3. Load"| D
D --> E
Characteristics:
- ✅ Transformation happens outside the warehouse (separate server/Airflow task)
- ✅ Only clean data is loaded (smaller storage footprint)
- ❌ Slower (all transformation before loading)
- ❌ Tight coupling: changing logic requires redeployment
- ✅ Better for small to medium data volumes
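To make the flow concrete, here is a minimal ETL sketch in Python (pandas + SQLAlchemy). The connection strings, table names, and business logic are illustrative assumptions, not a reference implementation; the point is that all transformation happens on the ETL server before anything touches the warehouse.
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://user:pass@source-db/sales")         # hypothetical source system
warehouse = create_engine("postgresql://user:pass@warehouse/analytics")  # hypothetical warehouse

def etl_daily_orders(run_date: str):
    # 1. Extract: incremental pull of one day's changes from the operational system
    orders = pd.read_sql(
        text("SELECT * FROM orders WHERE CAST(updated_at AS DATE) = :d"),
        source, params={"d": run_date},
    )
    # 2. Transform: business logic runs on the ETL server, outside the warehouse
    orders["net_amount"] = orders["gross_amount"] - orders["discount"]
    daily = orders.groupby("customer_id", as_index=False)["net_amount"].sum()
    # 3. Load: only the cleaned, aggregated result is written to the warehouse
    daily.to_sql("fact_daily_orders", warehouse, if_exists="append", index=False)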
ELT: Extract → Load → Transform
graph LR
A["Operational<br/>System"]
B["Extract<br/>(Full + Incremental)"]
C["Load<br/>(Raw into Warehouse)"]
D["Transform<br/>(SQL/dbt in Warehouse)"]
E["Warehouse<br/>(All Data + Clean)"]
A -->|"1. Extract"| B
B -->|"2. Load Raw<br/>No transformation"| C
C -->|"3. Transform<br/>Dimensional models,<br/>Business logic"| D
D --> E
Characteristics:
- ✅ Load fast (raw data straight into warehouse)
- ✅ Transformation inside warehouse (separate layer from raw)
- ✅ Decoupled: changing SQL doesn't require redeployment
- ✅ Better for big data volumes
- ✅ Aligns with medallion architecture (Bronze → Silver → Gold)
- ❌ Larger storage footprint (raw data kept)
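For contrast, the same job as ELT, sketched with the google-cloud-bigquery client (project, dataset, table names, and the raw_rows payload are hypothetical). Raw data is loaded untouched into Bronze, and the transformation is a SQL statement the warehouse itself executes.
from google.cloud import bigquery

client = bigquery.Client()
raw_rows = [{"order_id": 1, "customer_id": 42, "order_ts": "2024-01-15T10:00:00",
             "gross_amount": 120.0, "discount": 20.0}]   # hypothetical extract

# 1 + 2. Extract and load: ship raw rows straight into a bronze table, no transformation
load_job = client.load_table_from_json(
    raw_rows, "my-project.bronze.orders",
    job_config=bigquery.LoadJobConfig(autodetect=True, write_disposition="WRITE_APPEND"),
)
load_job.result()

# 3. Transform: business logic lives in SQL inside the warehouse
client.query("""
    CREATE OR REPLACE TABLE `my-project.silver.daily_orders` AS
    SELECT customer_id, DATE(order_ts) AS order_date,
           SUM(gross_amount - discount) AS net_amount
    FROM `my-project.bronze.orders`
    GROUP BY customer_id, order_date
""").result()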
ETL vs. ELT: Head-to-Head Comparison
| Dimension | ETL | ELT |
|---|---|---|
| Transformation location | External (server, Airflow) | In warehouse (SQL) |
| Load volume | Small (clean data only) | Large (raw + clean) |
| Speed | ❌ Slow (transform before load) | ✅ Fast (load immediately) |
| Data coupling | ⚠️ Tight (logic outside) | ✅ Loose (logic in warehouse) |
| Scalability | ❌ Limited (bottleneck outside) | ✅ High (scales with the warehouse) |
| Cost | ⚠️ Higher compute (servers) | ✅ Lower (leverage warehouse) |
| Storage | ✅ Minimal (clean only) | ⚠️ High (keep raw) |
| Debugging | ❌ Hard (transform external) | ✅ Easy (data in warehouse) |
| Best for | Small data, strict schemas | Big data, schema evolution |
| Tools | Informatica, SSIS, Talend | dbt, Dataflow, Spark SQL |
When to Use Each
Use ETL If:
- Data volume < 1 GB/day
- Full transformations needed before loading (join 5+ sources, complex business logic)
- Warehouse doesn't support the transformation language (legacy systems)
- Cost of raw storage > cost of external compute
Examples:
- Legacy financial systems
- Small data marts
- Highly normalized source schemas requiring major denormalization
Use ELT If:
- Data volume > 1 GB/day (or rapidly growing)
- Schema evolves (want to preserve raw for new requirements)
- Warehouse excels at SQL (BigQuery, Snowflake, Databricks)
- You want data debugging capability
Examples:
- Modern SaaS analytics
- Any BigQuery/Snowflake setup
- Data lakes with medallion
- Most current implementations
Pipeline Design Patterns
Pattern 1: Batch ETL/ELT (Most Common)
graph LR
A["Operational<br/>DB"]
B["Airflow DAG<br/>Daily at 2 AM"]
C["Extract<br/>Incremental<br/>Last 24h"]
D["Transform<br/>or Load-Transform"]
E["Warehouse"]
B -->|"Trigger"| C
C -->|"Read new data"| A
C --> D
D -->|"Write"| E
Execution:
- Daily/hourly schedule
- Extract incremental changes (since last run)
- Transform and load into warehouse
- Entire pipeline runs atomically
Used by: Most companies, simple and reliable
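A minimal Airflow 2.x TaskFlow sketch of this pattern, where extract_incremental and load_to_warehouse are hypothetical helpers; each run only touches its own 24-hour window, which is what keeps retries and backfills predictable.
import pendulum
from airflow.decorators import dag, task

def extract_incremental(table, start, end):   # hypothetical: query source for rows changed in [start, end)
    ...

def load_to_warehouse(table, rows):           # hypothetical: bulk-load rows into the warehouse
    ...

@dag(schedule="0 2 * * *", start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def daily_sales_pipeline():
    @task
    def extract(data_interval_start=None, data_interval_end=None):
        # Airflow injects this run's 24h window; extract only what changed inside it
        return extract_incremental("sales", data_interval_start, data_interval_end)

    @task
    def load(rows):
        load_to_warehouse("bronze.sales", rows)

    load(extract())

daily_sales_pipeline()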
Pattern 2: Streaming (Event-Driven)
graph LR
A["Operational<br/>System"]
B["Event Stream<br/>Kafka, Pub/Sub"]
C["Streaming<br/>Consumer<br/>Flink, Spark"]
D["Cloud Storage<br/>(Staging)"]
E["Warehouse<br/>(Micro-batches)"]
A -->|"Emit events<br/>in real-time"| B
B --> C
C -->|"Buffer 5-min<br/>windows"| D
D -->|"Bulk insert<br/>every 5 min"| E
Characteristics:
- Near real-time (latency: seconds to minutes)
- Requires event-driven architecture at source
- Complex to operate (debugging streaming is hard)
Used by: Ad-tech, finance, real-time dashboards
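A library-agnostic sketch of the micro-batching step: poll_events and bulk_insert are hypothetical stand-ins for a Kafka/Pub/Sub consumer and the staging load. The point is the 5-minute buffering window, not any specific client API.
import time

WINDOW_SECONDS = 300  # flush to staging every 5 minutes

def run_microbatch_loop(poll_events, bulk_insert):
    """Buffer streaming events and write them to staging in bulk, once per window."""
    buffer, window_start = [], time.monotonic()
    while True:
        buffer.extend(poll_events(timeout=1.0))          # hypothetical consumer poll
        if time.monotonic() - window_start >= WINDOW_SECONDS:
            if buffer:
                bulk_insert("staging.events", buffer)    # one bulk load per window
            buffer, window_start = [], time.monotonic()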
Pattern 3: Hybrid (Batch + Streaming)
Real-time events: Stream → Staging table (1-min micro-batches)
Batch corrections: Daily batch → Reconciliation + SCD Type 2
Used by: Large financial institutions, e-commerce
Pattern 4: CDC (Change Data Capture)
graph LR
A["Operational<br/>DB"]
B["CDC Tool<br/>Debezium,<br/>GCP Datastream"]
C["Event Stream<br/>Kafka,<br/>Change Log"]
D["Warehouse<br/>Apply Changes<br/>Upsert/Delete"]
A -->|"Transaction Logs"| B
B -->|"Parse deltas"| C
C -->|"Apply in order"| D
Advantages:
- ✅ Captures deletes (timestamp-based batch extraction cannot)
- ✅ Exactly-once semantics possible
- ✅ Minimal impact on source system
Used by: Large-scale data pipelines (Shopify, Stripe)
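A sketch of the "apply in order" step for Debezium-style change events, where op is c/u/d/r and before/after carry the row images; upsert and delete are hypothetical helpers that issue MERGE/DELETE statements against the warehouse.
def apply_cdc_events(events, upsert, delete):
    """Replay a CDC change log into the warehouse, preserving commit order."""
    for ev in sorted(events, key=lambda e: e["ts_ms"]):   # apply in source commit order
        if ev["op"] in ("c", "u", "r"):                   # create / update / snapshot read
            upsert("silver.customers", ev["after"])       # hypothetical MERGE helper
        elif ev["op"] == "d":                             # deletes captured from the transaction log
            delete("silver.customers", ev["before"]["customer_id"])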
Idempotency: The Foundation of Reliable Pipelines
The Problem: Retrying Without Duplicates
Run 1: Extract 1000 rows, load to warehouse ✓
Run 2: Network fails after the rows are written but before success is recorded
Run 3: Retry the load...
→ Naive pipeline: loads the same 1000 rows again (duplicates!)
→ Idempotent pipeline: re-running produces the same end state; already-loaded rows are detected or overwritten, never duplicated
Three Approaches to Idempotency
Approach 1: External State Tracking
# Airflow task: track the last extracted timestamp as external state
# (run_query, load_to_warehouse, current_timestamp are hypothetical helpers;
#  get_state / set_state could be backed by Airflow Variables or a state table)
last_ts = get_state('customer_last_extract')
rows = run_query("SELECT * FROM customers WHERE updated_at > %(ts)s", {"ts": last_ts})
load_to_warehouse('bronze.customers', rows)
set_state('customer_last_extract', current_timestamp())

# If the load fails before set_state runs, the retry re-extracts the same window;
# deduplication (or an overwrite) in the warehouse absorbs the repeated rows.
Pros: Simple, works with any warehouse
Cons: Requires external state storage (not reliable for very large data)
Approach 2: Database Transactions
BEGIN TRANSACTION;
DELETE FROM silver.customers WHERE date_loaded = CURRENT_DATE();
INSERT INTO silver.customers
SELECT * FROM bronze.customers
WHERE date_loaded = CURRENT_DATE();
COMMIT;
If the transaction fails halfway:
- Rollback restores clean state
- Retry gets same result
Pros: Simple, reliable
Cons: Doesn't work for distributed transforms (Spark)
Approach 3: Idempotent Merge (SCD Type 2)
-- MERGE re-evaluates its conditions on every run, so retries converge to the same state
MERGE INTO gold.dim_customer t
USING silver.customers s
ON t.customer_id = s.customer_id
AND t.is_current = TRUE
WHEN MATCHED AND (t.email != s.email) THEN
UPDATE SET
is_current = FALSE,
end_date = CURRENT_DATE()
WHEN NOT MATCHED THEN
INSERT (customer_id, email, ..., is_current)
VALUES (s.customer_id, s.email, ..., TRUE);
Run it twice against the same source:
- First run: closes out current rows whose tracked attributes changed and inserts brand-new customers
- Second run: re-evaluates the same conditions against the updated target, so repeated runs converge to the same state instead of piling up duplicate current rows (inserting the new version of a changed customer is a separate step in a full SCD Type 2 flow)
Pros: Handles concurrent updates, very robust
Cons: Requires a warehouse that supports MERGE
Late-Arriving Data: Handling Out-of-Order Events
The Problem
Fact table loaded at 2 AM:
Sale ID 12345: Date 2024-01-15, Amount 100
At 6 AM, source system correction arrives:
Sale ID 12345: Actually Date 2024-01-10 (was in wrong day batch)
Problem: Daily aggregations for 2024-01-10 are already published!
Solution 1: Fact Table Reruns
-- Nightly 2 AM: Load daily fact
INSERT INTO gold.fact_sales
SELECT * FROM silver.sales
WHERE sale_date = CURRENT_DATE() - 1;
-- Late-arriving: If date is old, delete & reload
MERGE INTO gold.fact_sales t
USING (
SELECT * FROM silver.sales
WHERE sale_date BETWEEN CURRENT_DATE() - 7 AND CURRENT_DATE()
) s
ON t.sale_key = s.sale_key
WHEN MATCHED THEN UPDATE SET *  -- Delta/Databricks shorthand; other dialects need an explicit column list
WHEN NOT MATCHED THEN INSERT *;
Approach:
- Reload N days of history (typically 3-7 days)
- MERGE upsert handles late-arriving corrections
- Any dependent reports re-run for those days
Solution 2: Fact Correction Tables
-- fact_sales: Original data
CREATE TABLE gold.fact_sales (...);
-- fact_sales_corrections: Override table for late changes
CREATE TABLE gold.fact_sales_corrections (
sale_key STRING,
new_amount DECIMAL,
new_date DATE,
reason STRING,
effective_date DATE
);
-- Query layer: Union & apply corrections
SELECT
f.*,
COALESCE(c.new_amount, f.amount) as amount_corrected,
COALESCE(c.new_date, f.date) as date_corrected
FROM gold.fact_sales f
LEFT JOIN gold.fact_sales_corrections c
ON f.sale_key = c.sale_key
AND c.effective_date <= CURRENT_DATE();
Approach:
- Keep original fact as-is
- Corrections in separate table
- Application layer (BI tool) applies corrections at query time
Solution 3: Event Time vs. Processing Time
-- Store both timestamps
CREATE TABLE silver.sales (
event_time TIMESTAMP, -- When sale happened
processing_time TIMESTAMP -- When we received it
-- Allows late-arriving analysis
) CLUSTER BY event_time;
-- Late-arriving: Data with old event_time, new processing_time
SELECT
DATE(event_time) as sale_date,
COUNT(*) as count_by_event_time,
COUNT(CASE WHEN DATE(processing_time) > DATE_ADD(DATE(event_time), INTERVAL 1 DAY)
THEN 1 END) as late_arriving_count
FROM silver.sales
GROUP BY 1;
Data Quality in Pipelines
Quality Gates at Each Stage
Bronze (Raw) → Quality Check → Silver (Clean) → Validation → Gold (Ready)
- Bronze (raw) checks: duplicates? schema mismatch? malformed dates?
- Silver (clean) checks: unexpected nulls? values out of range? formats correct?
- Gold (ready) checks: foreign keys valid? aggregates match? no orphans?
Example: Quality Validation Framework (dbt tests)
# dbt: declarative quality checks (the regex and range tests come from the dbt_expectations package)
version: 2
models:
  - name: silver_customers
    columns:
      - name: customer_id
        tests:
          - unique      # No duplicates
          - not_null    # Always present
      - name: email
        tests:
          - unique
          - not_null
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '^.+@.+\..+$'  # Valid email format
      - name: age
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 150
Deep-Dives
→ Deep Dive: ETL Patterns — Slowly Changing Dimensions (SCD), handling duplicates, maintaining history.
→ Deep Dive: Data Pipelines — Orchestration tools (Airflow, dbt), monitoring, alerting, debugging failed runs.
Key Takeaways
✅ ELT (modern) is preferred for big data; ETL (legacy) for controlled transformations.
✅ Idempotency ensures retries produce same result without duplicates.
✅ Late-arriving data requires either rerun windows, correction tables, or event-time tracking.
✅ Quality gates at each medallion layer prevent garbage-in-garbage-out.
✅ Streaming is faster but operationally complex; batch is simpler and sufficient for most use cases.
Practice Questions
- When would you choose ETL over ELT?
- How does a MERGE statement provide idempotency?
- What's the difference between event time and processing time?
- How would you handle a fact arriving 10 days late in a daily pipeline?
- Design a data quality test for a slowly changing dimension.