04 · ETL & ELT: Pipelines, Patterns & Trade-Offs

Level: Intermediate to Advanced · Time to read: 25 min · Pre-reading: 01 · Foundations, 03 · Medallion
After reading: You'll understand the difference between ETL and ELT, patterns for idempotency and late-arriving data, and how to choose between tools.


The Fundamental Question: When to Transform?

ETL: Extract → Transform → Load

graph LR
    A["Operational<br/>System"]
    B["Extract<br/>(Full + Incremental)"]
    C["Transform<br/>(Central Server)"]
    D["Load<br/>(into Warehouse)"]
    E["Warehouse<br/>(Clean Data)"]

    A -->|"1. Extract"| B
    B -->|"2. Transform<br/>Normalize, Aggregate,<br/>Apply Business Logic"| C
    C -->|"3. Load"| D
    D --> E

Characteristics:

  • ✅ Transformation happens outside the warehouse (separate server/Airflow task)
  • ✅ Only clean data is loaded (smaller storage footprint)
  • ❌ Slower (all transformation before loading)
  • ❌ Tight coupling: changing logic requires redeployment
  • ✅ Better for small to medium data volumes

ELT: Extract → Load → Transform

graph LR
    A["Operational<br/>System"]
    B["Extract<br/>(Full + Incremental)"]
    C["Load<br/>(Raw into Warehouse)"]
    D["Transform<br/>(SQL/dbt in Warehouse)"]
    E["Warehouse<br/>(All Data + Clean)"]

    A -->|"1. Extract"| B
    B -->|"2. Load Raw<br/>No transformation"| C
    C -->|"3. Transform<br/>Dimensional models,<br/>Business logic"| D
    D --> E

Characteristics:

  • ✅ Load fast (raw data straight into warehouse)
  • ✅ Transformation inside warehouse (separate layer from raw)
  • ✅ Decoupled: changing SQL doesn't require redeployment
  • ✅ Better for big data volumes
  • ✅ Aligns with medallion architecture (Bronze → Silver → Gold)
  • ❌ Larger storage footprint (raw data kept)

ETL vs. ELT: Head-to-Head Comparison

| Dimension | ETL | ELT |
|---|---|---|
| Transformation location | External (server, Airflow) | In warehouse (SQL) |
| Load volume | Small (clean data only) | Large (raw + clean) |
| Speed | ❌ Slow (transform before load) | ✅ Fast (load immediately) |
| Data coupling | ⚠️ Tight (logic lives outside the warehouse) | ✅ Loose (logic in the warehouse) |
| Scalability | ❌ Limited (external bottleneck) | ✅ Scales with warehouse compute |
| Cost | ⚠️ Higher (separate compute servers) | ✅ Lower (leverages warehouse compute) |
| Storage | ✅ Minimal (clean only) | ⚠️ Higher (raw data kept) |
| Debugging | ❌ Hard (transforms run externally) | ✅ Easy (data is in the warehouse) |
| Best for | Small data, strict schemas | Big data, schema evolution |
| Tools | Informatica, SSIS, Talend | dbt, Dataflow, Spark SQL |

When to Use Each

Use ETL If:

  • Data volume < 1 GB/day
  • Full transformations needed before loading (join 5+ sources, complex business logic)
  • Warehouse doesn't support the transformation language (legacy systems)
  • Cost of raw storage > cost of external compute

Examples:

  • Legacy financial systems
  • Small data marts
  • Highly normalized source schemas requiring major denormalization

Use ELT If:

  • Data volume > 1 GB/day (or rapidly growing)
  • Schema evolves (want to preserve raw for new requirements)
  • Warehouse excels at SQL (BigQuery, Snowflake, Databricks)
  • You want data debugging capability

Examples:

  • Modern SaaS analytics
  • Any BigQuery/Snowflake setup
  • Data lakes with medallion
  • Most current implementations

Pipeline Design Patterns

Pattern 1: Batch ETL/ELT (Most Common)

graph LR
    A["Operational<br/>DB"]
    B["Airflow DAG<br/>Daily at 2 AM"]
    C["Extract<br/>Incremental<br/>Last 24h"]
    D["Transform<br/>or Load-Transform"]
    E["Warehouse"]

    B -->|"Trigger"| C
    C -->|"Read new data"| A
    C --> D
    D -->|"Write"| E

Execution:

  • Daily/hourly schedule
  • Extract incremental changes (since last run)
  • Transform and load into warehouse
  • Entire pipeline runs atomically

Used by: Most companies, simple and reliable
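
A minimal Airflow sketch of this daily batch pattern follows. The DAG, schedule, and operator are standard Airflow 2.x; extract_incremental and load_to_warehouse are hypothetical callables standing in for the actual extract and load logic.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_incremental(**context):
    """Hypothetical placeholder: pull the last 24h of changes from the source DB."""

def load_to_warehouse(**context):
    """Hypothetical placeholder: bulk-load the extracted batch into the warehouse."""

with DAG(
    dag_id="daily_customer_elt",
    schedule_interval="0 2 * * *",   # daily at 2 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,                   # don't automatically backfill missed runs
) as dag:
    extract = PythonOperator(task_id="extract_incremental", python_callable=extract_incremental)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)
    extract >> load                  # load runs only after the extract succeeds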

Pattern 2: Streaming (Event-Driven)

graph LR
    A["Operational<br/>System"]
    B["Event Stream<br/>Kafka, Pub/Sub"]
    C["Streaming<br/>Consumer<br/>Flink, Spark"]
    D["Cloud Storage<br/>(Staging)"]
    E["Warehouse<br/>(Micro-batches)"]

    A -->|"Emit events<br/>in real-time"| B
    B --> C
    C -->|"Buffer 5-min<br/>windows"| D
    D -->|"Bulk insert<br/>every 5 min"| E

Characteristics:

  • Near real-time (latency: seconds to minutes)
  • Requires event-driven architecture at source
  • Complex to operate (debugging streaming is hard)

Used by: Ad-tech, finance, real-time dashboards
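
As a rough illustration of the 5-minute buffering step, here is a sketch of the consumer loop; poll_events, write_to_staging, and load_staged_files are hypothetical helpers standing in for the stream client (Kafka/Pub/Sub), the cloud-storage writer, and the warehouse bulk insert.

import time

WINDOW_SECONDS = 5 * 60   # 5-minute micro-batch window

def run_microbatch_loop():
    buffer, window_start = [], time.monotonic()
    while True:
        buffer.extend(poll_events(timeout_s=1))        # hypothetical: pull a batch of events
        if time.monotonic() - window_start >= WINDOW_SECONDS:
            if buffer:
                uri = write_to_staging(buffer)         # hypothetical: one staging file per window
                load_staged_files([uri])               # hypothetical: bulk insert into the warehouse
            buffer, window_start = [], time.monotonic()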

Pattern 3: Hybrid (Batch + Streaming)

Real-time events: Stream → Staging table (1-min micro-batches)
Batch corrections: Daily batch → Reconciliation + SCD Type 2

Used by: Large financial institutions, e-commerce

Pattern 4: CDC (Change Data Capture)

graph LR
    A["Operational<br/>DB"]
    B["CDC Tool<br/>Debezium,<br/>GCP Datastream"]
    C["Event Stream<br/>Kafka,<br/>Change Log"]
    D["Warehouse<br/>Apply Changes<br/>Upsert/Delete"]

    A -->|"Transaction Logs"| B
    B -->|"Parse deltas"| C
    C -->|"Apply in order"| D

Advantages:

  • ✅ Captures deletes (timestamp-based batch extraction cannot see them)
  • ✅ Exactly-once semantics possible
  • ✅ Minimal impact on source system

Used by: Large-scale data pipelines (Shopify, Stripe)
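
A sketch of how the changes might be applied on the warehouse side, assuming a simplified, Debezium-like event shape (an op of c/u/d, the row's key, the new row image in after, and an assumed log-sequence field lsn for ordering) plus hypothetical upsert/delete helpers:

def apply_change(event):
    """Apply one CDC event; 'c'/'u' carry the new row image, 'd' only the key."""
    if event["op"] in ("c", "u"):                       # create or update -> upsert the row
        upsert("dim_customer", key=event["key"], row=event["after"])
    elif event["op"] == "d":                            # delete -> remove (or soft-delete) the row
        delete("dim_customer", key=event["key"])

def apply_batch(events):
    # Changes must be applied in commit order to stay consistent with the source
    for event in sorted(events, key=lambda e: e["lsn"]):
        apply_change(event)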


Idempotency: The Foundation of Reliable Pipelines

The Problem: Retrying Without Duplicates

Run 1: Extract 1000 rows, Load to warehouse ✓
Run 2: Network fails after load, before commit
Run 3: Retry load...
  → Naive pipeline: loads the 1000 rows again (duplicates!)
  → Idempotent pipeline: detects the data is already loaded and skips it

Three Approaches to Idempotency

Approach 1: External State Tracking

# Airflow task: track the last extracted timestamp in external state
# (get_state / set_state are placeholders, e.g. an Airflow Variable or a metadata table)
last_run_timestamp = get_state('customer_last_extract')

rows = run_query(                            # placeholder for the source-DB client
    f"SELECT * FROM customers WHERE updated_at > '{last_run_timestamp}'"
)
load_to_warehouse(rows)                      # placeholder for the warehouse bulk load

# Only advance the watermark after a successful load
set_state('customer_last_extract', current_timestamp())

# If the load fails and we retry:
#   - the timestamp hasn't advanced, so re-extraction returns the same data
#   - deduplication in the warehouse handles any duplicates

Pros: Simple, works with any warehouse
Cons: Requires external state storage (not reliable for very large data)

Approach 2: Database Transactions

BEGIN TRANSACTION;
  DELETE FROM silver.customers WHERE date_loaded = TODAY();
  INSERT INTO silver.customers 
    SELECT * FROM bronze.customers
    WHERE date_loaded = TODAY();
COMMIT;

If transaction fails halfway:

  • Rollback restores clean state
  • Retry gets same result

Pros: Simple, reliable
Cons: Doesn't work for distributed transforms (Spark)
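
For distributed transforms, the analogous trick is an atomic, scoped overwrite instead of delete-then-insert. A minimal sketch with Spark on Delta Lake, assuming a Delta table and an already-transformed DataFrame df (the date literal would normally be parameterized from the run date):

(
    df.write.format("delta")
      .mode("overwrite")
      .option("replaceWhere", "date_loaded = '2024-01-15'")   # atomically overwrite only this slice
      .saveAsTable("silver.customers")                        # a retry rewrites the same rows, no duplicates
)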

Approach 3: Idempotent Merge (SCD Type 2)

-- Merge is idempotent: running twice = running once
MERGE INTO gold.dim_customer t
USING silver.customers s
  ON t.customer_id = s.customer_id 
   AND t.is_current = TRUE
WHEN MATCHED AND (t.email != s.email) THEN
  UPDATE SET 
    is_current = FALSE,
    end_date = TODAY()
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, ..., is_current)
  VALUES (s.customer_id, s.email, ..., TRUE);

Run it repeatedly with the same source data:

  • First run: inserts brand-new customers and expires the current row for any customer whose email changed
  • Next runs: rows that already match are left alone, expired customers get their new current version inserted, and no duplicates are ever created

Pros: Safe to retry, handles concurrent updates, very robust
Cons: Requires a warehouse that supports MERGE


Late-Arriving Data: Handling Out-of-Order Events

The Problem

Fact table loaded at 2 AM:
  Sale ID 12345: Date 2024-01-15, Amount 100

At 6 AM, source system correction arrives:
  Sale ID 12345: Actually Date 2024-01-10 (it landed in the wrong day's batch)

Problem: Daily aggregations for 2024-01-10 are already published!

Solution 1: Fact Table Reruns

-- Nightly 2 AM: Load daily fact
INSERT INTO gold.fact_sales
SELECT * FROM silver.sales 
WHERE sale_date = CURRENT_DATE() - 1;

-- Late-arriving: upsert a rolling window so old dates get corrected
MERGE INTO gold.fact_sales t
USING (
  SELECT * FROM silver.sales 
  WHERE sale_date BETWEEN CURRENT_DATE() - 7 AND CURRENT_DATE()
) s
  ON t.sale_key = s.sale_key
WHEN MATCHED THEN UPDATE SET *    -- Databricks/Delta-style shorthand
WHEN NOT MATCHED THEN INSERT *;

Approach:

  • Reload N days of history (typically 3-7 days)
  • MERGE upsert handles late-arriving corrections
  • Any dependent reports re-run for those days

Solution 2: Fact Correction Tables

-- fact_sales: Original data
CREATE TABLE gold.fact_sales (...);

-- fact_sales_corrections: Override table for late changes
CREATE TABLE gold.fact_sales_corrections (
  sale_key STRING,
  new_amount DECIMAL,
  new_date DATE,
  reason STRING,
  effective_date DATE
);

-- Query layer: Union & apply corrections
SELECT 
  f.*,
  COALESCE(c.new_amount, f.amount) as amount_corrected,
  COALESCE(c.new_date, f.date) as date_corrected
FROM gold.fact_sales f
LEFT JOIN gold.fact_sales_corrections c 
  ON f.sale_key = c.sale_key 
  AND c.effective_date <= CURRENT_DATE();

Approach:

  • Keep original fact as-is
  • Corrections in separate table
  • Application layer (BI tool) applies corrections at query time

Solution 3: Event Time vs. Processing Time

-- Store both timestamps
CREATE TABLE silver.sales (
  event_time TIMESTAMP,       -- When the sale happened
  processing_time TIMESTAMP   -- When we received it (enables late-arriving analysis)
) CLUSTER BY event_time;

-- Late-arriving: Data with old event_time, new processing_time
SELECT 
  DATE(event_time) as sale_date,
  COUNT(*) as count_by_event_time,
  COUNT(CASE WHEN DATE(processing_time) > DATE_ADD(DATE(event_time), INTERVAL 1 DAY)
    THEN 1 END) as late_arriving_count
FROM silver.sales
GROUP BY 1;

Data Quality in Pipelines

Quality Gates at Each Stage

Bronze (Raw) → Quality Check → Silver (Clean) → Validation → Gold (Ready)
                    ↓                              ↓              ↓
            - Duplicates?        - Nulls expected?   - FK valid?
            - Schema mismatch?   - Out of range?     - Aggregates match?
            - Malformed dates?   - Format correct?   - No orphans?

Example: Quality Validation Framework (dbt tests)

# dbt: Declarative quality checks
version: 2

models:
  - name: silver_customers
    columns:
      - name: customer_id
        tests:
          - unique         # No duplicates
          - not_null       # Always present

      - name: email
        tests:
          - unique
          - not_null
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '^.+@.+\..+$'  # Valid email format

      - name: age
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 150
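
These declarative checks run with dbt test (or as part of dbt build, where a failing test skips downstream models), turning the quality gates above into automated pass/fail steps in the pipeline.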

Deep-Dives

Deep Dive: ETL Patterns — Slowly Changing Dimensions (SCD), handling duplicates, maintaining history.

Deep Dive: Data Pipelines — Orchestration tools (Airflow, dbt), monitoring, alerting, debugging failed runs.


Key Takeaways

ELT (modern) is preferred for big data; ETL (legacy) for controlled transformations.

Idempotency ensures retries produce the same result without duplicates.

Late-arriving data requires either rerun windows, correction tables, or event-time tracking.

Quality gates at each medallion layer prevent garbage-in-garbage-out.

Streaming is faster but operationally complex; batch is simpler and sufficient for most use cases.


Practice Questions

  1. When would you choose ETL over ELT?
  2. How does MERGE statement provide idempotency?
  3. What's the difference between event time and processing time?
  4. How would you handle a fact arriving 10 days late in a daily pipeline?
  5. Design a data quality test for a slowly changing dimension.