07 · Big Data Platforms: Distributed Computing, Orchestration, Scaling

Level: Intermediate to Advanced · Time to read: 20 min
After reading: You'll understand distributed processing (Hadoop, Spark), orchestration tools (Airflow, dbt), and how data warehouses scale.


The Scaling Problem

Single-server warehouse:
├─ PostgreSQL: ~500 GB, 1-5 second queries
└─ Problem: Struggles beyond a few terabytes, can't process petabytes
Single-server + external compute:
├─ Fact table: 100 GB (on warehouse)
├─ Processing: Apache Spark (100 servers)
└─ Architecture: Data in warehouse, compute scales separately

Distributed Computing: Spark vs. Hadoop

Apache Hadoop Ecosystem (Older, Still Used)

Components:

  • HDFS (Hadoop Distributed File System) — stores data across many nodes
  • MapReduce — batch processing framework (map phase → shuffle → reduce phase)
graph TB
    A["Node 1<br/>Partition A"]
    B["Node 2<br/>Partition B"]
    C["Node 3<br/>Partition C"]
    D["Shuffle &<br/>Sort"]
    E["Reduce<br/>Output"]

    A -->|Map| D
    B -->|Map| D
    C -->|Map| D
    D --> E
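The map → shuffle → reduce flow in the diagram can be sketched as a single-process word count in plain Python (an analogy only; real MapReduce runs each phase across many nodes):

```python
from collections import defaultdict

# Three "nodes", each holding one partition of the input
partitions = [
    ["spark", "hadoop", "spark"],
    ["hadoop", "hdfs"],
    ["spark", "hdfs", "hdfs"],
]

# Map phase: each node emits (key, 1) pairs independently
mapped = [[(word, 1) for word in part] for part in partitions]

# Shuffle & sort: group all pairs by key across nodes
shuffled = defaultdict(list)
for part in mapped:
    for key, value in part:
        shuffled[key].append(value)

# Reduce phase: aggregate each key's values
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'spark': 3, 'hadoop': 2, 'hdfs': 3}
```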

Cons:

  • ❌ Slow (intermediate results are written to disk between the map and reduce phases)
  • ❌ Primarily Java (other languages only via the Hadoop Streaming API)
  • ❌ Largely replaced by Spark for new workloads

Apache Spark (Modern, Preferred)

Advantages:

  • ✅ In-memory processing (up to 100x faster than Hadoop MapReduce for iterative workloads)
  • ✅ Lazy evaluation (optimizes queries, reduces redundant computation)
  • ✅ SQL interface (Spark SQL)
  • ✅ Supports Python, Scala, Java, R, SQL
# Spark: Distributed data processing
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warehouse").getOrCreate()

# Read from warehouse (requires the spark-bigquery connector)
df = spark.read \
    .format("bigquery") \
    .option("table", "project.dataset.fact_sales") \
    .load()

# Transform (lazy — doesn't execute yet)
result = df \
    .filter(df.sale_date >= "2024-01-01") \
    .groupBy("customer_id") \
    .agg(F.sum("amount").alias("total")) \
    .filter(F.col("total") > 1000)

# Execute (an action triggers the whole plan)
result.show()

When to Use Spark

| Scenario | Use Spark? |
|---|---|
| Read 10 GB, process, write 100 MB | ✅ Yes (efficient) |
| ML/Data Science (train model on 1 TB) | ✅ Yes |
| Streaming (Kafka → data warehouse) | ✅ Yes |
| SQL aggregation (use BigQuery directly) | ❌ No (overkill) |

Data Pipelines: Orchestration Tools

Workflow Orchestration: Managing Dependencies

ETL Pipeline:
   1. Extract from PostgreSQL (1 hour)
   2. Extract from Kafka logs (2 hours)
   3. Wait for both extracts to complete
   4. Merge & transform (1 hour)
   5. Load to Gold layer (30 min)

Problem: If a mid-pipeline step fails, do you rerun the entire pipeline or restart it by hand from the failed step?
Solution: An orchestration tool manages dependencies, retries, and failure recovery.

Apache Airflow (Python-Based Orchestration)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# extract_data, transform_data, load_to_gold are ordinary Python
# functions defined elsewhere in the project
with DAG(
    'daily_warehouse_refresh',
    start_date=datetime(2024, 1, 1),
    schedule_interval='0 2 * * *',  # Daily at 2 AM
    catchup=False,
) as dag:

    extract_task = PythonOperator(
        task_id='extract_from_postgres',
        python_callable=extract_data,
    )

    transform_task = PythonOperator(
        task_id='transform_silver',
        python_callable=transform_data,
        depends_on_past=False,  # Don't block on previous runs
    )

    load_task = PythonOperator(
        task_id='load_to_gold',
        python_callable=load_to_gold,
    )

    # Define pipeline: extract → transform → load
    extract_task >> transform_task >> load_task

Features:

  • ✅ Task dependencies (>> operator)
  • ✅ Retry logic (automatic backoff)
  • ✅ Monitoring & alerting
  • ✅ Web UI for viewing DAGs
  • ❌ Steep learning curve, complex to debug
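The retry behavior in the list above can be sketched in plain Python (a toy illustration of exponential backoff, not Airflow's internals; run_with_retries and flaky_extract are hypothetical names):

```python
import time

def run_with_retries(task, retries=3, base_delay=1.0):
    """Run a task, retrying with exponential backoff on failure."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # Out of retries: surface the failure
            # Exponential backoff: base, 2*base, 4*base, ...
            time.sleep(base_delay * 2 ** attempt)

# Toy task that fails twice, then succeeds
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("source unavailable")
    return "extracted"

print(run_with_retries(flaky_extract, base_delay=0.01))  # extracted
```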

dbt (Data Build Tool, Preferred for SQL)

-- models/staging/stg_customers.sql

WITH source_data AS (
    SELECT * FROM {{ source('raw', 'customers') }}
),

cleaned_data AS (
    SELECT 
        customer_id,
        LOWER(email) as email,
        CASE 
            WHEN segment IS NULL THEN 'Unknown'
            ELSE segment
        END as segment,
        created_date
    FROM source_data
    WHERE customer_id IS NOT NULL
),

final AS (
    SELECT 
        {{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_key,
        *
    FROM cleaned_data
)

SELECT * FROM final

dbt Advantages:

  • ✅ Pure SQL (no need for Python)
  • ✅ Built-in testing framework
  • ✅ Documentation generation
  • ✅ Lineage tracking (know which models depend on which)
  • ✅ Simpler than Airflow for SQL transforms
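The built-in tests are declared in a YAML file next to the model. A minimal sketch, assuming the stg_customers model above (not_null and unique are dbt's standard generic tests):

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_key
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
```

`dbt test` then runs each declared test as a query and fails the build if any rows violate it.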

When to use:

  • BI teams with SQL expertise
  • Data warehouse-centric pipelines
  • Want built-in testing/documentation
# dbt workflow
dbt run           # Execute all models
dbt test          # Run data quality tests
dbt docs generate # Auto-generate docs
dbt debug         # Check warehouse connection

Comparison: Airflow vs. dbt

| Feature | Airflow | dbt |
|---|---|---|
| Language | Python (flexible) | SQL (simpler) |
| Orchestration | ✅ Yes | ⚠️ Basic (via dbt Cloud or Airflow) |
| Transformations | ✅ Any language | ✅ SQL only |
| Testing | Manual | ✅ Built-in tests |
| Documentation | Manual | ✅ Auto-generated |
| Lineage | ⚠️ Manual | ✅ Automatic |
| Learning curve | Steep | Gentle |
| Best for | Complex pipelines, ML | SQL transforms, BI teams |

Modern approach: Use Airflow to orchestrate, dbt for transformations.


Scaling Patterns

Pattern 1: Horizontal Scaling (Add More Nodes)

Spark cluster: 10 servers
├─ Driver (1 server): coordinates
├─ Executor (9 servers): process in parallel
└─ Data is distributed across all executors

Cost: Linear with cluster size
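The driver/executor split can be mimicked on one machine with a thread pool; a toy sketch where the "driver" partitions the data and "executors" aggregate partitions in parallel (names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    # Executor-side work: aggregate one partition independently
    return sum(partition)

def distributed_sum(data, n_partitions=3):
    # Driver-side work: split the data, fan out, combine partial results
    partitions = [data[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)  # Final combine happens on the "driver"

print(distributed_sum(list(range(1, 101))))  # 5050
```

In real Spark the partitions live on different machines and the combine step is a shuffle, but the shape of the computation is the same.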

Pattern 2: Vertical Scaling (Bigger Machines)

Single warehouse: More CPU, More RAM
├─ Snowflake: Scale compute independently
├─ BigQuery: Slots (reserved compute per hour)
└─ Databricks: More workers per cluster

Cost: Diminishing returns (doubling resources ≠ 2x speed)
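One way to see the diminishing returns is Amdahl's law: if only part of a query parallelizes, speedup is capped by the serial remainder. A small sketch:

```python
def amdahl_speedup(parallel_fraction, resource_factor):
    # Speedup is limited by the serial fraction of the work
    serial = 1 - parallel_fraction
    return 1 / (serial + parallel_fraction / resource_factor)

# 90% of a query parallelizes; doubling resources gives only ~1.8x
print(round(amdahl_speedup(0.9, 2), 2))  # 1.82
```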

Pattern 3: Query Result Caching

Query 1: Top 10 products by revenue (5 seconds, 10 GB scanned)
      └─ Result cached
Query 2: Same query text (0.5 seconds, cache hit)
      └─ Saves the 10 GB scan and 4.5 seconds

BigQuery: Results cached for 24 hours (cache hits are free)
Snowflake: Query result cache (free, served without compute)
Databricks: Disk (Delta) cache keeps hot data on cluster-local storage
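This behavior is essentially memoization keyed on the query text. A toy analogy using Python's functools.lru_cache (run_query and the scan counter are hypothetical, standing in for the warehouse's cache):

```python
from functools import lru_cache

scans = {"count": 0}  # Track how often we actually "scan" data

@lru_cache(maxsize=128)
def run_query(sql):
    # Cache miss: do the expensive scan and remember the result
    scans["count"] += 1
    return f"result of: {sql}"

run_query("SELECT product, SUM(revenue) FROM sales GROUP BY product")
run_query("SELECT product, SUM(revenue) FROM sales GROUP BY product")
print(scans["count"])  # 1: the second identical query was a cache hit
```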


Data Warehouse Scaling: Cloud-Native Databases

BigQuery (Google Cloud)

Model: Fully managed, separate storage/compute

Storage: Managed columnar storage (scales to petabytes)
Compute: Serverless slots run SQL in parallel, even over petabyte tables
Pricing: On-demand: $6.25 per TiB scanned (only pay for what you query)

Scaling: Automatic, transparent
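Scan-based pricing makes cost estimates simple arithmetic. A quick sketch using the quoted $6.25 rate (rates vary by region and change over time):

```python
def bigquery_on_demand_cost(tib_scanned, rate_per_tib=6.25):
    """Estimated on-demand query cost in USD for data scanned."""
    return tib_scanned * rate_per_tib

# A query scanning 0.5 TiB (~500 GiB)
print(bigquery_on_demand_cost(0.5))  # 3.125
```

This is also why partitioning and clustering matter: anything that reduces bytes scanned reduces the bill directly.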

Snowflake

Model: Shared storage (S3/GCS) + separate compute clusters

Storage: S3/GCS (shared, lower cost)
Compute: Snowflake clusters (scale independently)
Pricing: $2-4 per credit (1 credit ≈ one hour of an XS warehouse)
Scaling: Warehouse size chosen manually; multi-cluster scale-out can be automatic

Databricks

Model: Delta Lake (optimized Parquet) on cloud storage + Spark

Storage: Delta format (S3/GCS, optimized compression)
Compute: Spark clusters (on EC2/GCP/Azure)
Pricing: $0.30-0.50 per DBU-hour
Scaling: Cluster autoscaling (define min/max size)

Modern Data Stack Architecture

graph TB
    A["Data Sources<br/>Apps, APIs, Logs"]
    B["Cloud Storage<br/>S3, GCS, ADLS"]
    C["Ingestion<br/>Fivetran, Stitch,<br/>Custom"]
    D["Orchestration<br/>Airflow, Dagster"]
    E["Transformation<br/>dbt, Spark SQL"]
    F["Warehouse<br/>BigQuery,<br/>Snowflake"]
    G["BI Layer<br/>Looker, Tableau,<br/>Metabase"]

    A --> C
    C --> B
    B --> D
    D --> E
    E --> F
    F --> G

Key Takeaways

Spark for distributed processing, Airflow/dbt for orchestration.

dbt is preferred for SQL transformations (built-in testing, documentation).

Airflow for complex workflows with non-SQL steps.

Horizontal scaling (more nodes) scales further than vertical scaling (bigger machines), which hits diminishing returns.

Cloud-native (BigQuery, Snowflake) handles scaling automatically.

Query result caching provides free 10-100x speedups for repeated queries.


Practice Questions

  1. When would you use Spark vs. BigQuery SQL?
  2. Why is dbt preferred over Airflow for SQL transforms?
  3. How does Spark's lazy evaluation help with performance?
  4. What's the cost advantage of cloud-native databases (BigQuery) vs. self-managed (Spark)?
  5. Design an Airflow DAG for: Extract → Transform (dbt) → Load.