07 · Big Data Platforms: Distributed Computing, Orchestration, Scaling
Level: Intermediate to Advanced · Time to read: 20 min
After reading: You'll understand distributed processing (Hadoop, Spark), orchestration tools (Airflow, dbt), and how data warehouses scale.
The Scaling Problem
Single-server warehouse:
│
├─ PostgreSQL: ~500 GB, 1-5 second queries
└─ Problem: Can't hold terabytes, can't process petabytes
│
Single-server + external compute:
│
├─ Fact table: 100 GB (on warehouse)
├─ Processing: Apache Spark (100 servers)
└─ Architecture: Data in warehouse, compute scales separately
Distributed Computing: Spark vs. Hadoop
Apache Hadoop Ecosystem (Older, Still Used)
Components:
- HDFS (Hadoop Distributed File System) — stores data across many nodes
- MapReduce — batch processing framework (map phase → shuffle → reduce phase)
```mermaid
graph TB
    A["Node 1<br/>Partition A"]
    B["Node 2<br/>Partition B"]
    C["Node 3<br/>Partition C"]
    D["Shuffle &<br/>Sort"]
    E["Reduce<br/>Output"]
    A -->|Map| D
    B -->|Map| D
    C -->|Map| D
    D --> E
```
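To make the flow concrete, here is a minimal single-process sketch of map, shuffle, and reduce in Python. It mirrors the contract a real MapReduce job follows (mappers emit key/value pairs, the framework groups by key, reducers aggregate each group), but runs locally; Hadoop distributes each phase across nodes.

```python
# Word count, the classic MapReduce example, in plain Python.
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in an input line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: aggregate all values emitted for one key
    return (word, sum(counts))

lines = ["big data big compute", "data pipelines at scale"]

# Shuffle & sort: group mapped values by key
# (Hadoop does this between the map and reduce phases)
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

results = [reducer(word, counts) for word, counts in sorted(groups.items())]
print(results)  # [('at', 1), ('big', 2), ('compute', 1), ('data', 2), ...]
```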
Cons:
- ❌ Slow (intermediate results are written to disk between the map and reduce phases)
- ❌ Java-centric (MapReduce jobs are written in Java/Scala; other languages only via Hadoop Streaming)
- ❌ Mostly replaced by Spark
Apache Spark (Modern, Preferred)
Advantages:
- ✅ In-memory processing (up to 100x faster than MapReduce on iterative workloads)
- ✅ Lazy evaluation (optimizes queries, reduces redundant computation)
- ✅ SQL interface (Spark SQL)
- ✅ Supports Python, Scala, Java, R, SQL
```python
# Spark: distributed data processing against a warehouse table
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warehouse").getOrCreate()

# Read from the warehouse (requires the spark-bigquery connector)
df = (
    spark.read
    .format("bigquery")
    .option("table", "project.dataset.fact_sales")
    .load()
)

# Transform (lazy — builds a plan, doesn't execute yet)
result = (
    df
    .filter(df.sale_date >= "2024-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total"))
    .filter(F.col("total") > 1000)
)

# Execute (an action triggers the optimized plan)
result.show()
```
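Because the transforms are lazy, Spark only builds a plan until an action such as `show()` forces execution. You can inspect what the Catalyst optimizer produced without running anything:

```python
# Print the optimized physical plan without executing the query.
# Spark pushes the date filter toward the source and prunes unused
# columns before any data moves.
result.explain()
```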
When to Use Spark
| Scenario | Use Spark? |
|---|---|
| Read 10 GB, process, write 100 MB | ✅ Yes (efficient) |
| ML/Data Science (train model on 1 TB) | ✅ Yes |
| Streaming (Kafka → data warehouse) | ✅ Yes |
| SQL aggregation (use BigQuery directly) | ❌ No (overkill) |
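For the last row of the table, a plain aggregation, the warehouse alone is enough; there is no cluster to spin up. A minimal sketch with the google-cloud-bigquery client, reusing the fact table from the Spark example:

```python
# Simple aggregations run faster and cheaper directly in the warehouse:
# BigQuery parallelizes the scan itself.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT customer_id, SUM(amount) AS total
    FROM `project.dataset.fact_sales`
    WHERE sale_date >= '2024-01-01'
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
"""
for row in client.query(sql).result():
    print(row.customer_id, row.total)
```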
Data Pipelines: Orchestration Tools
Workflow Orchestration: Managing Dependencies
ETL Pipeline:
1. Extract from PostgreSQL (1 hour)
2. Extract from Kafka logs (2 hours)
3. Wait for both extracts to complete
4. Merge & transform (1 hour)
5. Load to Gold layer (30 min)
Problem: If a step fails partway through, do you re-run the whole pipeline or restart manually from the failed step?
Solution: Orchestration tool manages dependencies and failure recovery.
Apache Airflow (Most Popular)
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract_data(**context):
    """Pull new rows from PostgreSQL into the staging area."""

def transform_data(**context):
    """Clean and conform staged rows (Silver layer)."""

def load_to_gold(**context):
    """Load transformed rows into the Gold layer."""

with DAG(
    'daily_warehouse_refresh',
    start_date=datetime(2024, 1, 1),
    schedule_interval='0 2 * * *',  # Daily at 2 AM
    catchup=False,
) as dag:

    extract_task = PythonOperator(
        task_id='extract_from_postgres',
        python_callable=extract_data,
    )

    transform_task = PythonOperator(
        task_id='transform_silver',
        python_callable=transform_data,
        depends_on_past=False,  # Don't block on the previous run's outcome
    )

    load_task = PythonOperator(
        task_id='load_to_gold',
        python_callable=load_to_gold,
    )

    # Define pipeline: extract, then transform, then load
    extract_task >> transform_task >> load_task
```
Features:
- ✅ Task dependencies (>> operator)
- ✅ Retry logic (automatic backoff; see the sketch after this list)
- ✅ Monitoring & alerting
- ✅ Web UI for viewing DAGs
- ❌ Steep learning curve, complex to debug
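Retries are configured per task or DAG-wide through `default_args`; a minimal sketch:

```python
# DAG-wide retry policy: each failed task retries up to 3 times,
# doubling the wait between attempts (exponential backoff).
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
}
# Pass into the DAG:
# DAG('daily_warehouse_refresh', default_args=default_args, ...)
```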
dbt (Data Build Tool, Preferred for SQL)
```sql
-- models/staging/stg_customers.sql
WITH source_data AS (
    SELECT * FROM {{ source('raw', 'customers') }}
),

cleaned_data AS (
    SELECT
        customer_id,
        LOWER(email) AS email,
        CASE
            WHEN segment IS NULL THEN 'Unknown'
            ELSE segment
        END AS segment,
        created_date
    FROM source_data
    WHERE customer_id IS NOT NULL
),

final AS (
    SELECT
        -- surrogate_key was renamed generate_surrogate_key in dbt_utils 1.0
        {{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_key,
        *
    FROM cleaned_data
)

SELECT * FROM final
```
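Tests for this model are declared in YAML next to the SQL, using dbt's built-in generic tests. A minimal sketch (the accepted segment values are hypothetical):

```yaml
# models/staging/stg_customers.yml — dbt built-in generic tests
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_key
        tests:
          - unique
          - not_null
      - name: segment
        tests:
          - accepted_values:
              values: ['Enterprise', 'SMB', 'Unknown']
```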
dbt Advantages:
- ✅ Pure SQL (no need for Python)
- ✅ Built-in testing framework
- ✅ Documentation generation
- ✅ Lineage tracking (know which models depend on which)
- ✅ Simpler than Airflow for SQL transforms
When to use:
- BI teams with SQL expertise
- Data warehouse-centric pipelines
- Want built-in testing/documentation
```bash
# dbt workflow
dbt run            # Execute all models
dbt test           # Run data quality tests
dbt docs generate  # Auto-generate docs
dbt debug          # Check warehouse connection
```
Comparison: Airflow vs. dbt
| Feature | Airflow | dbt |
|---|---|---|
| Language | Python (flexible) | SQL (simpler) |
| Orchestration | ✅ Yes | ⚠️ Basic (via dbt Cloud or Airflow) |
| Transformations | ✅ Any language | ✅ SQL only |
| Testing | Manual | ✅ Built-in tests |
| Documentation | Manual | ✅ Auto-generated |
| Lineage | ⚠️ Manual | ✅ Automatic |
| Learning curve | Steep | Gentle |
| Best for | Complex pipelines, ML | SQL transforms, BI teams |
Modern approach: Use Airflow to orchestrate, dbt for transformations.
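A minimal sketch of that split (task names and the dbt project path are hypothetical): Airflow owns scheduling and extraction, then shells out to dbt for the SQL layer.

```python
# Airflow orchestrates; dbt owns the SQL transformations.
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract_data(**context):
    """Placeholder: pull new rows from PostgreSQL into staging."""

with DAG('warehouse_with_dbt', start_date=datetime(2024, 1, 1),
         schedule_interval='0 2 * * *', catchup=False) as dag:

    extract = PythonOperator(task_id='extract_from_postgres',
                             python_callable=extract_data)

    dbt_run = BashOperator(task_id='dbt_run',
                           bash_command='cd /opt/dbt_project && dbt run')

    dbt_test = BashOperator(task_id='dbt_test',
                            bash_command='cd /opt/dbt_project && dbt test')

    extract >> dbt_run >> dbt_test
```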
Scaling Patterns
Pattern 1: Horizontal Scaling (Add More Nodes)
Spark cluster: 10 servers
├─ Driver (1 server): coordinates
├─ Executors (9 servers): process data partitions in parallel
└─ Data is distributed across all executors
Cost: Linear with cluster size
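As a rough sketch, the layout above maps onto standard Spark configuration keys when the session is created (sizes illustrative):

```python
# Request the cluster layout from the diagram: 9 executors working in
# parallel, each with its own CPU and memory slice. The driver coordinates.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("horizontal-scaling")
    .config("spark.executor.instances", "9")   # one executor per worker node
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)
```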
Pattern 2: Vertical Scaling (Bigger Machines)
Single warehouse: more CPU and RAM per node
├─ Snowflake: resize the virtual warehouse to a larger size
├─ BigQuery: reserve more slots (units of compute capacity)
└─ Databricks: use larger worker node types
Cost: Diminishing returns (doubling resources ≠ 2x speed)
Pattern 3: Query Result Caching
Query 1: top 10 products by revenue (5 seconds, 10 GB scanned)
└─ Result cached in memory
Query 2: Same query (0.5 seconds, cache hit)
└─ Saves 10 GB scan and 4.5 seconds
BigQuery: 24-hour result cache (cache hits are free)
Snowflake: query result cache (free, retained 24 hours)
Databricks: disk cache (remote Parquet data cached on workers' local SSDs)
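On BigQuery you can confirm a cache hit from the job metadata; a minimal sketch with the Python client (table name illustrative):

```python
# Run the same query twice; the second job should be served from the
# 24-hour result cache (cache_hit=True, zero bytes scanned).
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT product_id, SUM(revenue) AS r
    FROM `project.dataset.fact_sales`
    GROUP BY product_id ORDER BY r DESC LIMIT 10
"""

job = client.query(sql)
job.result()                      # wait for completion
print(job.cache_hit)              # True on a repeated, unchanged query
print(job.total_bytes_processed)  # 0 when served from cache
```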
Data Warehouse Scaling: Cloud-Native Databases
BigQuery (Google Cloud)
Model: Fully managed; storage and compute are separate and scale independently
Storage: Managed columnar storage, scales to petabytes
Compute: Massively parallel SQL; even petabyte-scale queries are feasible
Pricing: $6.25/TB scanned on-demand (you pay only for the bytes your queries read)
Scaling: Automatic, transparent
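Because billing is per byte scanned, it is worth estimating cost before running a query. A minimal sketch using the client's dry-run mode (table name illustrative):

```python
# Estimate query cost up front: a dry run reports the bytes that WOULD
# be scanned, without executing the query or charging anything.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT * FROM `project.dataset.fact_sales`",
                   job_config=config)

tb = job.total_bytes_processed / 1e12
print(f"~{tb:.2f} TB scanned -> ~${tb * 6.25:.2f} at $6.25/TB")
```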
Snowflake
Model: Shared cloud storage (S3/GCS) + separate compute clusters ("virtual warehouses")
Storage: S3/GCS (shared across all compute, lower cost)
Compute: Virtual warehouses scale independently of storage
Pricing: $1-4 per credit depending on edition (1 credit ≈ one compute-node hour)
Scaling: Manual warehouse sizing; multi-cluster warehouses can auto-scale out
Databricks
Model: Delta Lake (optimized Parquet) on cloud storage + Spark
Storage: Delta format (S3/GCS, optimized compression)
Compute: Spark clusters (on EC2/GCP/Azure)
Pricing: $0.30-0.50 per DBU (Databricks Unit), plus the cost of the underlying cloud VMs
Scaling: Cluster autoscaling (define min/max size)
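Autoscaling bounds live in the cluster definition itself. A minimal sketch of a Clusters API payload (cluster name, node type, and runtime version are illustrative):

```python
# Databricks cluster spec with autoscaling: the cluster grows from 2 to 8
# workers based on load, then shrinks back to control cost.
cluster_spec = {
    "cluster_name": "etl-autoscaling",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8,
    },
}
# POST cluster_spec to the Databricks Clusters API
# (e.g. /api/2.0/clusters/create)
```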
Modern Data Stack Architecture
```mermaid
graph TB
    A["Data Sources<br/>Apps, APIs, Logs"]
    B["Cloud Storage<br/>S3, GCS, ADLS"]
    C["Ingestion<br/>Fivetran, Stitch,<br/>Custom"]
    D["Orchestration<br/>Airflow, Dagster"]
    E["Transformation<br/>dbt, Spark SQL"]
    F["Warehouse<br/>BigQuery,<br/>Snowflake"]
    G["BI Layer<br/>Looker, Tableau,<br/>Metabase"]
    A --> C
    C --> B
    B --> D
    D --> E
    E --> F
    F --> G
```
Key Takeaways
✅ Spark for distributed processing, Airflow/dbt for orchestration.
✅ dbt is preferred for SQL transformations (built-in testing, documentation).
✅ Airflow for complex workflows with non-SQL steps.
✅ Horizontal scaling (more nodes) scales further than vertical scaling (bigger machines).
✅ Cloud-native (BigQuery, Snowflake) handles scaling automatically.
✅ Query result caching gives 10-100x speedups on repeated queries at no extra cost.
Practice Questions
- When would you use Spark vs. BigQuery SQL?
- Why is dbt preferred over Airflow for SQL transforms?
- How does Spark's lazy evaluation help with performance?
- What's the cost advantage of cloud-native databases (BigQuery) vs. self-managed (Spark)?
- Design an Airflow DAG for: Extract → Transform (dbt) → Load.