06 · Query Optimization: Making Data Warehouses Fast

Level: Intermediate to Advanced
Time to read: 20 min
Pre-reading: 02 · Warehouse Architecture
After reading: You'll understand columnar storage, partitioning, clustering, indexes, and cost optimization strategies that make analytical queries fast.


The Cost Equation: Speed × Dollars

Every optimization decision in a data warehouse is a trade-off:

\[\text{Query Cost} = (\text{Data Scanned} \times \text{Cost per GB}) + (\text{Compute Time} \times \text{Compute Cost})\]

Examples:

  • BigQuery (on-demand): ~$6.25 per TiB scanned; capacity pricing bills per slot-hour instead
  • Snowflake: ~$2-4 per credit depending on edition; a running warehouse consumes credits per hour (1 credit/hour for an X-Small)
  • Databricks: ~$0.15-0.70 per DBU, consumed per hour of compute

Optimizing queries directly reduces cost and improves speed.
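
A quick worked example at BigQuery's on-demand rate: a query that scans 2 TB costs about 2 × $6.25 = $12.50 per run. If partition pruning cuts the scan to 100 GB, the same query costs about 0.1 × $6.25 ≈ $0.63. Run hourly, that is roughly $9,000/month versus $450/month.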


Optimization Strategy 1: Column Selection (Columnar Storage)

The Problem: Row-Oriented Databases

-- Fact table (row-oriented OLTP)
CREATE TABLE fact_sales (
  sale_id INT,          -- 4 bytes
  customer_id INT,      -- 4 bytes
  product_id INT,       -- 4 bytes
  store_id INT,         -- 4 bytes
  amount DECIMAL,       -- 16 bytes
  quantity INT,         -- 4 bytes
  cost DECIMAL,         -- 16 bytes
  discount DECIMAL,     -- 16 bytes
  -- Total: ~68 bytes per row
);

-- ROW STORAGE: Like a spreadsheet row
| 1001 | 5432 | 789 | 2 | 99.99 | 1 | 50.00 | 5.00 |
| 1002 | 5433 | 790 | 2 | 149.99| 1 | 75.00 | 15.00|
| 1003 | 5434 | 791 | 2 | 199.99| 2 |100.00 | 20.00|

-- Query: What's total sales amount?
SELECT SUM(amount) FROM fact_sales;

-- Row-oriented: Must read ALL 8 columns (~68 bytes) per row
-- 1,000,000 rows × 68 bytes ≈ 68 MB to answer one question!

The Solution: Columnar Storage

COLUMNAR (Column-oriented) storage:

sale_id:    | 1001 | 1002 | 1003 | ... |  (4 MB)
customer_id:| 5432 | 5433 | 5434 | ... |  (4 MB)
product_id: | 789  | 790  | 791  | ... |  (4 MB)
store_id:   | 2    | 2    | 2    | ... |  (4 MB)
amount:     | 99.99| 149.99| 199.99| ...|  (16 MB) ← Only read this!
quantity:   | 1    | 1    | 2    | ... |  (4 MB)
cost:       | 50   | 75   | 100  | ... |  (16 MB)
discount:   | 5    | 15   | 20   | ... |  (16 MB)

Query: SUM(amount)
Columnar: Only read the amount column: 16 MB instead of 68 MB (over 4x less data!)
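
The practical corollary: avoid SELECT * on wide fact tables. A minimal sketch using the example table above:

-- Bad: reads all 8 columns (~68 MB here)
SELECT * FROM fact_sales;

-- Good: reads only the column the question needs (~16 MB)
SELECT SUM(amount) FROM fact_sales;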

Compression: Why Columnar is Much Smaller

Columns are highly repetitive (e.g., store_id = 2 for 100K consecutive rows), and compression loves repetition:

Row-oriented: 100,000 × 68 bytes = 6.8 MB (mixed types interleaved, compresses poorly)
Columnar:
  - store_id (only values 1-10): run-length encoding collapses "2 repeated 100,000 times" to a few bytes
  - amount (16-byte decimals): 1.6 MB raw → roughly 400 KB with dictionary and delta encoding
  - Total: often 5-10x smaller than the uncompressed row layout

Result: 6.8 MB → well under 1 MB

BigQuery uses: Capacitor columnar format (queried by the Dremel engine)
Snowflake uses: proprietary columnar micro-partitions
Databricks uses: Delta Lake (columnar Parquet files)

All three typically achieve roughly 3-10x compression on analytical data; the exact ratio depends on cardinality and sort order.


Optimization Strategy 2: Partitioning (Divide & Conquer)

Without Partitioning

CREATE TABLE gold.fact_sales (
  sale_id INT,
  sale_date DATE,
  customer_id INT,
  amount DECIMAL,
  ...
);

-- Query: Sales for date range (common case)
SELECT SUM(amount) FROM gold.fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31';

-- Without partitioning: must scan the entire table (1.8 billion rows in the example below)
-- With partitioning (next example): only the matching partitions are read

With Partition Pruning

-- BigQuery: declare a daily partition on the date column
CREATE TABLE gold.fact_sales
PARTITION BY DATE(sale_date)
AS SELECT ...;

-- Query: Scans only 3 months of data (not 5 years)
SELECT SUM(amount) FROM gold.fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31';

-- Partition pruning: Query engine skips unneeded partitions
-- 5 years total = 1,800 days × 1M rows/day = 1.8B rows
-- 3 months = 90 days × 1M rows/day = 90M rows
-- 20x reduction in data scanned!
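
The same idea in Databricks SQL (a sketch; Delta Lake tables declare partitions at creation). Snowflake, by contrast, has no explicit PARTITION BY: it splits tables into micro-partitions automatically and relies on clustering keys (next section) for pruning.

-- Databricks (Delta Lake): one partition per sale_date value
CREATE TABLE gold.fact_sales (
  sale_id INT,
  sale_date DATE,
  customer_id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (sale_date);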

Partitioning Strategies

| Strategy | Pros | Cons |
|---|---|---|
| Date | ✅ Natural for time-series data; partition pruning is easy | ❌ Uneven partition sizes (Feb vs. Jan) |
| Range (e.g., age 0-18, 19-35) | ✅ Even distribution | ❌ Less natural; queries need rewriting |
| Hash | ✅ Even distribution | ❌ Loses temporal locality |
| List (e.g., country codes) | ✅ Natural for categorical data | ❌ Many partitions (metadata overhead) |

Best practice: Partition by day for time-series data; use MONTH granularity when daily partitions would be too numerous or too small (long history, low daily volume), as sketched below.
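
In BigQuery, switching to month granularity is a one-line change (a sketch; only the DATE_TRUNC granularity differs from the earlier example):

-- Monthly partitions: ~60 partitions for 5 years instead of ~1,800
CREATE TABLE gold.fact_sales
PARTITION BY DATE_TRUNC(sale_date, MONTH)
AS SELECT ...;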


Optimization Strategy 3: Clustering (Speed Up JOINs)

Without Clustering

-- Fact table and customer dimension (not clustered)
SELECT SUM(f.amount), c.segment
FROM gold.fact_sales f
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
WHERE f.sale_date = '2024-12-25'
GROUP BY c.segment;

-- For each sale row, must find matching customer
-- Without clustering: Random disk seeks (expensive!)

With Clustering

-- BigQuery: partition by date, cluster by customer_key then product_key
CREATE TABLE gold.fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT ...;

-- Similarly for dimension
CREATE TABLE gold.dim_customer
CLUSTER BY customer_key
AS SELECT ...;

-- Query benefit: Data locality
-- - All rows with customer_key = 5432 are stored together
-- - JOIN is much faster (sequential read vs. seeks)
-- - Second-level: product_key clustering helps GROUP BY

When to Cluster

| Column Type | Cluster? | Reason |
|---|---|---|
| Foreign key (dimension ID) | ✅ Yes | Enables fast JOINs |
| GROUP BY column | ✅ Yes | Speeds aggregation |
| Extremely high cardinality (unique per row, e.g., UUIDs) | ⚠️ Depends | Snowflake recluster cost can outweigh the benefit; BigQuery handles it well |
| Column never filtered or joined on | ❌ No | Clustering overhead > benefit |
| Dimension table's own key | ✅ Yes | Makes fact-to-dim lookups faster |
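
Snowflake's counterpart to BigQuery's CLUSTER BY is a clustering key on the table (a sketch; a background service then keeps micro-partitions sorted by these columns):

-- Snowflake: co-locate rows by date, then customer
ALTER TABLE gold.fact_sales CLUSTER BY (sale_date, customer_key);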

Optimization Strategy 4: Indexes

OLAP Doesn't Use Indexes (Mostly)

-- OLTP: B-tree index on frequently filtered column
CREATE INDEX idx_customer_id ON orders(customer_id);

-- Query planner: Use index to find 5 rows quickly
SELECT * FROM orders WHERE customer_id = 12345;

-- OLAP: Column-oriented storage is the index!
SELECT SUM(amount) FROM fact_sales WHERE customer_id = 12345;

-- Columnar: no separate index needed; per-block min/max metadata lets the engine skip blocks

Exception: Snowflake offers a search optimization service that acts like a hidden index for point lookups.
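
A sketch of enabling it (the ON EQUALITY column list is an assumption about which lookups matter for your workload):

-- Snowflake: speeds up point lookups without a user-managed index
ALTER TABLE gold.dim_customer ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id);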

When Indexes Help in OLAP

  • ❌ Fact table scans (use partitioning + clustering instead)
  • ✅ Dimension tables (small enough that index matters)
  • ✅ High-cardinality WHERE clauses on small tables

Optimization Strategy 5: Materialized Views (Pre-aggregation)

Without Pre-aggregation

-- Query: Monthly revenue by product
SELECT 
  DATE_TRUNC('MONTH', sale_date) as month,
  product_id,
  SUM(amount) as revenue
FROM gold.fact_sales
GROUP BY 1, 2;

-- Runs full scan + aggregation every time (expensive!)

With Materialized View

-- Pre-compute common aggregation
CREATE MATERIALIZED VIEW gold.mv_monthly_revenue AS
SELECT 
  DATE_TRUNC('MONTH', sale_date) as month,
  product_id,
  SUM(amount) as revenue,
  COUNT(*) as num_transactions
FROM gold.fact_sales
GROUP BY 1, 2;

-- Query: Uses pre-materialized result (instant!)
SELECT * FROM gold.mv_monthly_revenue
WHERE month = '2024-12-01';

Cost: extra storage, and the view must be refreshed as the base table changes
Benefit: query latency: ~50 seconds → ~0.5 seconds

When to Use Materialized Views

| Scenario | Use MV? |
|---|---|
| Expensive aggregation (millions of rows collapsed to ~1,000 groups) | ✅ Yes |
| Static or slowly changing data | ✅ Yes |
| Frequently updated data | ⚠️ Refresh cost may outweigh the savings |
| Simple query (already fast) | ❌ No |
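
One convenience worth knowing: BigQuery and Snowflake can transparently rewrite a query on the base table to read from a matching materialized view, so even this query (a sketch; exact matching rules vary by engine) may be served from mv_monthly_revenue:

-- May be answered from the MV without referencing it
SELECT DATE_TRUNC('MONTH', sale_date) as month, SUM(amount) as revenue
FROM gold.fact_sales
GROUP BY 1;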

Optimization Strategy 6: Denormalization (Intentional Redundancy)

Without Denormalization

-- Normalized: Multiple JOINs
SELECT 
  f.amount,
  c.customer_name,
  c.segment,
  p.product_name,
  p.category,
  d.month
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE f.sale_date > '2024-01-01';

With Denormalization (Gold Layer Fact)

-- Denormalized: Store dimension attributes in fact
CREATE TABLE gold.fact_sales_enriched AS
SELECT 
  f.amount,
  c.customer_name,     -- Copied from dim_customer
  c.segment,
  p.product_name,      -- Copied from dim_product
  p.category,
  d.month              -- Copied from dim_date
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key;

-- Query: No JOINs needed!
SELECT SUM(amount), segment 
FROM gold.fact_sales_enriched
GROUP BY segment;

Cost: 3-5x storage (attributes copied), and the table must be rebuilt when dimension attributes change
Benefit: query speed: ~30 seconds → ~1 second


Cost Optimization Checklist

| Optimization | Effort | Impact | Priority |
|---|---|---|---|
| Partition by date | Easy | 10-100x data reduction | ⭐ High |
| Select only needed columns | Easy | 5-10x data reduction | ⭐ High |
| Cluster on FK/GROUP BY columns | Medium | 2-5x JOIN speed | 🔹 Medium |
| Materialized views | Medium | 10-1000x for heavy queries | 🔹 Medium |
| Denormalize heavily-used tables | High | 10-100x query speed | 🔹 Medium |
| Add indexes to dimensions | Low | 2-5x for tiny dimensions | Low |

Query Performance Profiling

Before Optimization: Profile First

-- BigQuery: there is no EXPLAIN statement. Estimate cost with a dry run
-- (bq query --dry_run, or dryRun=true in the API), which reports the bytes
-- a query would scan without running it. After execution, the console's
-- Execution Details tab shows stages, shuffle bytes, and slot time.

-- Snowflake: EXPLAIN shows the plan without executing the query
EXPLAIN SELECT SUM(amount) FROM gold.fact_sales
WHERE customer_id = 12345;

-- Shows: table scans, join order, partitions scanned vs. pruned

Common Bottlenecks

| Symptom | Cause | Fix |
|---|---|---|
| Query scans entire table | No partition pruning | Add WHERE on the partition column |
| High shuffle bytes (expensive JOIN) | No clustering on the join key | Cluster on the join column |
| Full scan on small table | Dimension access | No action needed (dimension is small) |
| Slow GROUP BY | High-cardinality grouping column | Pre-aggregate with an MV |
| Billions of bytes scanned | Querying all columns | SELECT only needed columns |
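
To find which queries deserve this effort, rank recent jobs by bytes scanned. A sketch against BigQuery's INFORMATION_SCHEMA (the region qualifier and 7-day window are assumptions):

-- BigQuery: the 10 most expensive queries of the past week
SELECT query, total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC
LIMIT 10;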

Summary Table: Optimization Techniques

| Technique | Reduces | Effort | Industry Use |
|---|---|---|---|
| Columnar storage | Data scanned (80-90%) | Built-in | ✅ All (BigQuery, Snowflake, Databricks) |
| Partitioning | Data scanned (90%) | Low | ✅ Standard |
| Clustering | JOIN time (50-80%) | Medium | ✅ BigQuery, Snowflake, Databricks |
| Materialized views | Query time (90%) | Medium | ✅ Common for heavy aggregations |
| Denormalization | Query time (80%) | High | ✅ Gold layer standard |
| Indexes | Niche (dimension lookups) | Low | ⚠️ Rarely needed |

Deep-Dives

Deep Dive: Query Execution Plans — Reading EXPLAIN, cost estimation, join order optimization.


Key Takeaways

Columnar storage is the foundation: read only the columns you need, with typically 3-10x compression on top.

Partition pruning (WHERE on partition column) = 10-100x data reduction.

Clustering on foreign keys speeds JOINs 2-5x.

Materialize expensive aggregations (pre-compute common queries).

Profile first: Use EXPLAIN to find actual bottlenecks before optimizing.

Cost equation: scanning less data makes queries both faster and cheaper.


Practice Questions

  1. Why is columnar storage better than row-oriented for analytics?
  2. How does partition pruning reduce data scanned?
  3. When would you cluster a fact table? What columns?
  4. Should you create indexes on a BigQuery fact table? Why/why not?
  5. Design a materialized view for a slow query (GROUP BY 50 columns).