06 · Query Optimization: Making Data Warehouses Fast
Level: Intermediate to Advanced
Time to read: 20 min
Pre-reading: 02 · Warehouse Architecture
After reading: You'll understand columnar storage, partitioning, clustering, indexes, and cost-optimization strategies that make analytical queries fast.
The Cost Equation: Speed × Dollars
Every optimization decision in a data warehouse trades query speed against compute cost: you pay for the bytes you scan and the compute you run.
Approximate list prices (verify current pricing for your region and edition):
- BigQuery: ~$6.25 per TiB scanned (on-demand), or ~$0.04 per slot-hour under capacity pricing
- Snowflake: ~$2-4 per credit; an X-Small warehouse consumes 1 credit per hour
- Databricks: ~$0.30-0.55 per DBU
Optimizing queries directly reduces cost and improves speed.
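The scan-cost arithmetic can be sketched in a few lines. This is an illustration using the approximate on-demand price quoted above, not an official pricing formula:

```python
# Estimating on-demand query cost from bytes scanned.
# The $/TB figure is the approximate BigQuery on-demand list price
# quoted above — treat it as illustrative, not authoritative.

USD_PER_TB_SCANNED = 6.25  # approximate on-demand price

def scan_cost_usd(bytes_scanned: float) -> float:
    """Cost in USD of a query that scans `bytes_scanned` bytes on demand."""
    return (bytes_scanned / 1e12) * USD_PER_TB_SCANNED

full_scan = scan_cost_usd(500e9)  # 0.5 TB × $6.25 = $3.125
pruned = scan_cost_usd(25e9)      # 0.025 TB × $6.25 ≈ $0.16
print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.2f}")
```

Pruning 95% of the data cuts the bill by the same 95%, which is why the rest of this section focuses on scanning less.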
Optimization Strategy 1: Column Selection (Columnar Storage)
The Problem: Row-Oriented Databases
-- Fact table (row-oriented OLTP)
CREATE TABLE fact_sales (
sale_id INT, -- 4 bytes
customer_id INT, -- 4 bytes
product_id INT, -- 4 bytes
store_id INT, -- 4 bytes
amount DECIMAL, -- 16 bytes
quantity INT, -- 4 bytes
cost DECIMAL, -- 16 bytes
discount DECIMAL, -- 16 bytes
-- Total: ~68 bytes per row
);
-- ROW STORAGE: Like a spreadsheet row
| 1001 | 5432 | 789 | 2 | 99.99 | 1 | 50.00 | 5.00 |
| 1002 | 5433 | 790 | 2 | 149.99| 1 | 75.00 | 15.00|
| 1003 | 5434 | 791 | 2 | 199.99| 2 |100.00 | 20.00|
-- Query: What's total sales amount?
SELECT SUM(amount) FROM fact_sales;
-- Row-oriented: Must read ALL 8 columns (~68 bytes) per row
-- 1,000,000 rows × 68 bytes = 68 MB to answer one question!
The Solution: Columnar Storage
COLUMNAR (Column-oriented) storage:
sale_id:    | 1001 | 1002 | 1003 | ...  | (4 MB)
customer_id:| 5432 | 5433 | 5434 | ...  | (4 MB)
product_id: | 789  | 790  | 791  | ...  | (4 MB)
store_id:   | 2    | 2    | 2    | ...  | (4 MB)
amount:     | 99.99| 149.99| 199.99| ...| (16 MB) ← Only read this!
quantity:   | 1    | 1    | 2    | ...  | (4 MB)
cost:       | 50   | 75   | 100  | ...  | (16 MB)
discount:   | 5    | 15   | 20   | ...  | (16 MB)
Query: SUM(amount)
Columnar: Only read the amount column (16 MB) — roughly 4x less I/O!
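A back-of-the-envelope sketch of that I/O difference, using the hypothetical fact_sales column sizes above (the eight columns sum to ~68 bytes per row):

```python
# Bytes read for SUM(amount): row-oriented vs. columnar layout.
# Per-value sizes follow the hypothetical fact_sales schema above.

COLUMN_BYTES = {
    "sale_id": 4, "customer_id": 4, "product_id": 4, "store_id": 4,
    "amount": 16, "quantity": 4, "cost": 16, "discount": 16,
}
N_ROWS = 1_000_000

# Row-oriented: every row is read whole, all 8 columns
row_oriented = sum(COLUMN_BYTES.values()) * N_ROWS   # 68 MB

# Columnar: only the amount column is touched
columnar = COLUMN_BYTES["amount"] * N_ROWS           # 16 MB

print(f"{row_oriented / columnar:.2f}x less I/O")    # 4.25x
```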
Compression: Why Columnar is Much Smaller
Columns are highly repetitive (e.g., store_id = 2 for 100,000 consecutive rows), and repetition compresses well:
Row-oriented: 100,000 rows × 68 bytes = 6.8 MB (mixed types within each row compress poorly)
Columnar:
- amount (all decimals): 100,000 × 16 bytes = 1.6 MB → ~400 KB (~4x, dictionary/delta encoding)
- store_id (small integers, 1-10): 100,000 × 4 bytes = 400 KB → ~40 KB (~10x, run-length encoding)
- Overall: ~10x compression is typical
Result: 6.8 MB → ~0.68 MB (10x smaller!)
BigQuery uses: Capacitor columnar format (queried by the Dremel engine) → often ~10x compression
Snowflake uses: proprietary columnar micro-partitions → typically 5-20x compression
Databricks uses: Delta Lake (columnar Parquet files) → typically 5-20x compression
Optimization Strategy 2: Partitioning (Divide & Conquer)
Without Partitioning
CREATE TABLE gold.fact_sales (
sale_id INT,
sale_date DATE,
customer_id INT,
amount DECIMAL,
...
);
-- Query: Sales for date range (common case)
SELECT SUM(amount) FROM gold.fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31';
-- Without partitioning: Must scan entire table (1 billion rows)
-- With partitioning (below), the engine can skip irrelevant data:
With Partition Pruning
-- BigQuery: declare a date partition when creating the table
CREATE TABLE gold.fact_sales
PARTITION BY DATE(sale_date)
AS SELECT ...;
-- Query: Scans only 3 months of data (not 5 years)
SELECT SUM(amount) FROM gold.fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31';
-- Partition pruning: Query engine skips unneeded partitions
-- 5 years total = 1,800 days × 1M rows/day = 1.8B rows
-- 3 months = 90 days × 1M rows/day = 90M rows
-- 20x reduction in data scanned!
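Partition pruning can be sketched as the planner consulting partition metadata before touching any data. A toy model with hypothetical row counts:

```python
# Toy model of partition pruning: the planner skips whole partitions
# whose date key falls outside the WHERE range, without reading them.

from datetime import date, timedelta

# ~5 years of daily partitions, 1M rows each (metadata only, in this sketch)
partitions = {date(2020, 1, 1) + timedelta(days=i): 1_000_000
              for i in range(1_800)}

def rows_scanned(start: date, end: date) -> int:
    # Only partitions whose key lies inside the filter range are read
    return sum(n for d, n in partitions.items() if start <= d <= end)

full = sum(partitions.values())                          # 1.8B rows
q1 = rows_scanned(date(2024, 1, 1), date(2024, 3, 31))   # 91 days → 91M rows
print(f"~{full / q1:.0f}x reduction")                    # ~20x
```

The crucial point: pruning is metadata-only. The engine never opens the skipped partitions, which is why the WHERE clause must reference the partition column directly.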
Partitioning Strategies
| Strategy | Pros | Cons |
|---|---|---|
| Date | ✅ Natural for time-series data; partition pruning is easy | ❌ Partition sizes can be uneven (seasonal spikes, short months) |
| Range (e.g., age 0-18, 19-35) | ✅ Even distribution | ❌ Less natural, queries need rewriting |
| Hash | ✅ Even distribution | ❌ Loses temporal locality |
| List (e.g., country codes) | ✅ Natural for categorical | ❌ Many partitions (metadata overhead) |
Best practice: Partition by DATE (or MONTH for large tables).
Optimization Strategy 3: Clustering (Speed Up JOINs)
Without Clustering
-- Fact table and customer dimension (not clustered)
SELECT SUM(f.amount), c.segment
FROM gold.fact_sales f
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
WHERE f.sale_date = '2024-12-25'
GROUP BY c.segment;
-- For each sale row, must find matching customer
-- Without clustering: matching rows are scattered across many blocks (expensive random reads)
With Clustering
-- BigQuery: Cluster fact by customer_key + date
CREATE TABLE gold.fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT ...;
-- Similarly for dimension
CREATE TABLE gold.dim_customer
CLUSTER BY customer_key
AS SELECT ...;
-- Query benefit: Data locality
-- - All rows with customer_key = 5432 are stored together
-- - JOIN is much faster (sequential read vs. seeks)
-- - Second-level: product_key clustering helps GROUP BY
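Under the hood, clustering pairs with block-level min/max metadata (often called zone maps): sorted data gives each storage block a narrow key range, so most blocks can be skipped. A toy sketch:

```python
# Toy zone-map sketch: clustered (sorted) data lets the engine skip
# storage blocks whose [min, max] key range cannot match the filter.

def make_blocks(keys, block_size):
    return [(min(chunk), max(chunk), chunk)
            for chunk in (keys[i:i + block_size]
                          for i in range(0, len(keys), block_size))]

def blocks_touched(blocks, key):
    # Only blocks whose min/max range could contain `key` are read
    return [b for b in blocks if b[0] <= key <= b[1]]

# 10,000 rows clustered by customer_key, 1,000 rows per block
clustered = make_blocks(sorted(range(10_000)), 1_000)
assert len(blocks_touched(clustered, 5_432)) == 1  # 1 of 10 blocks read

# Without clustering, keys are scattered: every block's [min, max]
# spans nearly the whole key space, so almost nothing can be skipped.
```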
When to Cluster
| Column Type | Cluster? | Reason |
|---|---|---|
| Foreign key (dimension ID) | ✅ Yes | Enables fast JOINs |
| GROUP BY / filter column | ✅ Yes | Speeds aggregation and block pruning |
| High cardinality (100K+ unique) | ✅ Often | Clustering (unlike partitioning) handles high cardinality well |
| Very low cardinality (a handful of values) | ❌ No | Little pruning benefit; consider partitioning instead |
Optimization Strategy 4: Indexes
OLAP Doesn't Use Indexes (Mostly)
-- OLTP: B-tree index on frequently filtered column
CREATE INDEX idx_customer_id ON orders(customer_id);
-- Query planner: Use index to find 5 rows quickly
SELECT * FROM orders WHERE customer_id = 12345;
-- OLAP: Column-oriented storage (plus block-level min/max metadata) is the index!
SELECT SUM(amount) FROM fact_sales WHERE customer_id = 12345;
-- Columnar: No separate index needed; the column store itself supports pruning
Exception: Snowflake offers the Search Optimization Service (effectively a hidden index) for point lookups.
When Indexes Help in OLAP
- ❌ Fact table scans (use partitioning + clustering instead)
- ✅ Dimension tables (small enough that index matters)
- ✅ High-cardinality WHERE clauses on small tables
Optimization Strategy 5: Materialized Views (Pre-aggregation)
Without Pre-aggregation
-- Query: Monthly revenue by product
SELECT
DATE_TRUNC('MONTH', sale_date) as month,
product_id,
SUM(amount) as revenue
FROM gold.fact_sales
GROUP BY 1, 2;
-- Runs full scan + aggregation every time (expensive!)
With Materialized View
-- Pre-compute common aggregation
CREATE MATERIALIZED VIEW gold.mv_monthly_revenue AS
SELECT
DATE_TRUNC('MONTH', sale_date) as month,
product_id,
SUM(amount) as revenue,
COUNT(*) as num_transactions
FROM gold.fact_sales
GROUP BY 1, 2;
-- Query: Uses pre-materialized result (instant!)
SELECT * FROM gold.mv_monthly_revenue
WHERE month = '2024-12-01';
Cost: Extra storage; must be refreshed as the fact table changes
Benefit: Query latency: e.g., 50 seconds → 0.5 seconds
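Conceptually, a materialized view is a GROUP BY computed once on a refresh schedule and stored, so reads hit the small result instead of the fact table. A toy sketch with hypothetical rows:

```python
# Toy sketch: a materialized view as a precomputed, periodically
# refreshed aggregate — queries read the small result, not the fact table.

from collections import defaultdict

fact_sales = [  # (month, product_id, amount) — hypothetical rows
    ("2024-12", 789, 99.99),
    ("2024-12", 789, 149.99),
    ("2024-11", 790, 50.00),
]

def refresh_mv(rows):
    """Recompute SUM(amount) and COUNT(*) per (month, product_id)."""
    mv = defaultdict(lambda: [0.0, 0])
    for month, product_id, amount in rows:
        mv[(month, product_id)][0] += amount  # SUM(amount)
        mv[(month, product_id)][1] += 1       # COUNT(*)
    return dict(mv)

mv_monthly_revenue = refresh_mv(fact_sales)   # runs on a schedule, not per query
revenue, n = mv_monthly_revenue[("2024-12", 789)]
assert n == 2 and abs(revenue - 249.98) < 1e-9
```

Real engines also do incremental refresh and, in some warehouses, automatic query rewrite: a query against the base table is silently answered from the MV when the results match.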
When to Use Materialized Views
| Scenario | Use MV? |
|---|---|
| Expensive aggregation (billions of rows → small result) | ✅ Yes |
| Static or slowly changing data | ✅ Yes |
| Frequently updated data (e.g., hourly loads) | ⚠️ Refresh cost can be high |
| Simple query (already fast) | ❌ No |
Optimization Strategy 6: Denormalization (Intentional Redundancy)
Without Denormalization
-- Normalized: Multiple JOINs
SELECT
f.amount,
c.customer_name,
c.segment,
p.product_name,
p.category,
d.month
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE f.sale_date > '2024-01-01';
With Denormalization (Gold Layer Fact)
-- Denormalized: Store dimension attributes in fact
CREATE TABLE gold.fact_sales_enriched AS
SELECT
f.amount,
c.customer_name, -- Copied from dim_customer
c.segment,
p.product_name, -- Copied from dim_product
p.category,
d.month -- Copied from dim_date
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key;
-- Query: No JOINs needed!
SELECT SUM(amount), segment
FROM gold.fact_sales_enriched
GROUP BY segment;
Cost: 3-5x storage (attributes copied)
Benefit: Query speed: 30 seconds → 1 second
Cost Optimization Checklist
| Optimization | Effort | Impact | Priority |
|---|---|---|---|
| Partition by date | Easy | 10-100x data reduction | ⭐ High |
| Select only needed columns | Easy | 5-10x data reduction | ⭐ High |
| Cluster on FK/GROUP BY | Medium | 2-5x JOIN speed | 🔹 Medium |
| Materialized views | Medium | 10-1000x for heavy queries | 🔹 Medium |
| Denormalize heavily-used tables | High | 10-100x query speed | 🔹 Medium |
| Add indexes to dimensions | Low | 2-5x for tiny dimensions | Low |
Query Performance Profiling
Before Optimization: Profile First
-- BigQuery: no EXPLAIN statement; use a dry run to estimate bytes scanned
-- (bq query --dry_run, or the console's query validator), then check the
-- Execution Details panel for stages and shuffle bytes
SELECT SUM(amount) FROM gold.fact_sales
WHERE customer_id = 12345;
-- Snowflake: Use EXPLAIN
EXPLAIN SELECT ...;
-- Shows: Table scans, join order, partitions pruned, filters pushed down
Common Bottlenecks
| Symptom | Cause | Fix |
|---|---|---|
| Query scans entire table | No partition pruning | Add WHERE on partition column |
| High shuffle bytes (JOIN expensive) | No clustering on FK | Cluster on join column |
| Full scan on small table | Accessing dimension | No action needed (dimension is small) |
| Slow GROUP BY | High cardinality column | Pre-aggregate with MV |
| Billions of bytes scanned | Querying all columns | SELECT only needed columns |
Summary Table: Optimization Techniques
| Technique | Reduces | Effort | Industry Use |
|---|---|---|---|
| Columnar storage | Data scanned (80-90%) | Built-in | ✅ All (BigQuery, Snowflake, Databricks) |
| Partitioning | Data scanned (90%) | Low | ✅ Standard |
| Clustering | JOIN time (50-80%) | Medium | ✅ BigQuery, Databricks |
| Materialized views | Query time (90%) | Medium | ✅ Common for heavy aggregations |
| Denormalization | Query time (80%) | High | ✅ Gold layer standard |
| Indexes | Niche (dimension lookup) | Low | ⚠️ Rarely needed |
Deep-Dives
→ Deep Dive: Query Execution Plans — Reading EXPLAIN, cost estimation, join order optimization.
Key Takeaways
✅ Columnar storage is the foundation: often ~10x compression vs. row-oriented, and you read only the columns you query.
✅ Partition pruning (WHERE on partition column) = 10-100x data reduction.
✅ Clustering on foreign keys speeds JOINs 2-5x.
✅ Materialize expensive aggregations (pre-compute common queries).
✅ Profile first: Use EXPLAIN to find actual bottlenecks before optimizing.
✅ Cost equation: Faster queries = smaller data scanned = lower cost.
Practice Questions
- Why is columnar storage better than row-oriented for analytics?
- How does partition pruning reduce data scanned?
- When would you cluster a fact table? What columns?
- Should you create indexes on a BigQuery fact table? Why/why not?
- Design a materialized view for a slow query (GROUP BY 50 columns).