6.01 · Deep Dive: Query Execution Plans & Analysis
Level: Advanced | Time to read: 18 min | Pre-reading: 06 · Query Optimization
After reading: You'll understand how to read execution plans, identify bottlenecks, and optimize query performance systematically.
Understanding Query Execution Plans
Query Execution Plan = Breakdown of how database executes your query (step-by-step).
Example: Reading a BigQuery Execution Plan
Query:
SELECT
DATE(sale_date) as sale_date,
product_category,
SUM(amount) as total_revenue
FROM fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2
LIMIT 100;
Execution Plan:
┌─ Step 1: TableScan [fact_sales]
│ └─ Estimated rows: 1,000,000
│ └─ Estimated bytes: 5 GB
│ └─ Filter: sale_date >= '2024-01-01'
│ └─ Estimated rows after filter: 250,000
│
├─ Step 2: ProjectSet
│ └─ Compute: DATE(sale_date), product_category
│ └─ Estimated rows: 250,000
│ └─ Estimated bytes: 500 MB
│
├─ Step 3: Aggregate
│ └─ GROUP BY: sale_date, product_category
│ └─ Aggregate: SUM(amount)
│ └─ Estimated rows: 5,000 (grouped)
│
└─ Step 4: Limit
└─ Return top 100 rows
What To Look For
| Aspect | Good Sign | Bad Sign |
|---|---|---|
| TableScan | Only necessary columns | All columns (SELECT *) |
| Filter | Large reduction (250K → 50K) | Small reduction (250K → 240K) |
| Join | Broadcast join | Hash join on large table |
| Aggregate | Few groups | Many groups (one per row?) |
| Sort | Not present | Large sort before limit |
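How you obtain the plan varies by engine: BigQuery surfaces it in the console's Execution details tab, while most relational databases use an EXPLAIN statement. A PostgreSQL-style sketch for the query above (ANALYZE actually executes the query, so you can compare estimated vs. actual row counts at each step):

```sql
-- PostgreSQL syntax; other engines differ (e.g., BigQuery has no EXPLAIN statement).
-- ANALYZE runs the query for real; BUFFERS adds I/O detail per plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    DATE(sale_date) AS sale_date,
    product_category,
    SUM(amount) AS total_revenue
FROM fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2
LIMIT 100;
```

Large gaps between estimated and actual rows are a common sign of stale statistics and a frequent cause of bad plan choices.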
Common Query Bottlenecks
Bottleneck 1: Full Table Scan
-- ❌ BAD: Unpartitioned table — the engine scans all 100M rows,
-- then applies the filter
SELECT *
FROM fact_sales
WHERE sale_date >= '2024-01-01' -- Matches only ~10% of the data
LIMIT 100;
-- ✅ GOOD: Same query against a table partitioned by sale_date —
-- the engine prunes partitions and reads only the 2024 data
SELECT *
FROM fact_sales
WHERE sale_date >= '2024-01-01'
LIMIT 100;
-- (The SQL is identical; the win comes from partitioning the table,
-- which lets the filter skip partitions instead of scanning everything)
Cost difference (BigQuery on-demand pricing, ~$6.25/TB scanned):
❌ 100M rows × 1 KB per row = 100 GB scanned ≈ $0.63
✅ 10M rows × 1 KB per row = 10 GB scanned ≈ $0.06
Savings: 90%!
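To confirm how much a query actually scanned, check billed bytes after the fact. A sketch using BigQuery's INFORMATION_SCHEMA jobs view (the `region-us` qualifier is an assumption — use your dataset's region):

```sql
-- Sketch: recent jobs ranked by bytes scanned, with cache-hit status.
-- `region-us` is a placeholder region qualifier.
SELECT
    job_id,
    total_bytes_processed / POW(10, 9) AS gb_scanned,
    cache_hit,                          -- TRUE means the scan cost nothing
    LEFT(query, 80) AS query_preview
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
```

If the partitioned version of a query still shows the full table's bytes here, partition pruning didn't happen — a common cause is wrapping the partition column in a function inside WHERE.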
Bottleneck 2: Expensive Joins
-- ❌ BAD: Join the full fact table, then filter the large result
SELECT f.order_id, p.product_name, f.amount
FROM fact_sales f
INNER JOIN dim_product p ON f.product_key = p.product_key
WHERE f.sale_date = '2024-01-15' -- Filter applied after the join
AND p.category = 'Electronics';
-- ✅ GOOD: Filter the fact table first, then join the much smaller result
SELECT f.order_id, p.product_name, f.amount
FROM (
    SELECT order_id, product_key, amount
    FROM fact_sales
    WHERE sale_date = '2024-01-15' -- Filter applied before the join
) f
INNER JOIN dim_product p ON f.product_key = p.product_key
WHERE p.category = 'Electronics';
-- Note: modern optimizers usually push filters below the join automatically
-- (predicate pushdown), but filtering first makes the intent explicit and
-- protects you when the optimizer can't prove the rewrite is safe
Bottleneck 3: Expensive Aggregations
-- ❌ BAD: Aggregate 1M rows into ~1M groups (high-cardinality key)
SELECT customer_id, SUM(amount) as total
FROM fact_sales
GROUP BY customer_id; -- Creates 1M groups (one per customer)
-- ✅ GOOD: Aggregate 1M rows to 100 groups (fewer groups)
SELECT product_category, SUM(amount) as total
FROM fact_sales
GROUP BY product_category; -- Only 100 groups
-- Or: Use pre-aggregated table
SELECT *
FROM gold_customer_daily_summary -- Pre-computed!
WHERE date = '2024-01-15';
Optimization Techniques
Technique 1: Partitioning Strategy
-- Partitioned table by sale_date
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT * FROM staging_sales;
-- Query: Only scans Jan 2024 partition
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Partition benefits:
-- ✅ Scans only the matching partitions, not the full table
-- ✅ Faster queries (e.g., 10 seconds → 1 second)
-- ✅ Lower cost, since scan-priced engines bill by bytes read
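You can check that the partitions exist and how the data is distributed across them. In BigQuery this metadata lives in INFORMATION_SCHEMA.PARTITIONS (`my_dataset` is a placeholder):

```sql
-- Sketch: list partition sizes for fact_sales.
SELECT
    partition_id,                        -- e.g. '20240115' for daily partitions
    total_rows,
    total_logical_bytes / POW(10, 9) AS gb
FROM my_dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'fact_sales'
ORDER BY partition_id DESC;
```

Heavily skewed partition sizes (one huge partition, many tiny ones) are worth investigating — pruning helps far less when most of the data lives in one partition.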
Technique 2: Clustering Strategy
-- Clustered by frequently-filtered columns
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT * FROM staging_sales;
-- Query: Skips customer_key blocks automatically
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
AND customer_key = 12345;
-- Benefits:
-- ✅ Row groups are ordered by cluster columns
-- ✅ Can skip entire blocks that don't match
-- ✅ 2-10x faster for filtered queries
Technique 3: Materialized Views / Aggregations
-- Pre-compute daily summary
CREATE TABLE gold_daily_summary AS
SELECT
DATE(sale_date) as sale_date,
product_category,
SUM(amount) as daily_revenue,
COUNT(*) as num_transactions
FROM fact_sales
GROUP BY DATE(sale_date), product_category;
-- Query the aggregated table instead (reads ~100 rows, not 1M)
SELECT sale_date, product_category, daily_revenue
FROM gold_daily_summary
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Before: 1M rows aggregated = 2 seconds
-- After: 100 rows read from pre-aggregated = 0.1 seconds
-- Speedup: 20x
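The CREATE TABLE AS above is a snapshot you must refresh yourself; some warehouses can keep the aggregate current automatically. A BigQuery-style sketch (names are placeholders, and materialized views restrict which expressions and aggregates are allowed):

```sql
-- Sketch: an auto-maintained materialized view version of the daily summary.
-- BigQuery refreshes it incrementally and can transparently rewrite
-- matching queries against fact_sales to read from the view instead.
CREATE MATERIALIZED VIEW my_dataset.mv_daily_summary AS
SELECT
    DATE(sale_date) AS sale_date,
    product_category,
    SUM(amount) AS daily_revenue,
    COUNT(*) AS num_transactions
FROM my_dataset.fact_sales
GROUP BY sale_date, product_category;
```

The trade-off versus a manual gold table is control: a materialized view stays fresh on its own, but you give up custom refresh logic and pay for the background maintenance.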
Technique 4: Approximate Queries
-- Exact query (scans all data)
SELECT COUNT(DISTINCT customer_id)
FROM fact_sales
WHERE sale_date >= '2024-01-01'; -- 5 seconds, 10 GB scanned
-- Approximate query (fast, good enough)
SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM fact_sales
WHERE sale_date >= '2024-01-01'; -- 0.1 seconds, 100 MB scanned
-- Result: ~1,000,000 (exact: 1,000,050)
-- Accuracy: ~99.995%, with ~50x speedup and 100x less data scanned
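Approximate distinct counts can also be pre-computed per partition and merged on demand. A sketch using BigQuery's HLL_COUNT functions (table names are placeholders):

```sql
-- Sketch: store one compact HyperLogLog sketch per day instead of raw IDs.
CREATE TABLE my_dataset.daily_customer_sketch AS
SELECT
    DATE(sale_date) AS sale_date,
    HLL_COUNT.INIT(customer_id) AS customer_sketch
FROM my_dataset.fact_sales
GROUP BY sale_date;

-- Merging sketches yields an approximate distinct count over any date range
-- without rescanning the fact table.
SELECT HLL_COUNT.MERGE(customer_sketch) AS approx_distinct_customers
FROM my_dataset.daily_customer_sketch
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
```

This works because HLL sketches merge losslessly — something plain pre-aggregated COUNT(DISTINCT) columns cannot do, since distinct counts don't sum across days.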
Performance Analysis Checklist
| Question | How to Check |
|---|---|
| Is partition pruning happening? | EXPLAIN shows a partition filter / fewer bytes scanned? |
| Are joins efficient? | EXPLAIN shows broadcast join? |
| Is data aggregated sufficiently? | GROUP BY has few groups (< 100K)? |
| Are indexes being used? | EXPLAIN shows index scan, not full table scan? |
| Is data cached? | Same query runs faster second time? |
| Are rows filtered early? | Most filtering in WHERE, not in subqueries? |
Real-World Example: Optimizing a Slow Dashboard Query
Original query (45 seconds, $2.50 cost):
SELECT
f.order_date,
p.product_name,
c.customer_name,
SUM(f.amount) as revenue
FROM fact_sales f
INNER JOIN dim_product p ON f.product_key = p.product_key
INNER JOIN dim_customer c ON f.customer_key = c.customer_key
INNER JOIN dim_store s ON f.store_key = s.store_key -- joined but never used
WHERE f.order_date >= '2024-01-01'
GROUP BY f.order_date, p.product_name, c.customer_name;
Optimization 1: Use aggregated table (5 seconds, $0.25)
SELECT
sale_date,
product_name,
SUM(daily_revenue) as revenue
FROM gold_daily_revenue
WHERE sale_date >= '2024-01-01'
GROUP BY sale_date, product_name;
Optimization 2: Reduce dimensions (2 seconds, $0.10)
SELECT
DATE(sale_date) as sale_date,
product_category,
SUM(amount) as revenue
FROM fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY DATE(sale_date), product_category;
Result: 45 seconds → 2 seconds (22x faster), $2.50 → $0.10 (25x cheaper)
Key Takeaways
- Read execution plans to identify bottlenecks
- Partition tables by common filters (date, region)
- Cluster tables by frequently-filtered columns
- Use pre-aggregations for common queries
- Approximate queries when exact isn't needed
- Test performance with realistic data volumes