6.01 · Deep Dive: Query Execution Plans & Analysis

Level: Advanced
Time to read: 18 min
Pre-reading: 06 · Query Optimization
After reading: You'll understand how to read execution plans, identify bottlenecks, and optimize query performance systematically.


Understanding Query Execution Plans

Query Execution Plan = the database's step-by-step breakdown of how it executes your query.
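How you obtain a plan varies by engine: BigQuery surfaces it in the console's Execution details tab after a query runs (and in the job statistics), while warehouses like PostgreSQL, Snowflake, and Redshift expose it in SQL via an EXPLAIN prefix. A minimal sketch of the latter:

-- In engines that support it, prefix the query to see the plan
-- without executing it (PostgreSQL additionally offers
-- EXPLAIN ANALYZE, which runs the query and reports actual rows)
EXPLAIN
SELECT product_category, SUM(amount) as total_revenue
FROM fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY product_category;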

Example: Reading a BigQuery Execution Plan

Query:
SELECT 
  DATE(sale_date) as sale_date,
  product_category,
  SUM(amount) as total_revenue
FROM fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY 1, 2
LIMIT 100;

Execution Plan:
┌─ Step 1: TableScan [fact_sales]
│  └─ Estimated rows: 1,000,000
│  └─ Estimated bytes: 5 GB
│  └─ Filter: sale_date >= '2024-01-01'
│     └─ Estimated rows after filter: 250,000
├─ Step 2: ProjectSet
│  └─ Compute: DATE(sale_date), product_category
│  └─ Estimated rows: 250,000
│  └─ Estimated bytes: 500 MB
├─ Step 3: Aggregate
│  └─ GROUP BY: sale_date, product_category
│  └─ Aggregate: SUM(amount)
│  └─ Estimated rows: 5,000 (grouped)
└─ Step 4: Limit
   └─ Return top 100 rows

What To Look For

Aspect      Good Sign                      Bad Sign
TableScan   Only necessary columns         All columns (SELECT *)
Filter      Large reduction (250K → 50K)   Small reduction (250K → 240K)
Join        Broadcast join                 Hash join on large table
Aggregate   Few groups                     Many groups (one per row?)
Sort        Not present                    Large sort before limit
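The TableScan row is often the easiest win: columnar engines read (and bill for) only the columns you reference, so trimming the SELECT list shrinks the scan directly. A quick illustration, assuming fact_sales has roughly 20 columns (the column count is illustrative):

-- Reads every column of every matching row
SELECT *
FROM fact_sales
WHERE sale_date >= '2024-01-01';

-- Reads only 2 of ~20 columns: roughly a tenth of the bytes scanned
SELECT sale_date, amount
FROM fact_sales
WHERE sale_date >= '2024-01-01';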

Common Query Bottlenecks

Bottleneck 1: Full Table Scan

-- ❌ BAD: Unpartitioned table. The engine scans all 100M rows,
-- then applies the filter.
SELECT *
FROM fact_sales
WHERE sale_date >= '2024-01-01'  -- Matches only ~10% of the data
LIMIT 100;

-- ✅ GOOD: The exact same query against a table partitioned by
-- sale_date reads only the matching 2024 partitions.
SELECT *
FROM fact_sales
WHERE sale_date >= '2024-01-01'
LIMIT 100;

Cost difference (at BigQuery's ~$6.25/TB on-demand rate):
 100M rows × 1 KB per row = 100 GB scanned ≈ $0.63
 10M rows × 1 KB per row = 10 GB scanned ≈ $0.06
Savings: 90%!
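One caveat worth knowing: pruning only works when the filter references the partition column directly. Wrapping it in a function or expression can force a full scan. A sketch of the common gotcha:

-- ✅ Prunes: direct comparison on the partition column
SELECT SUM(amount) FROM fact_sales
WHERE sale_date >= '2024-01-01';

-- ❌ May not prune: the partition column is wrapped in a function,
-- so the engine can't map the predicate to specific partitions
SELECT SUM(amount) FROM fact_sales
WHERE CAST(sale_date AS STRING) >= '2024-01-01';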

Bottleneck 2: Expensive Joins

-- ❌ BAD: Join everything, then filter.
-- All 2M fact rows are matched against dim_product before the
-- date and category predicates throw most of the result away.
SELECT f.order_id, p.product_name, f.amount
FROM fact_sales f
INNER JOIN dim_product p ON f.product_key = p.product_key
WHERE f.sale_date = '2024-01-15'
  AND p.category = 'Electronics';

-- ✅ GOOD: Filter first, then join.
-- Pre-filtering fact_sales to one day (~100K rows) means the
-- join touches a fraction of the data.
SELECT f.order_id, p.product_name, f.amount
FROM (
  SELECT order_id, product_key, amount
  FROM fact_sales
  WHERE sale_date = '2024-01-15'
) f
INNER JOIN dim_product p ON f.product_key = p.product_key
WHERE p.category = 'Electronics';

-- Note: modern optimizers usually push filters down automatically,
-- but writing the filter close to the source makes the intent
-- explicit and protects you when the optimizer can't.
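The shape of the join matters too: when one side is small, the engine can broadcast it to every worker instead of shuffling both sides (the "broadcast join" from the table above). Filtering the dimension before the join keeps it small; a sketch, assuming dim_product is the smaller table:

-- Shrink the dimension first; a small filtered dim_product is a
-- natural broadcast-join candidate, avoiding a shuffle of fact_sales
SELECT f.order_id, p.product_name, f.amount
FROM fact_sales f
INNER JOIN (
  SELECT product_key, product_name
  FROM dim_product
  WHERE category = 'Electronics'
) p ON f.product_key = p.product_key
WHERE f.sale_date = '2024-01-15';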

Bottleneck 3: Expensive Aggregations

-- ❌ BAD: High-cardinality grouping. 1M rows collapse into
-- ~1M groups, so the aggregate saves almost nothing.
SELECT customer_id, SUM(amount) as total
FROM fact_sales
GROUP BY customer_id;  -- Creates ~1M groups (one per customer)

-- ✅ GOOD: Low-cardinality grouping. 1M rows collapse into ~100 groups.
SELECT product_category, SUM(amount) as total
FROM fact_sales
GROUP BY product_category;  -- Only ~100 groups

-- Or: Use pre-aggregated table
SELECT *
FROM gold_customer_daily_summary  -- Pre-computed!
WHERE date = '2024-01-15';
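The gold_customer_daily_summary table referenced above isn't defined in this section; a minimal sketch of how such a table might be built (the columns are illustrative):

-- Illustrative definition of the pre-aggregated table used above
CREATE TABLE gold_customer_daily_summary AS
SELECT
  DATE(sale_date) as date,
  customer_id,
  SUM(amount) as total_amount,
  COUNT(*) as num_transactions
FROM fact_sales
GROUP BY DATE(sale_date), customer_id;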

Optimization Techniques

Technique 1: Partitioning Strategy

-- Partitioned table by sale_date
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT * FROM staging_sales;

-- Query: Only scans Jan 2024 partition
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Partition benefits (illustrative numbers):
-- ✅ Reduces data scanned (full table → matching partitions only)
-- ✅ Faster queries (10 seconds → 1 second)
-- ✅ Lower cost ($100 → $20)
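If partitioning is your main cost control, it can be worth enforcing it. BigQuery tables support a require_partition_filter option that rejects queries lacking a predicate on the partition column; a sketch:

-- Reject any query that doesn't filter on the partition column
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
OPTIONS (require_partition_filter = TRUE)
AS SELECT * FROM staging_sales;

-- This now errors instead of scanning the whole table:
-- SELECT SUM(amount) FROM fact_sales;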

Technique 2: Clustering Strategy

-- Clustered by frequently-filtered columns
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT * FROM staging_sales;

-- Query: Skips customer_key blocks automatically
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND customer_key = 12345;

-- Benefits:
-- ✅ Row groups are ordered by cluster columns
-- ✅ Can skip entire blocks that don't match
-- ✅ 2-10x faster for filtered queries
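One subtlety: cluster column order matters. Data is sorted by the first cluster column, then the second, so filters on the leading column benefit most. A sketch, given CLUSTER BY customer_key, product_key (the product_key value is illustrative):

-- Benefits most: filters on the leading cluster column
SELECT SUM(amount) FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND customer_key = 12345;

-- Benefits less: skips the leading column, so block
-- pruning on product_key alone is much weaker
SELECT SUM(amount) FROM fact_sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'
  AND product_key = 678;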

Technique 3: Materialized Views / Aggregations

-- Pre-compute daily summary
CREATE TABLE gold_daily_summary AS
SELECT
  DATE(sale_date) as sale_date,
  product_category,
  SUM(amount) as daily_revenue,
  COUNT(*) as num_transactions
FROM fact_sales
GROUP BY DATE(sale_date), product_category;

-- Query the aggregated table instead (20x faster here)
SELECT sale_date, product_category, daily_revenue
FROM gold_daily_summary
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Before: 1M rows aggregated = 2 seconds
-- After: 100 rows read from pre-aggregated = 0.1 seconds
-- Speedup: 20x
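The CREATE TABLE above is a manual snapshot that you must refresh yourself. BigQuery also offers true materialized views, which stay in sync with the base table automatically (with some restrictions on the queries they can contain); a sketch (the view name is illustrative):

-- Auto-refreshing alternative to the manual summary table
CREATE MATERIALIZED VIEW gold_daily_summary_mv AS
SELECT
  DATE(sale_date) as sale_date,
  product_category,
  SUM(amount) as daily_revenue,
  COUNT(*) as num_transactions
FROM fact_sales
GROUP BY sale_date, product_category;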

Technique 4: Approximate Queries

-- Exact query (scans all data)
SELECT COUNT(DISTINCT customer_id)
FROM fact_sales
WHERE sale_date >= '2024-01-01';  -- 5 seconds, 10 GB scanned

-- Approximate query (fast, good enough)
SELECT APPROX_COUNT_DISTINCT(customer_id)
FROM fact_sales
WHERE sale_date >= '2024-01-01';  -- 0.1 seconds, 100 MB scanned

-- Result: ~1,000,000 (exact: 1,000,050)
-- Accuracy: 99.995%, with ~50x speedup and 100x less data scanned
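APPROX_COUNT_DISTINCT recomputes from raw rows on every query. When the same distinct count feeds many queries, BigQuery's HLL_COUNT functions let you pre-aggregate a HyperLogLog sketch per day and merge sketches later, so re-counting never rescans the fact table; a sketch of the pattern (the table name is illustrative):

-- Step 1: store one HyperLogLog sketch per day
CREATE TABLE daily_customer_sketches AS
SELECT
  DATE(sale_date) as sale_date,
  HLL_COUNT.INIT(customer_id) as customer_sketch
FROM fact_sales
GROUP BY DATE(sale_date);

-- Step 2: merge sketches to estimate distinct customers over any range
SELECT HLL_COUNT.MERGE(customer_sketch) as approx_distinct_customers
FROM daily_customer_sketches
WHERE sale_date >= '2024-01-01';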

Performance Analysis Checklist

Question                          How to Check
Is query partitioned correctly?   EXPLAIN shows partition pruning?
Are joins efficient?              EXPLAIN shows broadcast join?
Is data aggregated sufficiently?  GROUP BY has few groups (< 100K)?
Are indexes being used?           EXPLAIN shows index scan, not full table scan?
Is data cached?                   Same query runs faster second time?
Are rows filtered early?          Most filtering in WHERE, not in subqueries?
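Several of these checks can be automated. In BigQuery, the INFORMATION_SCHEMA.JOBS views expose per-query statistics such as bytes processed and whether the result came from cache; a sketch (the region qualifier and lookback window are assumptions):

-- Recent queries ranked by scan size, with cache behavior
SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms,
  cache_hit
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;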

Real-World Example: Optimizing a Slow Dashboard Query

Original query (45 seconds, $2.50 cost):

SELECT
  f.sale_date,
  p.product_name,
  c.customer_name,
  SUM(f.amount) as revenue
FROM fact_sales f
INNER JOIN dim_product p ON f.product_key = p.product_key
INNER JOIN dim_customer c ON f.customer_key = c.customer_key
INNER JOIN dim_store s ON f.store_key = s.store_key  -- joined but never used!
WHERE f.sale_date >= '2024-01-01'
GROUP BY f.sale_date, p.product_name, c.customer_name;

Optimization 1: Use a pre-aggregated table (5 seconds, $0.25):

SELECT
  sale_date,
  product_name,
  SUM(daily_revenue) as revenue
FROM gold_daily_revenue
WHERE sale_date >= '2024-01-01'
GROUP BY sale_date, product_name;

Optimization 2: Reduce dimensions (2 seconds, $0.10):

SELECT
  DATE(sale_date) as sale_date,
  product_category,
  SUM(amount) as revenue
FROM fact_sales
WHERE sale_date >= '2024-01-01'
GROUP BY DATE(sale_date), product_category;

Result: 45 seconds → 2 seconds (22x faster), $2.50 → $0.10 (25x cheaper)


Key Takeaways

  1. Read execution plans to identify bottlenecks
  2. Partition tables by common filters (date, region)
  3. Cluster tables by frequently-filtered columns
  4. Use pre-aggregations for common queries
  5. Use approximate functions when exact answers aren't needed
  6. Test performance with realistic data volumes