Deep Dive: BigQuery Architecture and Optimization

Level: Intermediate to Advanced
Pre-reading: 08 · Tools Ecosystem
Time to read: 15 min


BigQuery Architecture Deep Dive

Columnar Storage: Dremel Engine

BigQuery stores data in a columnar format (Capacitor, which grew out of the Dremel project) and executes queries with the Dremel engine. Columnar storage typically compresses data 10-100x compared to row-oriented databases.

How columnar compression works:

Column: amount (NUMERIC)
Values: [99.99, 149.99, 199.99, 99.99, 149.99, ...]

Before compression (assuming 8 bytes per value for illustration):
  99.99  (8 bytes)
  149.99 (8 bytes)
  → 1M rows × 8 bytes = 8 MB

After compression (RLE + Dictionary):
  {99.99, 149.99, 199.99} → [0, 1, 2, 0, 1, 2, ...]
  Dictionary: 3 values × 8 bytes = 24 bytes
  Indices: 1M × 1 byte = 1 MB
  → Total: ~1 MB (8x compression!)

Real-world: 10-100x compression depending on data distribution
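The dictionary-plus-index scheme above can be sketched in a few lines of Python. This is an illustrative model only, not BigQuery's actual Capacitor implementation; the 8-byte and 1-byte sizes are the same assumptions as in the example.

```python
# Sketch of dictionary encoding for a repetitive column (illustrative
# model, not BigQuery's real storage format).
def dictionary_encode(values):
    """Map each distinct value to a small integer index."""
    dictionary = {}
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    return list(dictionary), indices

# ~1M-row column cycling through 3 prices, as in the example above
column = [99.99, 149.99, 199.99] * 333_334

dictionary, indices = dictionary_encode(column)

raw_bytes = len(column) * 8                              # 8 B per value
encoded_bytes = len(dictionary) * 8 + len(indices) * 1   # 1 B per index
print(f"raw: {raw_bytes:,} B, encoded: {encoded_bytes:,} B, "
      f"ratio: {raw_bytes / encoded_bytes:.1f}x")
```

With only 3 distinct values the encoded column shrinks roughly 8x, matching the arithmetic above; real ratios depend on cardinality and run lengths.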

Pricing: The $6.25/TB Model

Every query is billed by the bytes it scans:

Query 1: SELECT * FROM fact_sales (1 TB table)
  Scanned: 1 TB (every column is read, even if you only need one)
  Cost: 1 × $6.25 = $6.25

Query 2 (Optimized): SELECT SUM(amount) FROM fact_sales
  Scanned: 0.1 TB (columnar storage reads only the amount column)
  Cost: 0.1 × $6.25 = $0.625  (10x cheaper!)

Lesson: SELECT * is very expensive. Always select only the columns you need.
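The on-demand pricing model above is simple enough to express directly. A minimal sketch, assuming the $6.25/TB rate used throughout this doc:

```python
# On-demand pricing: cost is driven purely by bytes scanned, so reading
# one column out of ten scans ~1/10 the data.
PRICE_PER_TB = 6.25  # on-demand rate assumed in this doc

def query_cost(tb_scanned, price_per_tb=PRICE_PER_TB):
    """Cost in dollars for a query that scans tb_scanned terabytes."""
    return tb_scanned * price_per_tb

print(query_cost(1.0))   # SELECT * over a 1 TB table
print(query_cost(0.1))   # SELECT SUM(amount): only that column
```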

Best Practices for BigQuery

1. Partition by Date (Most Important)

-- ❌ BAD: Full table scan
SELECT SUM(revenue) FROM fact_sales 
WHERE DATE(sale_date) = '2024-12-25';

-- BigQuery scans ALL data (5 years = 50 TB)
-- Cost: 50 × $6.25 = $312.50
-- Time: 30 seconds

-- ✅ GOOD: Partitioned table
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
AS SELECT * FROM raw_sales;

SELECT SUM(revenue) FROM fact_sales 
WHERE sale_date = '2024-12-25';

-- BigQuery scans only that date (10 GB)
-- Cost: 0.01 × $6.25 = $0.06  (5000x cheaper!)
-- Time: 0.5 seconds
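Partition pruning can be modeled as a lookup instead of a scan. A toy sketch, assuming one 10 GB partition per day (the numbers mirror the example above, not real BigQuery internals):

```python
# Toy model of partition pruning: a filter on the partition column lets
# the engine read one partition instead of all of them.
partitions = {f"2024-12-{d:02d}": 10 for d in range(1, 32)}  # 10 GB/day

def gb_scanned(where_date=None):
    """Without a partition filter, every partition must be read."""
    if where_date is None:
        return sum(partitions.values())   # full scan
    return partitions.get(where_date, 0)  # pruned to one partition

print(gb_scanned())              # all of December: 310 GB
print(gb_scanned("2024-12-25"))  # one partition: 10 GB
```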

2. Cluster on Foreign Keys

-- ❌ WITHOUT clustering: JOIN slow
SELECT f.amount, c.customer_name
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
  WHERE f.sale_date >= '2024-01-01';

-- BigQuery scans fact (100 GB) + dim (full), random seeks

-- ✅ WITH clustering: JOIN fast
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT * FROM raw_sales;

-- BigQuery: Data co-located, sequential reads
-- Speed: 2-5x faster
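Why clustering helps can be sketched with a toy block-pruning model: when rows are sorted by customer_key, each storage block covers a narrow [min, max] key range, so blocks that cannot contain the key are skipped. This is illustrative only, not BigQuery's real block format.

```python
import random

def matching_blocks(keys, block_size, target):
    """Count blocks whose min/max range could contain target."""
    blocks = [keys[i:i + block_size]
              for i in range(0, len(keys), block_size)]
    return sum(1 for b in blocks if min(b) <= target <= max(b))

clustered = list(range(1000))                  # sorted by customer_key
print(matching_blocks(clustered, 100, 437))    # 1: only block [400..499]

random.seed(0)
unclustered = clustered.copy()
random.shuffle(unclustered)                    # no ordering
print(matching_blocks(unclustered, 100, 437))  # usually all 10 blocks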

3. Materialized Views for Expensive Aggregations

-- ❌ SLOW: Aggregate on-the-fly
SELECT 
  DATE_TRUNC(sale_date, MONTH) AS month,
  product_category,
  SUM(amount) as revenue
FROM fact_sales
GROUP BY 1, 2;

-- Runs every time! Full table scan (50 TB, 30 seconds)

-- ✅ FAST: Pre-aggregate
CREATE MATERIALIZED VIEW mv_monthly_revenue
PARTITION BY month
CLUSTER BY product_category AS
SELECT 
  DATE_TRUNC(sale_date, MONTH) AS month,
  product_category,
  SUM(amount) as revenue,
  COUNT(*) as num_transactions
FROM fact_sales
GROUP BY 1, 2;

-- BigQuery refreshes materialized views automatically; to force a refresh:
-- CALL BQ.REFRESH_MATERIALIZED_VIEW('dataset.mv_monthly_revenue');

SELECT * FROM mv_monthly_revenue
WHERE month = '2024-12-01';

-- Instant! (Pre-aggregated, already sorted)
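The materialized-view trade-off (aggregate once, answer repeated queries from the small result) can be sketched in Python. The rows and categories below are made up for illustration:

```python
# Toy pre-aggregation: one pass over fact rows builds a (month, category)
# -> (revenue, count) table; queries then become lookups, not scans.
from collections import defaultdict

sales = [("2024-12", "toys", 99.99), ("2024-12", "toys", 149.99),
         ("2024-12", "books", 19.99), ("2024-11", "toys", 99.99)]

# "Refresh": scan the fact rows once.
mv = defaultdict(lambda: [0.0, 0])
for month, category, amount in sales:
    mv[(month, category)][0] += amount   # SUM(amount)
    mv[(month, category)][1] += 1        # COUNT(*)

# "Query": a lookup instead of a full scan.
revenue, n = mv[("2024-12", "toys")]
print(round(revenue, 2), n)
```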

Cost Optimization: Real Numbers

Optimization                                Impact                  Effort
--------------------------------------------------------------------------
Column selection (only needed columns)      10-50x                  Low
Partition pruning (WHERE on partition col)  10-100x                 Low
Clustering (on foreign keys)                2-5x                    Medium
Materialized views                          10-100x (per query)     Medium
Reserved slots (annual commit)              40% cost reduction      High

ROI Example:

  • Current: $100K/month on-demand BigQuery
  • Apply partitioning: 50% reduction → $50K/month
  • Add MV for heavy queries: 20% more reduction → $40K/month
  • Reserve slots: 40% reduction → $24K/month
  • Savings: $76K/month ($912K/year)
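The ROI chain above compounds multiplicatively, not additively. Spelled out step by step (the percentages are this doc's illustrative figures, not guaranteed results):

```python
# Each optimization applies to the bill left after the previous one.
monthly = 100_000
monthly *= 1 - 0.50   # partitioning        -> 50,000
monthly *= 1 - 0.20   # materialized views  -> 40,000
monthly *= 1 - 0.40   # reserved slots      -> 24,000

savings = 100_000 - monthly
print(f"${monthly:,.0f}/month, saving ${savings:,.0f}/month "
      f"(${savings * 12:,.0f}/year)")
```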

Nested/JSON Capabilities

BigQuery natively supports nested and repeated data as first-class columns:

-- Store nested customer addresses
CREATE TABLE customers (
  customer_id INT64,
  name STRING,
  addresses ARRAY<STRUCT<
    street STRING,
    city STRING,
    zip STRING,
    is_primary BOOL
  >>
);

-- Query nested data directly (no JOIN needed!)
SELECT 
  name,
  addr.city
FROM customers,
  UNNEST(addresses) as addr
WHERE addr.is_primary = TRUE;

-- Advantage over MongoDB/PostgreSQL JSONB:
--  ✅ Full SQL support
--  ✅ Columnar compression (much smaller)
--  ✅ Parallel query execution
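A rough Python analogue of the ARRAY<STRUCT<...>> layout and the UNNEST query above, using made-up customer names; each customer row carries its own address list, so no JOIN is needed:

```python
# Nested/repeated data: addresses live inside the customer record,
# mirroring the ARRAY<STRUCT<...>> column above.
customers = [
    {"customer_id": 1, "name": "Ada",
     "addresses": [
         {"city": "Zurich", "is_primary": True},
         {"city": "Geneva", "is_primary": False},
     ]},
    {"customer_id": 2, "name": "Bob",
     "addresses": [{"city": "Oslo", "is_primary": True}]},
]

# UNNEST-style flattening with a filter on the nested field:
primary_cities = [(c["name"], a["city"])
                  for c in customers
                  for a in c["addresses"]
                  if a["is_primary"]]
print(primary_cities)  # [('Ada', 'Zurich'), ('Bob', 'Oslo')]
```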

When BigQuery Isn't Ideal

  • Real-time OLTP — streaming inserts have 10-30s latency before data is queryable
  • Complex transactions — multi-statement transaction support is limited vs. OLTP databases
  • Spatial queries — GIS support is limited compared to PostGIS
  • Very tight cost control — below ~10 GB/month, PostgreSQL may be cheaper


Key Takeaways

  • Partition by date — 10-100x cost reduction
  • Cluster on foreign keys — 2-5x speed for JOINs
  • Materialized views — pre-aggregate expensive queries
  • Reserved slots — 40% cost savings with annual commit
  • Nested data — first-class support, a strong advantage vs. competitors