Deep Dive: BigQuery Architecture and Optimization
Level: Intermediate to Advanced | Pre-reading: 08 · Tools Ecosystem | Time to read: 15 min
BigQuery Architecture Deep Dive
Columnar Storage: Capacitor and Dremel
BigQuery stores data in Capacitor, a compressed columnar format, and executes queries with the Dremel engine. Columnar storage typically compresses data 10-100x compared to row-oriented databases.
How columnar compression works:
```text
Column: amount (NUMERIC)
Values: [99.99, 149.99, 199.99, 99.99, 149.99, ...]

Before compression:
  99.99  (8 bytes)
  149.99 (8 bytes)
  → 1M rows × 8 bytes = 8 MB

After compression (dictionary + run-length encoding):
  Dictionary: {99.99, 149.99, 199.99} → values stored as indices [0, 1, 2, 0, 1, 2, ...]
  Dictionary: 3 values × 8 bytes = 24 bytes
  Indices:    1M × 1 byte = 1 MB
  → Total: ~1 MB (8x compression!)
```
Real-world: 10-100x compression, depending on data distribution (repetition, cardinality, sort order)
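The dictionary-encoding step above can be sketched in a few lines of Python. This is a toy model of what columnar formats like Capacitor do, not BigQuery's actual implementation:

```python
def dictionary_encode(values):
    """Replace each value with an index into a small dictionary of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

# 1M rows cycling through 3 distinct prices, as in the example above
column = [[99.99, 149.99, 199.99][i % 3] for i in range(1_000_000)]
dictionary, indices = dictionary_encode(column)

raw_bytes = len(column) * 8                              # 8 bytes per raw value
encoded_bytes = len(dictionary) * 8 + len(indices) * 1   # 1-byte indices suffice here
# raw_bytes / encoded_bytes ≈ 8, matching the 8x figure above
```

Real engines layer run-length encoding on top of the dictionary indices, which is where the 10-100x figures come from.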
Pricing: The $6.25/TB Model
On-demand pricing charges per terabyte of data scanned:
```text
Query 1: SELECT * FROM fact_sales              (1 TB table)
  Scanned: 1 TB (every column is read, including ones you never use)
  Cost: 1 × $6.25 = $6.25

Query 2 (optimized): SELECT SUM(amount) FROM fact_sales
  Scanned: 0.1 TB (columnar storage reads only the amount column)
  Cost: 0.1 × $6.25 = $0.625 (10x cheaper!)
```
Lesson: SELECT * is very expensive. Always select only the columns you need.
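The cost arithmetic is simple enough to wrap in a helper. The $6.25/TB figure is the on-demand rate cited above; check current pricing before relying on it:

```python
ON_DEMAND_PER_TB = 6.25  # USD per TB scanned (on-demand rate from the text)

def query_cost(bytes_scanned: float) -> float:
    """Estimated on-demand cost in USD for a query scanning this many bytes."""
    return bytes_scanned / 1024**4 * ON_DEMAND_PER_TB

full_scan = query_cost(1.0 * 1024**4)    # SELECT *: reads the whole 1 TB table
one_column = query_cost(0.1 * 1024**4)   # SUM(amount): reads only one column
# full_scan = 6.25, one_column ≈ 0.625
```

In practice, run the query as a dry run first to see how many bytes it would scan before paying for it.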
Best Practices for BigQuery
1. Partition by Date (Most Important)
```sql
-- ❌ BAD: Full table scan (table is not partitioned)
SELECT SUM(revenue) FROM fact_sales
WHERE DATE(sale_date) = '2024-12-25';
-- BigQuery scans ALL data (5 years = 50 TB)
-- Cost: 50 × $6.25 = $312.50
-- Time: ~30 seconds

-- ✅ GOOD: Partitioned table
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
AS SELECT * FROM raw_sales;

SELECT SUM(revenue) FROM fact_sales
WHERE DATE(sale_date) = '2024-12-25';
-- BigQuery prunes to that date's partition (10 GB)
-- Cost: 0.01 × $6.25 = $0.06 (5000x cheaper!)
-- Time: ~0.5 seconds
```
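Why pruning helps can be modeled with a toy table stored as one chunk per date. The layout and numbers below are made up for illustration:

```python
from collections import defaultdict

# Toy layout: one list of rows per date "partition" (hypothetical data)
partitions = defaultdict(list)
for day in range(20, 30):
    partitions[f"2024-12-{day}"] = [{"revenue": 100.0}] * 1_000

def sum_revenue(target_date):
    """A date filter touches only one partition; everything else is skipped."""
    rows = partitions[target_date]
    return sum(r["revenue"] for r in rows), len(rows)

total, rows_read = sum_revenue("2024-12-25")
# Reads 1,000 rows instead of the 10,000 an unpartitioned scan would touch
```

An unpartitioned table is one big chunk, so the same filter would still read every row.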
2. Cluster on Foreign Keys
```sql
-- ❌ WITHOUT clustering: slow JOIN
SELECT f.amount, c.customer_name
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE f.sale_date >= '2024-01-01';
-- BigQuery scans fact (100 GB) + full dim table, with random seeks

-- ✅ WITH clustering: fast JOIN
CREATE TABLE fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_key, product_key
AS SELECT * FROM raw_sales;
-- Data co-located by key, sequential reads
-- Speed: typically 2-5x faster
```
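Clustering is, roughly, keeping rows sorted by the cluster keys so all rows for one key sit together. A toy sketch of that effect (data is made up):

```python
# Unsorted rows: rows for the same customer_key are scattered
rows = [{"customer_key": k % 5, "amount": float(k)} for k in range(20)]

# "Clustered" storage: rows sorted by the cluster key
clustered = sorted(rows, key=lambda r: r["customer_key"])

# All rows for customer_key == 2 now form one contiguous run,
# so reading them is a sequential range scan instead of random seeks
positions = [i for i, r in enumerate(clustered) if r["customer_key"] == 2]
```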
3. Materialized Views for Expensive Aggregations
```sql
-- ❌ SLOW: Aggregate on the fly
SELECT
  DATE_TRUNC(sale_date, MONTH) AS month,
  product_category,
  SUM(amount) AS revenue
FROM fact_sales
GROUP BY 1, 2;
-- Runs every time! Full table scan (50 TB, ~30 seconds)

-- ✅ FAST: Pre-aggregate
CREATE MATERIALIZED VIEW mv_monthly_revenue
PARTITION BY month
CLUSTER BY product_category AS
SELECT
  DATE_TRUNC(sale_date, MONTH) AS month,
  product_category,
  SUM(amount) AS revenue,
  COUNT(*) AS num_transactions
FROM fact_sales
GROUP BY 1, 2;

-- BigQuery refreshes materialized views automatically; to force a refresh:
-- CALL BQ.REFRESH_MATERIALIZED_VIEW('my_dataset.mv_monthly_revenue');

SELECT * FROM mv_monthly_revenue
WHERE month = '2024-12-01';
-- Near-instant: pre-aggregated and pruned to one partition
```
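The core idea — aggregate once at refresh time, then answer queries with lookups — can be sketched in Python. Table contents below are made up:

```python
from collections import defaultdict

fact_sales = [  # hypothetical fact rows
    {"month": "2024-12-01", "category": "toys",  "amount": 10.0},
    {"month": "2024-12-01", "category": "toys",  "amount": 5.0},
    {"month": "2024-12-01", "category": "books", "amount": 7.0},
]

# "Refresh": one full pass over the fact table builds the aggregate
mv = defaultdict(lambda: {"revenue": 0.0, "num_transactions": 0})
for row in fact_sales:
    key = (row["month"], row["category"])
    mv[key]["revenue"] += row["amount"]
    mv[key]["num_transactions"] += 1

# Later queries are lookups against the small aggregate, not table scans
toys = mv[("2024-12-01", "toys")]
```

The trade-off is staleness between refreshes versus paying for a full scan on every query.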
Cost Optimization: Real Numbers
| Optimization | Impact | Effort |
|---|---|---|
| Column selection (SELECT only needed) | 10-50x | Low |
| Partition pruning (WHERE on partition col) | 10-100x | Low |
| Clustering (on FK) | 2-5x | Medium |
| Materialized views | 10-100x (for that query) | Medium |
| Reserved slots (annual) | 40% cost reduction | High |
ROI Example:
- Current: $100K/month on-demand BigQuery
- Apply partitioning: 50% reduction → $50K/month
- Add MV for heavy queries: 20% more reduction → $40K/month
- Reserve slots: 40% reduction → $24K/month
- Savings: $76K/month ($912K/year)
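Note that the reductions above compound multiplicatively (each applies to the already-reduced bill), which is how $100K/month becomes $24K/month:

```python
monthly = 100_000  # starting on-demand spend, USD/month (from the example above)
for name, pct_reduction in [("partitioning", 50),
                            ("materialized views", 20),
                            ("reserved slots", 40)]:
    monthly = monthly * (100 - pct_reduction) // 100  # each cut applies to the remainder

savings_per_month = 100_000 - monthly
savings_per_year = savings_per_month * 12
# monthly = 24000, savings_per_month = 76000, savings_per_year = 912000
```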
Nested/JSON Capabilities
BigQuery uniquely supports nested data:
```sql
-- Store nested customer addresses
CREATE TABLE customers (
  customer_id INT64,
  name STRING,
  addresses ARRAY<STRUCT<
    street STRING,
    city STRING,
    zip STRING,
    is_primary BOOL
  >>
);

-- Query nested data directly (no JOIN needed!)
SELECT
  name,
  addr.city
FROM customers,
UNNEST(addresses) AS addr
WHERE addr.is_primary = TRUE;
```
Advantages over MongoDB / PostgreSQL JSONB:
✅ Full SQL support
✅ Columnar compression (much smaller footprint)
✅ Parallel query execution
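What UNNEST does can be modeled in plain Python: it cross-joins each row with the elements of its own array, producing one output row per element. The sample records are made up:

```python
customers = [  # hypothetical rows matching the schema above
    {"name": "Ada", "addresses": [
        {"city": "London", "is_primary": True},
        {"city": "Paris",  "is_primary": False},
    ]},
]

# SELECT name, addr.city FROM customers, UNNEST(addresses) AS addr
# WHERE addr.is_primary = TRUE;
result = [
    (c["name"], addr["city"])
    for c in customers
    for addr in c["addresses"]   # UNNEST: one row per array element
    if addr["is_primary"]
]
# result = [('Ada', 'London')]
```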
When BigQuery Isn't Ideal
❌ Real-time OLTP — streaming inserts are append-oriented and newly streamed data can take seconds to become queryable; no low-latency point lookups or updates
❌ Complex transactions — multi-statement transactions exist but are limited; BigQuery is not built for OLTP-style concurrent writes
❌ Spatial queries — GEOGRAPHY functions exist, but GIS support is limited compared to PostGIS
❌ Very small workloads with tight cost control (<10 GB/month) — PostgreSQL may be cheaper
Key Takeaways
✅ Partition by date — 10-100x cost reduction
✅ Cluster on foreign keys — 2-5x speed for JOINs
✅ Materialized views — Pre-aggregate expensive queries
✅ Reserved slots — 40% cost savings with annual commit
✅ Nested data — Unique advantage vs. competitors