5.02 · Deep Dive: Schema Design & Advanced Dimensional Modeling
Level: Advanced Time to read: 20 min Pre-reading: 05 · Data Modeling · 5.01 · Dimensional Modeling After reading: You'll understand advanced schema design patterns: fact table grains, degenerate dimensions, bridge tables, and real-world design tradeoffs.
The Schema Design Challenge
Given: A business requirement for sales analytics
Requirements:
- Analyze sales by product, store, customer, time, promotion
- Support "drill-down" from year → month → day
- Support "what-if" analysis (what if we changed price?)
- Track historical changes for audit
- Query performance must be < 5 seconds
Question: How to structure fact and dimension tables?
Advanced Pattern 1: Fact Table Grain
Definition: Grain = level of detail of a fact table (the row count).
Anti-Pattern: Mixed Grain
-- ❌ BAD: Mixed grain (some rows are daily, some are monthly)
CREATE TABLE fact_sales_bad (
sale_id INT,
date DATE,
amount DECIMAL,
grain STRING -- "daily" or "monthly"?
);
-- Query: SUM(amount) GROUP BY date
-- Result: WRONG! Some dates have multiple rows (daily), others have one (monthly)
Pattern: Explicit Single Grain
-- ✅ GOOD: Atomic grain (one row per sale transaction)
CREATE TABLE fact_sales_atomic (
fact_id INT,
order_id INT,
order_line_item INT,
transaction_date DATE,
customer_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL,
extended_price DECIMAL,
CONSTRAINT pk_fact_sales PRIMARY KEY (order_id, order_line_item)
);
-- Grain: One row per line item (atomic/most granular)
-- ✅ GOOD: Aggregate grain (for performance)
CREATE TABLE fact_sales_daily (
transaction_date DATE,
customer_id INT,
product_id INT,
num_transactions INT,
total_quantity INT,
total_extended_price DECIMAL,
CONSTRAINT pk_fact_daily PRIMARY KEY (transaction_date, customer_id, product_id)
);
-- Grain: One row per customer-product-day (summarized)
-- Query using atomic table
SELECT
dd.year,
dp.category,
SUM(f.extended_price) as revenue
FROM fact_sales_atomic f
JOIN dim_date dd ON f.transaction_date = dd.date
JOIN dim_product dp ON f.product_id = dp.product_id
GROUP BY dd.year, dp.category;
-- Query using aggregate table (faster, but less flexible)
SELECT
dd.year,
dp.category,
SUM(f.total_extended_price) as revenue
FROM fact_sales_daily f
JOIN dim_date dd ON f.transaction_date = dd.date
JOIN dim_product dp ON f.product_id = dp.product_id
GROUP BY dd.year, dp.category;
Grain Selection Matrix
| Grain | Pros | Cons | Use Case |
|---|---|---|---|
| Atomic (line item) | Most flexible, supports ad-hoc queries | Largest (most rows) | Main analytical warehouse |
| Daily (date + dimensions) | Balance of flexibility and size | Less flexible | BI dashboards (pre-aggregated) |
| Monthly (month + dimensions) | Smallest, fastest queries | Least flexible | Executive reports |
Advanced Pattern 2: Degenerate Dimensions
Definition: Attributes specific to a transaction (not a reusable dimension).
-- Example: Order details that don't need a dimension table
CREATE TABLE fact_sales (
order_id INT, -- Degenerate dimension
order_line_item INT, -- Degenerate dimension
order_date DATE,
-- Foreign keys to dimension tables
customer_key INT,
product_key INT,
-- Measures
quantity INT,
unit_price DECIMAL,
-- Degenerate dimensions (transaction-specific, not dimensions)
order_priority VARCHAR(10), -- 'HIGH', 'MEDIUM', 'LOW'
order_status VARCHAR(20), -- 'COMPLETED', 'CANCELLED'
discount_code VARCHAR(50), -- 'HOLIDAY2024'
shipping_method VARCHAR(20) -- 'STANDARD', 'EXPEDITED'
);
Why degenerate?
- Attributes are specific to the transaction (order), not shared across transactions
- Don't need a separate dimension table
- No need to join (already in fact table)
- Reduce query joins
Advanced Pattern 3: Bridge Tables (Many-to-Many)
Problem: A product can belong to multiple categories (many-to-many relationship).
Product → Category
└─ "Laptop Pro" belongs to both "Electronics" AND "Computers"
Standard JOIN would create:
Laptop | Electronics
Laptop | Computers
Result: Duplicated fact rows!
Solution: Bridge Table
-- Bridge table: Maps products to multiple categories
CREATE TABLE dim_product_category_bridge (
product_key INT,
category_key INT,
is_primary BOOLEAN, -- Mark primary category
CONSTRAINT pk_bridge PRIMARY KEY (product_key, category_key)
);
-- Dimension: Products (without category FK)
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id INT,
product_name STRING,
-- NO category_key (avoids 1-to-many issue)
);
-- Dimension: Categories
CREATE TABLE dim_category (
category_key INT PRIMARY KEY,
category_id INT,
category_name STRING
);
-- Fact: Sales (no category, keeps grain clean)
CREATE TABLE fact_sales (
fact_id INT PRIMARY KEY,
product_key INT,
customer_key INT,
amount DECIMAL
);
-- Query: Sales by category (handles many-to-many correctly)
SELECT
dc.category_name,
SUM(f.amount) as revenue
FROM fact_sales f
JOIN dim_product dp ON f.product_key = dp.product_key
JOIN dim_product_category_bridge bridge ON dp.product_key = bridge.product_key
JOIN dim_category dc ON bridge.category_key = dc.category_key
WHERE bridge.is_primary = TRUE -- Avoid double-counting
GROUP BY dc.category_name;
Advanced Pattern 4: Conformed Dimensions
Definition: The same dimension definition used consistently across multiple fact tables.
Anti-Pattern: Divergent Dimensions
-- ❌ BAD: Multiple different customer dimensions
CREATE TABLE dim_customer_sales (
customer_key INT,
customer_id INT,
name STRING,
city STRING
);
CREATE TABLE dim_customer_marketing (
customer_key INT,
customer_id INT,
full_name STRING, -- Different naming!
city STRING,
segment STRING -- Not in sales dim!
);
-- Query across both fact tables: CONFUSING!
-- Which customer dimension to use?
-- customer_id '123' might have different definitions
Pattern: Single Conformed Dimension
-- ✅ GOOD: Single source of truth
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id INT,
full_name STRING,
email STRING,
city STRING,
state STRING,
segment STRING,
lifetime_value_bucket STRING,
effective_date DATE,
end_date DATE,
is_current BOOLEAN
);
-- Both fact tables use same dimension
CREATE TABLE fact_sales (
fact_id INT,
customer_key INT,
product_key INT,
amount DECIMAL
);
CREATE TABLE fact_marketing_spend (
fact_id INT,
customer_key INT,
campaign_key INT,
spend DECIMAL
);
-- Consistent queries
SELECT
dc.segment,
SUM(fs.amount) as sales,
SUM(fm.spend) as marketing_spend
FROM fact_sales fs
JOIN fact_marketing_spend fm
ON fs.customer_key = fm.customer_key
JOIN dim_customer dc ON fs.customer_key = dc.customer_key
GROUP BY dc.segment;
Advanced Pattern 5: Role-Playing Dimensions
Definition: Same dimension table used in multiple roles (via different foreign keys).
-- Dimension: Date (single table)
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
date_actual DATE,
year INT,
month INT,
day_of_month INT
);
-- Fact: Order (uses date in multiple roles)
CREATE TABLE fact_orders (
order_key INT,
-- Role 1: When order was placed
order_date_key INT,
-- Role 2: When order was shipped
ship_date_key INT,
-- Role 3: When order was delivered
delivery_date_key INT,
customer_key INT,
amount DECIMAL,
CONSTRAINT fk_order_date FOREIGN KEY (order_date_key) REFERENCES dim_date(date_key),
CONSTRAINT fk_ship_date FOREIGN KEY (ship_date_key) REFERENCES dim_date(date_key),
CONSTRAINT fk_delivery_date FOREIGN KEY (delivery_date_key) REFERENCES dim_date(date_key)
);
-- Query: Orders delivered late
SELECT
f.order_key,
do.date_actual as order_date,
ds.date_actual as ship_date,
dd.date_actual as delivery_date,
dd.date_actual - ds.date_actual as delivery_delay_days
FROM fact_orders f
JOIN dim_date do ON f.order_date_key = do.date_key
JOIN dim_date ds ON f.ship_date_key = ds.date_key
JOIN dim_date dd ON f.delivery_date_key = dd.date_key
WHERE dd.date_actual - ds.date_actual > 7; -- Delivered > 7 days late
Advanced Pattern 6: Slowly Changing Dimensions - Hybrid Approach
Problem: Track change history while maintaining query simplicity.
Pattern: SCD Type 2 with Current Flag (Best Practice)
CREATE TABLE dim_customer_scd2 (
customer_key INT PRIMARY KEY,
customer_id INT,
name STRING,
address STRING,
city STRING,
-- SCD Type 2 tracking
effective_date DATE,
end_date DATE,
is_current BOOLEAN,
change_reason STRING, -- Why did it change?
-- Lineage
_dbt_updated_at TIMESTAMP,
_version_number INT
);
-- Example data
INSERT INTO dim_customer_scd2 VALUES
(1, 100, 'John Doe', '123 Old St', 'NYC', '2024-01-01', NULL, TRUE, 'Initial', NOW(), 1),
(2, 100, 'John Doe', '456 New St', 'Boston', '2024-06-01', NULL, TRUE, 'Moved', NOW(), 2);
-- Query as of date (snapshot)
SELECT * FROM dim_customer_scd2
WHERE is_current = TRUE; -- Latest version
-- Historical analysis
SELECT
customer_id,
address,
city,
effective_date,
end_date
FROM dim_customer_scd2
WHERE customer_id = 100
ORDER BY effective_date;
Performance Optimization via Schema Design
Pattern: Denormalization for Speed
-- Rather than joining 5 tables every query:
-- ❌ SELECT d.year, p.category, SUM(f.amount)
-- FROM fact f
-- JOIN dim_date d, dim_product p, dim_customer c, dim_store s
-- WHERE ...
-- ✅ Denormalize common attributes into fact:
CREATE TABLE fact_sales_denormalized (
sale_id INT,
date DATE,
year INT, -- Denormalized from dim_date
month INT,
product_category STRING, -- Denormalized from dim_product
product_name STRING,
customer_segment STRING, -- Denormalized from dim_customer
store_region STRING, -- Denormalized from dim_store
amount DECIMAL
);
-- Query is now simpler and faster
SELECT year, product_category, SUM(amount)
FROM fact_sales_denormalized
GROUP BY year, product_category;
Pattern: Surrogate Keys for Performance
-- Natural keys are semantic (business meaning)
CREATE TABLE dim_product_natural (
product_code STRING, -- Business key (e.g., "SKU-12345")
product_name STRING,
category STRING,
price DECIMAL
);
-- Surrogate keys are numeric (faster joins)
CREATE TABLE dim_product_surrogate (
product_key INT PRIMARY KEY, -- Auto-increment
product_code STRING, -- Natural key (for traceability)
product_name STRING,
category STRING,
price DECIMAL
);
-- Fact table uses surrogate keys
CREATE TABLE fact_sales (
sale_id INT,
product_key INT, -- Numeric join is faster
customer_key INT,
amount DECIMAL
);
Schema Design Best Practices
✅ Keep fact tables atomic (most granular grain) ✅ Use surrogate keys (INT/BIGINT instead of STRING) ✅ Maintain conformed dimensions (single source of truth) ✅ Document grain explicitly (fact table comment) ✅ Use SCD Type 2 for historical tracking ✅ Test with actual data volumes (1B+ rows)
❌ Don't mix grains (atomic + aggregate in same table) ❌ Don't use natural keys in fact tables (too large, change over time) ❌ Don't create one dimension per fact (duplicates logic) ❌ Don't assume "right" schema (test with queries first)
Real-World Example: E-commerce Dimensional Model
Fact: fact_orders (grain: one row per order line item)
├─ Dimensions:
│ ├─ dim_customer (SCD Type 2)
│ ├─ dim_product (SCD Type 2)
│ ├─ dim_store (SCD Type 2)
│ ├─ dim_date (role: order_date, ship_date, delivery_date)
│ └─ dim_promotion (bridge table for many-to-many)
│
├─ Degenerate dimensions:
│ ├─ order_id
│ ├─ order_status
│ └─ payment_method
Fact: fact_marketing_spend (grain: one row per campaign-day)
├─ Dimensions:
│ ├─ dim_customer_segment (conformed)
│ ├─ dim_campaign
│ └─ dim_date
Key Takeaways
- Grain defines fact table — atomic vs. aggregate
- Degenerate dimensions avoid unnecessary foreign keys
- Bridge tables handle many-to-many relationships
- Conformed dimensions maintain consistency
- SCD Type 2 preserves history for accurate analysis
- Denormalization improves query performance