5.02 · Deep Dive: Schema Design & Advanced Dimensional Modeling

Level: Advanced Time to read: 20 min Pre-reading: 05 · Data Modeling · 5.01 · Dimensional Modeling After reading: You'll understand advanced schema design patterns: fact table grains, degenerate dimensions, bridge tables, and real-world design tradeoffs.


The Schema Design Challenge

Given: A business requirement for sales analytics
Requirements:

  • Analyze sales by product, store, customer, time, promotion
  • Support "drill-down" from year → month → day
  • Support "what-if" analysis (what if we changed price?)
  • Track historical changes for audit
  • Query performance must be < 5 seconds

Question: How to structure fact and dimension tables?


Advanced Pattern 1: Fact Table Grain

Definition: Grain = level of detail of a fact table (the row count).

Anti-Pattern: Mixed Grain

-- ❌ BAD: Mixed grain (some rows are daily, some are monthly)
CREATE TABLE fact_sales_bad (
  sale_id INT,
  date DATE,
  amount DECIMAL,
  grain STRING  -- "daily" or "monthly"?
);

-- Query: SUM(amount) GROUP BY date
-- Result: WRONG! Some dates have multiple rows (daily), others have one (monthly)

Pattern: Explicit Single Grain

-- ✅ GOOD: Atomic grain (one row per sale transaction)
CREATE TABLE fact_sales_atomic (
  fact_id INT,
  order_id INT,
  order_line_item INT,
  transaction_date DATE,
  customer_id INT,
  product_id INT,
  quantity INT,
  unit_price DECIMAL,
  extended_price DECIMAL,
  CONSTRAINT pk_fact_sales PRIMARY KEY (order_id, order_line_item)
);
-- Grain: One row per line item (atomic/most granular)

-- ✅ GOOD: Aggregate grain (for performance)
CREATE TABLE fact_sales_daily (
  transaction_date DATE,
  customer_id INT,
  product_id INT,
  num_transactions INT,
  total_quantity INT,
  total_extended_price DECIMAL,
  CONSTRAINT pk_fact_daily PRIMARY KEY (transaction_date, customer_id, product_id)
);
-- Grain: One row per customer-product-day (summarized)

-- Query using atomic table
SELECT
  dd.year,
  dp.category,
  SUM(f.extended_price) as revenue
FROM fact_sales_atomic f
JOIN dim_date dd ON f.transaction_date = dd.date
JOIN dim_product dp ON f.product_id = dp.product_id
GROUP BY dd.year, dp.category;

-- Query using aggregate table (faster, but less flexible)
SELECT
  dd.year,
  dp.category,
  SUM(f.total_extended_price) as revenue
FROM fact_sales_daily f
JOIN dim_date dd ON f.transaction_date = dd.date
JOIN dim_product dp ON f.product_id = dp.product_id
GROUP BY dd.year, dp.category;

Grain Selection Matrix

Grain Pros Cons Use Case
Atomic (line item) Most flexible, supports ad-hoc queries Largest (most rows) Main analytical warehouse
Daily (date + dimensions) Balance of flexibility and size Less flexible BI dashboards (pre-aggregated)
Monthly (month + dimensions) Smallest, fastest queries Least flexible Executive reports

Advanced Pattern 2: Degenerate Dimensions

Definition: Attributes specific to a transaction (not a reusable dimension).

-- Example: Order details that don't need a dimension table
CREATE TABLE fact_sales (
  order_id INT,                    -- Degenerate dimension
  order_line_item INT,             -- Degenerate dimension
  order_date DATE,

  -- Foreign keys to dimension tables
  customer_key INT,
  product_key INT,

  -- Measures
  quantity INT,
  unit_price DECIMAL,

  -- Degenerate dimensions (transaction-specific, not dimensions)
  order_priority VARCHAR(10),      -- 'HIGH', 'MEDIUM', 'LOW'
  order_status VARCHAR(20),        -- 'COMPLETED', 'CANCELLED'
  discount_code VARCHAR(50),       -- 'HOLIDAY2024'
  shipping_method VARCHAR(20)      -- 'STANDARD', 'EXPEDITED'
);

Why degenerate?

  • Attributes are specific to the transaction (order), not shared across transactions
  • Don't need a separate dimension table
  • No need to join (already in fact table)
  • Reduce query joins

Advanced Pattern 3: Bridge Tables (Many-to-Many)

Problem: A product can belong to multiple categories (many-to-many relationship).

Product → Category
  └─ "Laptop Pro" belongs to both "Electronics" AND "Computers"

Standard JOIN would create:
  Laptop  | Electronics
  Laptop  | Computers

Result: Duplicated fact rows!

Solution: Bridge Table

-- Bridge table: Maps products to multiple categories
CREATE TABLE dim_product_category_bridge (
  product_key INT,
  category_key INT,
  is_primary BOOLEAN,  -- Mark primary category
  CONSTRAINT pk_bridge PRIMARY KEY (product_key, category_key)
);

-- Dimension: Products (without category FK)
CREATE TABLE dim_product (
  product_key INT PRIMARY KEY,
  product_id INT,
  product_name STRING,
  -- NO category_key (avoids 1-to-many issue)
);

-- Dimension: Categories
CREATE TABLE dim_category (
  category_key INT PRIMARY KEY,
  category_id INT,
  category_name STRING
);

-- Fact: Sales (no category, keeps grain clean)
CREATE TABLE fact_sales (
  fact_id INT PRIMARY KEY,
  product_key INT,
  customer_key INT,
  amount DECIMAL
);

-- Query: Sales by category (handles many-to-many correctly)
SELECT
  dc.category_name,
  SUM(f.amount) as revenue
FROM fact_sales f
JOIN dim_product dp ON f.product_key = dp.product_key
JOIN dim_product_category_bridge bridge ON dp.product_key = bridge.product_key
JOIN dim_category dc ON bridge.category_key = dc.category_key
WHERE bridge.is_primary = TRUE  -- Avoid double-counting
GROUP BY dc.category_name;

Advanced Pattern 4: Conformed Dimensions

Definition: The same dimension definition used consistently across multiple fact tables.

Anti-Pattern: Divergent Dimensions

-- ❌ BAD: Multiple different customer dimensions
CREATE TABLE dim_customer_sales (
  customer_key INT,
  customer_id INT,
  name STRING,
  city STRING
);

CREATE TABLE dim_customer_marketing (
  customer_key INT,
  customer_id INT,
  full_name STRING,  -- Different naming!
  city STRING,
  segment STRING     -- Not in sales dim!
);

-- Query across both fact tables: CONFUSING!
-- Which customer dimension to use?
-- customer_id '123' might have different definitions

Pattern: Single Conformed Dimension

-- ✅ GOOD: Single source of truth
CREATE TABLE dim_customer (
  customer_key INT PRIMARY KEY,
  customer_id INT,
  full_name STRING,
  email STRING,
  city STRING,
  state STRING,
  segment STRING,
  lifetime_value_bucket STRING,
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN
);

-- Both fact tables use same dimension
CREATE TABLE fact_sales (
  fact_id INT,
  customer_key INT,
  product_key INT,
  amount DECIMAL
);

CREATE TABLE fact_marketing_spend (
  fact_id INT,
  customer_key INT,
  campaign_key INT,
  spend DECIMAL
);

-- Consistent queries
SELECT
  dc.segment,
  SUM(fs.amount) as sales,
  SUM(fm.spend) as marketing_spend
FROM fact_sales fs
JOIN fact_marketing_spend fm
  ON fs.customer_key = fm.customer_key
JOIN dim_customer dc ON fs.customer_key = dc.customer_key
GROUP BY dc.segment;

Advanced Pattern 5: Role-Playing Dimensions

Definition: Same dimension table used in multiple roles (via different foreign keys).

-- Dimension: Date (single table)
CREATE TABLE dim_date (
  date_key INT PRIMARY KEY,
  date_actual DATE,
  year INT,
  month INT,
  day_of_month INT
);

-- Fact: Order (uses date in multiple roles)
CREATE TABLE fact_orders (
  order_key INT,

  -- Role 1: When order was placed
  order_date_key INT,

  -- Role 2: When order was shipped
  ship_date_key INT,

  -- Role 3: When order was delivered
  delivery_date_key INT,

  customer_key INT,
  amount DECIMAL,

  CONSTRAINT fk_order_date FOREIGN KEY (order_date_key) REFERENCES dim_date(date_key),
  CONSTRAINT fk_ship_date FOREIGN KEY (ship_date_key) REFERENCES dim_date(date_key),
  CONSTRAINT fk_delivery_date FOREIGN KEY (delivery_date_key) REFERENCES dim_date(date_key)
);

-- Query: Orders delivered late
SELECT
  f.order_key,
  do.date_actual as order_date,
  ds.date_actual as ship_date,
  dd.date_actual as delivery_date,
  dd.date_actual - ds.date_actual as delivery_delay_days
FROM fact_orders f
JOIN dim_date do ON f.order_date_key = do.date_key
JOIN dim_date ds ON f.ship_date_key = ds.date_key
JOIN dim_date dd ON f.delivery_date_key = dd.date_key
WHERE dd.date_actual - ds.date_actual > 7;  -- Delivered > 7 days late

Advanced Pattern 6: Slowly Changing Dimensions - Hybrid Approach

Problem: Track change history while maintaining query simplicity.

Pattern: SCD Type 2 with Current Flag (Best Practice)

CREATE TABLE dim_customer_scd2 (
  customer_key INT PRIMARY KEY,
  customer_id INT,
  name STRING,
  address STRING,
  city STRING,

  -- SCD Type 2 tracking
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN,
  change_reason STRING,  -- Why did it change?

  -- Lineage
  _dbt_updated_at TIMESTAMP,
  _version_number INT
);

-- Example data
INSERT INTO dim_customer_scd2 VALUES
  (1, 100, 'John Doe', '123 Old St', 'NYC', '2024-01-01', NULL, TRUE, 'Initial', NOW(), 1),
  (2, 100, 'John Doe', '456 New St', 'Boston', '2024-06-01', NULL, TRUE, 'Moved', NOW(), 2);

-- Query as of date (snapshot)
SELECT * FROM dim_customer_scd2
WHERE is_current = TRUE;  -- Latest version

-- Historical analysis
SELECT
  customer_id,
  address,
  city,
  effective_date,
  end_date
FROM dim_customer_scd2
WHERE customer_id = 100
ORDER BY effective_date;

Performance Optimization via Schema Design

Pattern: Denormalization for Speed

-- Rather than joining 5 tables every query:
-- ❌ SELECT d.year, p.category, SUM(f.amount)
--    FROM fact f
--    JOIN dim_date d, dim_product p, dim_customer c, dim_store s
--    WHERE ...

-- ✅ Denormalize common attributes into fact:
CREATE TABLE fact_sales_denormalized (
  sale_id INT,
  date DATE,
  year INT,           -- Denormalized from dim_date
  month INT,
  product_category STRING,  -- Denormalized from dim_product
  product_name STRING,
  customer_segment STRING,  -- Denormalized from dim_customer
  store_region STRING,      -- Denormalized from dim_store
  amount DECIMAL
);

-- Query is now simpler and faster
SELECT year, product_category, SUM(amount)
FROM fact_sales_denormalized
GROUP BY year, product_category;

Pattern: Surrogate Keys for Performance

-- Natural keys are semantic (business meaning)
CREATE TABLE dim_product_natural (
  product_code STRING,  -- Business key (e.g., "SKU-12345")
  product_name STRING,
  category STRING,
  price DECIMAL
);

-- Surrogate keys are numeric (faster joins)
CREATE TABLE dim_product_surrogate (
  product_key INT PRIMARY KEY,  -- Auto-increment
  product_code STRING,  -- Natural key (for traceability)
  product_name STRING,
  category STRING,
  price DECIMAL
);

-- Fact table uses surrogate keys
CREATE TABLE fact_sales (
  sale_id INT,
  product_key INT,  -- Numeric join is faster
  customer_key INT,
  amount DECIMAL
);

Schema Design Best Practices

Keep fact tables atomic (most granular grain) ✅ Use surrogate keys (INT/BIGINT instead of STRING) ✅ Maintain conformed dimensions (single source of truth) ✅ Document grain explicitly (fact table comment) ✅ Use SCD Type 2 for historical tracking ✅ Test with actual data volumes (1B+ rows)

Don't mix grains (atomic + aggregate in same table) ❌ Don't use natural keys in fact tables (too large, change over time) ❌ Don't create one dimension per fact (duplicates logic) ❌ Don't assume "right" schema (test with queries first)


Real-World Example: E-commerce Dimensional Model

Fact: fact_orders (grain: one row per order line item)
├─ Dimensions:
│  ├─ dim_customer (SCD Type 2)
│  ├─ dim_product (SCD Type 2)
│  ├─ dim_store (SCD Type 2)
│  ├─ dim_date (role: order_date, ship_date, delivery_date)
│  └─ dim_promotion (bridge table for many-to-many)
├─ Degenerate dimensions:
│  ├─ order_id
│  ├─ order_status
│  └─ payment_method

Fact: fact_marketing_spend (grain: one row per campaign-day)
├─ Dimensions:
│  ├─ dim_customer_segment (conformed)
│  ├─ dim_campaign
│  └─ dim_date

Key Takeaways

  1. Grain defines fact table — atomic vs. aggregate
  2. Degenerate dimensions avoid unnecessary foreign keys
  3. Bridge tables handle many-to-many relationships
  4. Conformed dimensions maintain consistency
  5. SCD Type 2 preserves history for accurate analysis
  6. Denormalization improves query performance