Deep Dive: SCD Type 2 & Advanced Dimension Modeling
Level: Intermediate to Advanced Pre-reading: 05 · Data Modeling Time to read: 20 min
SCD Type 2 In-Depth: Slowly Changing Dimensions
The Core Problem Revisited
Dimensions (unlike facts) change over time. An SCD strategy determines:
- Do we keep history?
- How many versions do we keep?
- How do we join facts to the right version?
Complete SCD Type 2 Implementation
-- Dimension table (initially empty)
CREATE TABLE gold.dim_customer (
-- Surrogate key (auto-incrementing)
customer_key INT PRIMARY KEY AUTO_INCREMENT,
-- Business key (immutable identifier)
customer_id INT NOT NULL,
-- Dimension attributes
customer_name VARCHAR(500),
email VARCHAR(255),
segment VARCHAR(50), -- 'Premium' | 'Standard' | 'At-Risk'
country_code VARCHAR(2),
-- SCD Type 2: Time tracking
effective_date DATE NOT NULL, -- When this version became active
end_date DATE, -- When this version stopped being active
is_current BOOLEAN NOT NULL, -- TRUE = current version, FALSE = historical
-- Metadata
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Index for fast lookups
CREATE INDEX idx_customer_id_date ON gold.dim_customer(customer_id, effective_date, end_date);
CREATE INDEX idx_is_current ON gold.dim_customer(customer_id, is_current);
Example Data Flow
Source (Silver):
customer_id=5, segment='Standard' (2024-01-15)
Insert into Gold (SCD Type 2):
│
├─ 2024-01-15: INSERT
│ └─ dim_customer_key = 100
│ └─ customer_id = 5
│ └─ segment = 'Standard'
│ └─ effective_date = 2024-01-15
│ └─ end_date = NULL
│ └─ is_current = TRUE
│
Update (2024-07-20): Segment changes to 'Premium'
│
├─ CLOSES old version
│ └─ dim_customer_key = 100
│ └─ end_date = 2024-07-20
│ └─ is_current = FALSE
│
├─ INSERTS new version
│ └─ dim_customer_key = 101
│ └─ customer_id = 5
│ └─ segment = 'Premium'
│ └─ effective_date = 2024-07-20
│ └─ end_date = NULL
│ └─ is_current = TRUE
Query: Sales by segment for historical accuracy
│
├─ Sales on 2024-02-01
│ └─ JOIN fact_sales with dim_customer
│ └─ fact.sale_date >= dim.effective_date AND fact.sale_date < dim.end_date
│ └─ Matches: dim_customer_key = 100 (segment = 'Standard')
│
├─ Sales on 2024-08-01
│ └─ Same JOIN logic
│ └─ Matches: dim_customer_key = 101 (segment = 'Premium')
The MERGE Implementation (Idempotent)
-- Idempotent SCD Type 2 using MERGE
-- Running twice = same result (no duplicates)
MERGE INTO gold.dim_customer t
USING (
SELECT
customer_id,
customer_name,
email,
segment,
country_code,
CURRENT_DATE() as load_date
FROM silver.customers
) s
ON t.customer_id = s.customer_id
AND t.is_current = TRUE -- Only match current version
WHEN MATCHED AND (
-- Detect any attribute change
t.customer_name != s.customer_name
OR t.email != s.email
OR t.segment != s.segment
OR t.country_code != s.country_code
) THEN
-- Close old version
UPDATE SET
is_current = FALSE,
end_date = s.load_date
WHEN NOT MATCHED THEN
-- Insert new version
INSERT (
customer_id,
customer_name,
email,
segment,
country_code,
effective_date,
end_date,
is_current
) VALUES (
s.customer_id,
s.customer_name,
s.email,
s.segment,
s.country_code,
s.load_date,
NULL,
TRUE
);
Querying SCD Type 2 Dimensions (Critical!)
Query 1: Current state (as-is)
-- What's customer segment TODAY?
SELECT
customer_id,
segment
FROM gold.dim_customer
WHERE is_current = TRUE;
-- Simple: Just filter is_current = TRUE
Query 2: Historical accuracy (with facts)
-- Revenue by customer segment (historically accurate)
SELECT
dim.segment,
SUM(fact.amount) as revenue,
COUNT(DISTINCT fact.sale_id) as num_sales
FROM fact_sales fact
JOIN gold.dim_customer dim ON fact.customer_Key = dim.customer_key
WHERE fact.sale_date >= DATE_SUB(CURRENT_DATE, INTERVAL 2 YEAR)
GROUP BY 1
ORDER BY 2 DESC;
-- Implicitly accurate because:
-- fact.sale_date >= dim.effective_date AND fact.sale_date <= dim.end_date
-- (fact's foreign key points to correct dim version)
Query 3: Audit trail (track changes)
-- Show all versions of customer 5
SELECT
customer_key,
customer_id,
segment,
effective_date,
end_date,
is_current,
DATEDIFF(day, effective_date, COALESCE(end_date, CURRENT_DATE)) as days_active
FROM gold.dim_customer
WHERE customer_id = 5
ORDER BY effective_date;
-- Output:
-- | 100 | 5 | Standard | 2024-01-15 | 2024-07-20 | FALSE | 187 |
-- | 101 | 5 | Premium | 2024-07-20 | NULL | TRUE | 157 |
Advanced Dimension Modeling
Bridge Tables (Many-to-Many)
When a dimension has many-to-many relationship with fact:
Example: Customer enrolls in multiple loyalty programs
├─ Customer 5 → Program A, Program B, Program C
├─ Customer 6 → Program B, Program D
└─ Problem: Star schema assumes one program per customer!
Solution: Bridge table
-- Bridge table: Links customer to multiple programs
CREATE TABLE bridge_customer_programs (
bridge_key INT PRIMARY KEY,
customer_key INT, -- FK to dim_customer
program_key INT, -- FK to dim_program
effective_date DATE,
end_date DATE,
is_current BOOLEAN,
UNIQUE(customer_key, program_key, effective_date)
);
-- Usage in fact table
CREATE TABLE fact_purchases (
purchase_key INT,
bridge_key INT, -- FK to bridge (instead of direct dim)
date_key INT,
amount DECIMAL,
FOREIGN KEY (bridge_key) REFERENCES bridge_customer_programs
);
-- Query: Purchases by program
SELECT
p.program_name,
SUM(f.amount) as revenue
FROM fact_purchases f
JOIN bridge_customer_programs b ON f.bridge_key = b.bridge_key
JOIN dim_program p ON b.program_key = p.program_key
WHERE b.is_current = TRUE
GROUP BY 1;
Conformed Dimensions (Revisited)
Same dimension used across multiple fact tables:
Good: All uses of "customer" reference gold.dim_customer
Bad: fact_sales has sales.customer dimension, fact_support has support.customer (different!)
Solution: Share single conformed dimension
-- ✅ CONFORMED: One dimension, many facts
DIM_CUSTOMER
├─ Used by: fact_sales, fact_support, fact_marketing
├─ Ensures: All reports use consistent customer definition
└─ Query: Can aggregate across fact tables without reconciliation
-- ✅ Example: Multi-domain query
SELECT
c.segment,
SUM(fs.amount) as sales_revenue,
SUM(fc.resolution_cost) as support_cost,
(SUM(fs.amount) - SUM(fc.resolution_cost)) as net_profit
FROM dim_customer c
LEFT JOIN fact_sales fs ON c.customer_key = fs.customer_key
LEFT JOIN fact_support fc ON c.customer_key = fc.customer_key
WHERE c.is_current = TRUE
AND YEAR(fs.sale_date) = 2024
GROUP BY 1
ORDER BY 4 DESC;
SCD Type 2 Performance Considerations
Storage Impact
1M customers, 5 years data:
Data Vault (normal dimension):
└─ 1M rows × 200 bytes = 200 MB
SCD Type 2 with changes:
└─ 1M original + 300K changes = 1.3M rows
└─ 1.3M rows × 200 bytes = 260 MB
Increase: ~30% (acceptable for historical value)
Query Performance
-- Slow: Full scan to find current version
SELECT segment FROM dim_customer
WHERE customer_id = 5 AND end_date IS NULL;
→ Full table scan (1.3M rows)
-- Fast: Use is_current flag
SELECT segment FROM dim_customer
WHERE customer_id = 5 AND is_current = TRUE;
→ Index on (customer_id, is_current)
Index recommendation:
CREATE INDEX idx_customer_id_is_current
ON dim_customer(customer_id, is_current);
-- Lookup current customer: O(1) via index
-- Lookup historical: Full scan needed (acceptable for audits)
Key Takeaways
✅ SCD Type 2 requires surrogate keys to work correctly.
✅ MERGE statement is idempotent (critical for safe retries).
✅ Fact tables join to dimension via surrogate key (not business key).
✅ Queries must account for temporal accuracy (fact_date >= eff_date AND fact_date < end_date implicit in PK join).
✅ Bridge tables handle many-to-many dimensions.
✅ Conformed dimensions ensure consistency across facts.
Practice Questions
- Why use surrogate keys for SCD Type 2?
- How does MERGE ensure idempotency?
- What breaks if you miss the
is_currentcheck? - Design a conformed
dim_datefor multiple fact tables. - When would you use a bridge table instead of snow-flaking?