Deep Dive: SCD Type 2 & Advanced Dimension Modeling

Level: Intermediate to Advanced Pre-reading: 05 · Data Modeling Time to read: 20 min


SCD Type 2 In-Depth: Slowly Changing Dimensions

The Core Problem Revisited

Dimensions (unlike facts) change over time. An SCD strategy determines:

  1. Do we keep history?
  2. How many versions do we keep?
  3. How do we join facts to the right version?

Complete SCD Type 2 Implementation

-- Dimension table (initially empty)
CREATE TABLE gold.dim_customer (
  -- Surrogate key (auto-incrementing)
  customer_key INT PRIMARY KEY AUTO_INCREMENT,

  -- Business key (immutable identifier)
  customer_id INT NOT NULL,

  -- Dimension attributes
  customer_name VARCHAR(500),
  email VARCHAR(255),
  segment VARCHAR(50),  -- 'Premium' | 'Standard' | 'At-Risk'
  country_code VARCHAR(2),

  -- SCD Type 2: Time tracking
  effective_date DATE NOT NULL,      -- When this version became active
  end_date DATE,                     -- When this version stopped being active
  is_current BOOLEAN NOT NULL,       -- TRUE = current version, FALSE = historical

  -- Metadata
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Index for fast lookups
CREATE INDEX idx_customer_id_date ON gold.dim_customer(customer_id, effective_date, end_date);
CREATE INDEX idx_is_current ON gold.dim_customer(customer_id, is_current);

Example Data Flow

Source (Silver):
  customer_id=5, segment='Standard' (2024-01-15)

Insert into Gold (SCD Type 2):
├─ 2024-01-15: INSERT
│  └─ dim_customer_key = 100
│  └─ customer_id = 5
│  └─ segment = 'Standard'
│  └─ effective_date = 2024-01-15
│  └─ end_date = NULL
│  └─ is_current = TRUE
Update (2024-07-20): Segment changes to 'Premium'
├─ CLOSES old version
│  └─ dim_customer_key = 100
│  └─ end_date = 2024-07-20
│  └─ is_current = FALSE
├─ INSERTS new version
│  └─ dim_customer_key = 101
│  └─ customer_id = 5
│  └─ segment = 'Premium'
│  └─ effective_date = 2024-07-20
│  └─ end_date = NULL
│  └─ is_current = TRUE

Query: Sales by segment for historical accuracy
├─ Sales on 2024-02-01
│  └─ JOIN fact_sales with dim_customer
│  └─ fact.sale_date >= dim.effective_date AND fact.sale_date < dim.end_date
│  └─ Matches: dim_customer_key = 100 (segment = 'Standard')
├─ Sales on 2024-08-01
│  └─ Same JOIN logic
│  └─ Matches: dim_customer_key = 101 (segment = 'Premium')

The MERGE Implementation (Idempotent)

-- Idempotent SCD Type 2 using MERGE
-- Running twice = same result (no duplicates)

MERGE INTO gold.dim_customer t
USING (
  SELECT 
    customer_id,
    customer_name,
    email,
    segment,
    country_code,
    CURRENT_DATE() as load_date
  FROM silver.customers
) s
ON t.customer_id = s.customer_id 
  AND t.is_current = TRUE  -- Only match current version
WHEN MATCHED AND (
  -- Detect any attribute change
  t.customer_name != s.customer_name
  OR t.email != s.email
  OR t.segment != s.segment
  OR t.country_code != s.country_code
) THEN
  -- Close old version
  UPDATE SET 
    is_current = FALSE,
    end_date = s.load_date
WHEN NOT MATCHED THEN
  -- Insert new version
  INSERT (
    customer_id,
    customer_name,
    email,
    segment,
    country_code,
    effective_date,
    end_date,
    is_current
  ) VALUES (
    s.customer_id,
    s.customer_name,
    s.email,
    s.segment,
    s.country_code,
    s.load_date,
    NULL,
    TRUE
  );

Querying SCD Type 2 Dimensions (Critical!)

Query 1: Current state (as-is)

-- What's customer segment TODAY?
SELECT 
  customer_id,
  segment
FROM gold.dim_customer
WHERE is_current = TRUE;

-- Simple: Just filter is_current = TRUE

Query 2: Historical accuracy (with facts)

-- Revenue by customer segment (historically accurate)
SELECT 
  dim.segment,
  SUM(fact.amount) as revenue,
  COUNT(DISTINCT fact.sale_id) as num_sales
FROM fact_sales fact
JOIN gold.dim_customer dim ON fact.customer_Key = dim.customer_key
WHERE fact.sale_date >= DATE_SUB(CURRENT_DATE, INTERVAL 2 YEAR)
GROUP BY 1
ORDER BY 2 DESC;

-- Implicitly accurate because:
--  fact.sale_date >= dim.effective_date AND fact.sale_date <= dim.end_date
--  (fact's foreign key points to correct dim version)

Query 3: Audit trail (track changes)

-- Show all versions of customer 5
SELECT 
  customer_key,
  customer_id,
  segment,
  effective_date,
  end_date,
  is_current,
  DATEDIFF(day, effective_date, COALESCE(end_date, CURRENT_DATE)) as days_active
FROM gold.dim_customer
WHERE customer_id = 5
ORDER BY effective_date;

-- Output:
-- | 100 | 5 | Standard | 2024-01-15 | 2024-07-20 | FALSE | 187 |
-- | 101 | 5 | Premium  | 2024-07-20 | NULL       | TRUE  | 157 |

Advanced Dimension Modeling

Bridge Tables (Many-to-Many)

When a dimension has many-to-many relationship with fact:

Example: Customer enrolls in multiple loyalty programs
  ├─ Customer 5 → Program A, Program B, Program C
  ├─ Customer 6 → Program B, Program D
  └─ Problem: Star schema assumes one program per customer!

Solution: Bridge table
-- Bridge table: Links customer to multiple programs
CREATE TABLE bridge_customer_programs (
  bridge_key INT PRIMARY KEY,
  customer_key INT,  -- FK to dim_customer
  program_key INT,   -- FK to dim_program
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN,

  UNIQUE(customer_key, program_key, effective_date)
);

-- Usage in fact table
CREATE TABLE fact_purchases (
  purchase_key INT,
  bridge_key INT,     -- FK to bridge (instead of direct dim)
  date_key INT,
  amount DECIMAL,

  FOREIGN KEY (bridge_key) REFERENCES bridge_customer_programs
);

-- Query: Purchases by program
SELECT 
  p.program_name,
  SUM(f.amount) as revenue
FROM fact_purchases f
JOIN bridge_customer_programs b ON f.bridge_key = b.bridge_key
JOIN dim_program p ON b.program_key = p.program_key
WHERE b.is_current = TRUE
GROUP BY 1;

Conformed Dimensions (Revisited)

Same dimension used across multiple fact tables:

Good: All uses of "customer" reference gold.dim_customer
Bad: fact_sales has sales.customer dimension, fact_support has support.customer (different!)

Solution: Share single conformed dimension
-- ✅ CONFORMED: One dimension, many facts
DIM_CUSTOMER
  ├─ Used by: fact_sales, fact_support, fact_marketing
  ├─ Ensures: All reports use consistent customer definition
  └─ Query: Can aggregate across fact tables without reconciliation

-- ✅ Example: Multi-domain query
SELECT 
  c.segment,
  SUM(fs.amount) as sales_revenue,
  SUM(fc.resolution_cost) as support_cost,
  (SUM(fs.amount) - SUM(fc.resolution_cost)) as net_profit
FROM dim_customer c
LEFT JOIN fact_sales fs ON c.customer_key = fs.customer_key
LEFT JOIN fact_support fc ON c.customer_key = fc.customer_key
WHERE c.is_current = TRUE
  AND YEAR(fs.sale_date) = 2024
GROUP BY 1
ORDER BY 4 DESC;

SCD Type 2 Performance Considerations

Storage Impact

1M customers, 5 years data:

Data Vault (normal dimension):
  └─ 1M rows × 200 bytes = 200 MB

SCD Type 2 with changes:
  └─ 1M original + 300K changes = 1.3M rows
  └─ 1.3M rows × 200 bytes = 260 MB

Increase: ~30% (acceptable for historical value)

Query Performance

-- Slow: Full scan to find current version
SELECT segment FROM dim_customer 
WHERE customer_id = 5 AND end_date IS NULL;
 Full table scan (1.3M rows)

-- Fast: Use is_current flag
SELECT segment FROM dim_customer 
WHERE customer_id = 5 AND is_current = TRUE;
 Index on (customer_id, is_current)

Index recommendation:

CREATE INDEX idx_customer_id_is_current 
ON dim_customer(customer_id, is_current);

-- Lookup current customer: O(1) via index
-- Lookup historical: Full scan needed (acceptable for audits)


Key Takeaways

SCD Type 2 requires surrogate keys to work correctly.

MERGE statement is idempotent (critical for safe retries).

Fact tables join to dimension via surrogate key (not business key).

Queries must account for temporal accuracy (fact_date >= eff_date AND fact_date < end_date implicit in PK join).

Bridge tables handle many-to-many dimensions.

Conformed dimensions ensure consistency across facts.


Practice Questions

  1. Why use surrogate keys for SCD Type 2?
  2. How does MERGE ensure idempotency?
  3. What breaks if you miss the is_current check?
  4. Design a conformed dim_date for multiple fact tables.
  5. When would you use a bridge table instead of snow-flaking?