02 · Data Warehouse Architecture: Logical Design & Schema Patterns

Level: Intermediate Time to read: 25 min Pre-reading: 01 · Foundations After reading: You'll understand dimensional modeling, star/snowflake schemas, and design principles that make warehouses fast and maintainable.

From Operational to Analytical Schema

The Problem with Normalized Schemas in Analytics

Operational databases use normalization to avoid anomalies:

-- OLTP Normalized (3NF)
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  order_date DATE,
  FOREIGN KEY (customer_id) REFERENCES customers(id)
);

CREATE TABLE order_items (
  item_id INT PRIMARY KEY,
  order_id INT,
  product_id INT,
  quantity INT,
  unit_price DECIMAL,
  FOREIGN KEY (order_id) REFERENCES orders(order_id),
  FOREIGN KEY (product_id) REFERENCES products(id)
);

-- Analytical query needs 5+ JOINs
SELECT 
  c.country,
  p.category,
  SUM(oi.quantity * oi.unit_price) as revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.order_date >= '2024-01-01'
GROUP BY 1, 2;

The cost: Each JOIN requires disk I/O. With millions of rows, this is slow.

Solution: Denormalization via Dimensional Modeling

-- OLAP Denormalized (Dimensional Model)
CREATE TABLE fact_sales (
  sale_id INT PRIMARY KEY,
  customer_dim_id INT,
  product_dim_id INT,
  date_dim_id INT,
  amount DECIMAL,
  quantity INT,
  FOREIGN KEY (customer_dim_id) REFERENCES dim_customer,
  FOREIGN KEY (product_dim_id) REFERENCES dim_product,
  FOREIGN KEY (date_dim_id) REFERENCES dim_date
);

-- Analytical query is simpler and faster
SELECT 
  dc.country,
  dp.category,
  SUM(fs.amount) as revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.customer_dim_id = dc.id
JOIN dim_product dp ON fs.product_dim_id = dp.id
JOIN dim_date dd ON fs.date_dim_id = dd.id
WHERE dd.date_year = 2024
GROUP BY 1, 2;

Core Concepts: Facts & Dimensions

Facts: The Measurements

A fact is a business event or measurement:

Sales transaction (amount, quantity, cost)
Click event (user, product, timestamp, source)
Page load (duration, error code, region)

Characteristics:

✅ Additive (can be summed): quantity, amount, count
⚠️ Semi-additive (limited aggregation): balance (can sum by time dimension only)
❌ Non-additive (don't sum): price, average (derived)
Large (fact tables have billions of rows)
Sparse (columns are mostly populated)

CREATE TABLE fact_sales (
  sale_id INT,
  customer_dim_id INT,
  product_dim_id INT,
  store_dim_id INT,
  date_dim_id INT,

  -- Additive measures
  amount DECIMAL,         -- SUM(amount)
  quantity INT,           -- SUM(quantity)
  cost DECIMAL,           -- SUM(cost)
  discount_amount DECIMAL,-- SUM(discount)

  -- Non-additive
  unit_price DECIMAL,     -- Don't sum; use in joins only

  PRIMARY KEY (sale_id)
);

Dimensions: The Descriptive Attributes

A dimension describes a fact from one perspective:

Customer dimension: demographics, segment, loyalty tier
Product dimension: category, brand, size, color
Date dimension: day of week, month, quarter, fiscal period
Store dimension: location, district, region, manager

Characteristics:

✅ Descriptive (text, categorical)
Small to medium (thousands to millions of rows, not billions)
✅ Frequently filtered (WHERE, HAVING on dimensions)
Relatively static (updated daily at most)

CREATE TABLE dim_customer (
  customer_dim_id INT PRIMARY KEY,
  customer_id INT UNIQUE,  -- Surrogate vs natural key

  -- Descriptive attributes
  name VARCHAR,
  email VARCHAR,
  segment VARCHAR,         -- 'Premium' | 'Standard' | 'At-Risk'
  country VARCHAR,
  age_group VARCHAR,

  -- Surrogate key for time tracking
  effective_date DATE,     -- SCD Type 2: history tracking
  end_date DATE,
  is_current BOOLEAN
);

Schema Architectures

Star Schema ⭐️ (Most Common)

Structure: One central fact table surrounded by dimension tables.

graph TB
    DT["dim_date<br/>year, quarter,<br/>month, day_of_week"]
    DP["dim_product<br/>category, brand,<br/>color, size"]
    DC["dim_customer<br/>segment, country,<br/>age_group"]
    DS["dim_store<br/>region, district,<br/>manager"]
    FS["FACT_SALES<br/>amount, qty,<br/>cost, discount<br/>(Central)"]

    DT -.-> FS
    DP -.-> FS
    DC -.-> FS
    DS -.-> FS

Advantages

Advantage	Why It Matters
Simple queries	1 fact table JOIN 3-5 dimensions vs. 7+ normalized tables
Fast aggregations	Columnar databases compress dimensions efficiently
Balanced	Fact demoralization without excessive redundancy
Standard business vocabulary	Clear separation of facts and descriptive attributes
BI tool friendly	Business intelligence tools assume star schema

Example Query

-- Star schema: 1 fact JOIN 3 dimensions
SELECT 
  d.year,
  dc.country,
  dp.category,
  SUM(fs.amount) as revenue,
  SUM(fs.quantity) as units_sold,
  AVG(fs.unit_price) FILTER (WHERE fs.unit_price > 0) as avg_price
FROM fact_sales fs
JOIN dim_date d ON fs.date_dim_id = d.id
JOIN dim_customer dc ON fs.customer_dim_id = dc.id
JOIN dim_product dp ON fs.product_dim_id = dp.id
WHERE d.year = 2024
  AND dc.country = 'US'
GROUP BY 1, 2, 3
ORDER BY 4 DESC;

Snowflake Schema ❄️ (Normalized Dimensions)

Structure: Fact table with hierarchically normalized dimensions.

graph TB
    DR["dim_region<br/>region_id, region"]
    DD["dim_district<br/>district_id, district<br/>region_id"]
    DS["dim_store<br/>store_id, store_name,<br/>district_id"]

    DT["dim_date<br/>year, quarter,<br/>month, day"]
    FS["FACT_SALES<br/>(Central)"]

    DR --> DD
    DD --> DS
    DT -.-> FS
    DS -.-> FS

Advantages

Advantage	Detail
Less redundancy	Dimension attributes (e.g., region) stored once, not repeated
Easier maintenance	If region name changes, update once
Normalized consistency	Follows 3NF on dimension side

Disadvantages

Example Query

-- Snowflake: 1 fact + 3+ normalized dimension JOINs
SELECT 
  d.year,
  r.region,
  dp.category,
  SUM(fs.amount) as revenue
FROM fact_sales fs
JOIN dim_date d ON fs.date_dim_id = d.id
JOIN dim_store ds ON fs.store_dim_id = ds.id
JOIN dim_district dd ON ds.district_id = dd.id
JOIN dim_region dr ON dd.region_id = dr.id  -- Extra JOIN!
JOIN dim_product dp ON fs.product_dim_id = dp.id
WHERE d.year = 2024
GROUP BY 1, 2, 3;

Star vs. Snowflake: When to Choose

Criterion	Star Schema	Snowflake Schema
Storage efficiency	❌ Redundant dimension data	✅ Normalized, compact
Query performance	✅ Fewer JOINs, faster	❌ More JOINs, slower
Maintenance burden	⚠️ Update redundant values	✅ Single source per attribute
BI tool compatibility	✅ Expected model	⚠️ Needs special config
Recommended for	Modern OLAP systems	Legacy or specialized cases

Modern consensus: Star schema is preferred 95% of the time.

Columnar databases (BigQuery, Snowflake, Databricks) compress redundancy efficiently
BI tools expect star schemas
Query performance is more important than storage redundancy

Surrogate vs. Natural Keys

Natural Keys (Business Keys)

D IM_PRODUCT (
  product_code VARCHAR PRIMARY KEY,  -- "SKU-12345"
  product_name VARCHAR,
  category VARCHAR
);

FACT_SALES (
  product_code VARCHAR REFERENCES dim_product(product_code),
  amount DECIMAL
);

Pros: Meaningful, readable, no extra mapping table
Cons: If business key changes (rare but happens), JOIN logic breaks

Surrogate Keys (Artificial Keys)

DIM_PRODUCT (
  product_dim_id INT PRIMARY KEY,    -- 1, 2, 3, ...
  product_code VARCHAR UNIQUE,       -- "SKU-12345"
  product_name VARCHAR,
  category VARCHAR
);

FACT_SALES (
  product_dim_id INT REFERENCES dim_product(product_dim_id),
  amount DECIMAL
);

Pros: Immutable, supports SCD Type 2 (history), compact integers, flexible
Cons: Extra lookup table needed, one level of indirection

Modern best practice: Use surrogate keys in dimensions → enables SCD Type 2.

Fact Table Types

Transaction Facts

Each row = one business transaction.

Granularity: One row per transaction
Frequency: Real-time or daily
Example: fact_sales (one row per order line item)

Periodic Snapshot Facts

Each row = aggregated state at a point in time.

Granularity: One row per account + time period
Frequency: Daily or monthly
Example: fact_account_balance (daily balance per account)

Cumulative Facts

Each row = lifetime cumulative measurements.

Granularity: One row per fact occurrence
Frequency: One-time at completion
Example: fact_order_fulfillment (one row per order, updated as it progresses)

Summary Table

Concept	Star	Snowflake	Notes
Dimension normalization	Denormalized	Normalized	Star is standard
JOINs needed	Fewer (1 fact + 3-5 dims)	More (1 fact + 5-10+ tables)	More JOINs = slower
Redundancy	High (intentional)	Low	Trade-off: redundancy vs. joins
BI tool support	✅ Native	⚠️ Manual config	BI tools assume star
Recommended	✅ For modern OLAP	❌ Rarely needed	Unless storage is critical

Deep-Dives

→ Deep Dive: Dimensional Modeling — Advanced dimensional techniques (conformed dimensions, bridge tables, SCD types).

→ Deep Dive: Schema Design — Practical schema design for different business domains.

→ Deep Dive: Data Warehouse Architecture — How medallion layers organize fact/dimension tables.

Key Takeaways

✅ Dimensional modeling (star schema) is the dominant pattern for OLAP systems.

✅ Facts are additive measurements; dimensions are descriptive attributes.

✅ Star schema wins on query performance; snowflake schema wins on storage (rarely needed in modern systems).

✅ Surrogate keys are essential for tracking dimension history (SCD Type 2).

✅ Denormalization is intentional in data warehouses — trade redundancy for query speed.

Practice Questions

Why does dimensional modeling work better than normalization for analytics?
What's the difference between an additive and non-additive measure? Give examples.
When would you choose snowflake schema over star? (Hint: rarely, and only if storage cost dominates.)
How does a surrogate key enable tracking dimension history?
What's the trade-off between query performance and storage efficiency in star vs. snowflake?

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search