02 · Data Warehouse Architecture: Logical Design & Schema Patterns

Level: Intermediate Time to read: 25 min Pre-reading: 01 · Foundations After reading: You'll understand dimensional modeling, star/snowflake schemas, and design principles that make warehouses fast and maintainable.


From Operational to Analytical Schema

The Problem with Normalized Schemas in Analytics

Operational databases use normalization to avoid anomalies:

-- OLTP Normalized (3NF)
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  order_date DATE,
  FOREIGN KEY (customer_id) REFERENCES customers(id)
);

CREATE TABLE order_items (
  item_id INT PRIMARY KEY,
  order_id INT,
  product_id INT,
  quantity INT,
  unit_price DECIMAL,
  FOREIGN KEY (order_id) REFERENCES orders(order_id),
  FOREIGN KEY (product_id) REFERENCES products(id)
);

-- Analytical query needs 5+ JOINs
SELECT 
  c.country,
  p.category,
  SUM(oi.quantity * oi.unit_price) as revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.order_date >= '2024-01-01'
GROUP BY 1, 2;

The cost: Each JOIN requires disk I/O. With millions of rows, this is slow.

Solution: Denormalization via Dimensional Modeling

-- OLAP Denormalized (Dimensional Model)
CREATE TABLE fact_sales (
  sale_id INT PRIMARY KEY,
  customer_dim_id INT,
  product_dim_id INT,
  date_dim_id INT,
  amount DECIMAL,
  quantity INT,
  FOREIGN KEY (customer_dim_id) REFERENCES dim_customer,
  FOREIGN KEY (product_dim_id) REFERENCES dim_product,
  FOREIGN KEY (date_dim_id) REFERENCES dim_date
);

-- Analytical query is simpler and faster
SELECT 
  dc.country,
  dp.category,
  SUM(fs.amount) as revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.customer_dim_id = dc.id
JOIN dim_product dp ON fs.product_dim_id = dp.id
JOIN dim_date dd ON fs.date_dim_id = dd.id
WHERE dd.date_year = 2024
GROUP BY 1, 2;

Core Concepts: Facts & Dimensions

Facts: The Measurements

A fact is a business event or measurement:

  • Sales transaction (amount, quantity, cost)
  • Click event (user, product, timestamp, source)
  • Page load (duration, error code, region)

Characteristics:

  • ✅ Additive (can be summed): quantity, amount, count
  • ⚠️ Semi-additive (limited aggregation): balance (can sum by time dimension only)
  • ❌ Non-additive (don't sum): price, average (derived)
  • Large (fact tables have billions of rows)
  • Sparse (columns are mostly populated)
CREATE TABLE fact_sales (
  sale_id INT,
  customer_dim_id INT,
  product_dim_id INT,
  store_dim_id INT,
  date_dim_id INT,

  -- Additive measures
  amount DECIMAL,         -- SUM(amount)
  quantity INT,           -- SUM(quantity)
  cost DECIMAL,           -- SUM(cost)
  discount_amount DECIMAL,-- SUM(discount)

  -- Non-additive
  unit_price DECIMAL,     -- Don't sum; use in joins only

  PRIMARY KEY (sale_id)
);

Dimensions: The Descriptive Attributes

A dimension describes a fact from one perspective:

  • Customer dimension: demographics, segment, loyalty tier
  • Product dimension: category, brand, size, color
  • Date dimension: day of week, month, quarter, fiscal period
  • Store dimension: location, district, region, manager

Characteristics:

  • ✅ Descriptive (text, categorical)
  • Small to medium (thousands to millions of rows, not billions)
  • ✅ Frequently filtered (WHERE, HAVING on dimensions)
  • Relatively static (updated daily at most)
CREATE TABLE dim_customer (
  customer_dim_id INT PRIMARY KEY,
  customer_id INT UNIQUE,  -- Surrogate vs natural key

  -- Descriptive attributes
  name VARCHAR,
  email VARCHAR,
  segment VARCHAR,         -- 'Premium' | 'Standard' | 'At-Risk'
  country VARCHAR,
  age_group VARCHAR,

  -- Surrogate key for time tracking
  effective_date DATE,     -- SCD Type 2: history tracking
  end_date DATE,
  is_current BOOLEAN
);

Schema Architectures

Star Schema ⭐️ (Most Common)

Structure: One central fact table surrounded by dimension tables.

graph TB
    DT["dim_date<br/>year, quarter,<br/>month, day_of_week"]
    DP["dim_product<br/>category, brand,<br/>color, size"]
    DC["dim_customer<br/>segment, country,<br/>age_group"]
    DS["dim_store<br/>region, district,<br/>manager"]
    FS["FACT_SALES<br/>amount, qty,<br/>cost, discount<br/>(Central)"]

    DT -.-> FS
    DP -.-> FS
    DC -.-> FS
    DS -.-> FS

Advantages

Advantage Why It Matters
Simple queries 1 fact table JOIN 3-5 dimensions vs. 7+ normalized tables
Fast aggregations Columnar databases compress dimensions efficiently
Balanced Fact demoralization without excessive redundancy
Standard business vocabulary Clear separation of facts and descriptive attributes
BI tool friendly Business intelligence tools assume star schema

Example Query

-- Star schema: 1 fact JOIN 3 dimensions
SELECT 
  d.year,
  dc.country,
  dp.category,
  SUM(fs.amount) as revenue,
  SUM(fs.quantity) as units_sold,
  AVG(fs.unit_price) FILTER (WHERE fs.unit_price > 0) as avg_price
FROM fact_sales fs
JOIN dim_date d ON fs.date_dim_id = d.id
JOIN dim_customer dc ON fs.customer_dim_id = dc.id
JOIN dim_product dp ON fs.product_dim_id = dp.id
WHERE d.year = 2024
  AND dc.country = 'US'
GROUP BY 1, 2, 3
ORDER BY 4 DESC;

Snowflake Schema ❄️ (Normalized Dimensions)

Structure: Fact table with hierarchically normalized dimensions.

graph TB
    DR["dim_region<br/>region_id, region"]
    DD["dim_district<br/>district_id, district<br/>region_id"]
    DS["dim_store<br/>store_id, store_name,<br/>district_id"]

    DT["dim_date<br/>year, quarter,<br/>month, day"]
    FS["FACT_SALES<br/>(Central)"]

    DR --> DD
    DD --> DS
    DT -.-> FS
    DS -.-> FS

Advantages

Advantage Detail
Less redundancy Dimension attributes (e.g., region) stored once, not repeated
Easier maintenance If region name changes, update once
Normalized consistency Follows 3NF on dimension side

Disadvantages

| Disadvantage | Impact | |--| | More JOINs required | 1 fact + 3 normalized dimension JOINs = slower | | Harder to optimize | Query optimizer struggles with deep hierarchies | | BI tools struggle | Not the expected schema structure |

Example Query

-- Snowflake: 1 fact + 3+ normalized dimension JOINs
SELECT 
  d.year,
  r.region,
  dp.category,
  SUM(fs.amount) as revenue
FROM fact_sales fs
JOIN dim_date d ON fs.date_dim_id = d.id
JOIN dim_store ds ON fs.store_dim_id = ds.id
JOIN dim_district dd ON ds.district_id = dd.id
JOIN dim_region dr ON dd.region_id = dr.id  -- Extra JOIN!
JOIN dim_product dp ON fs.product_dim_id = dp.id
WHERE d.year = 2024
GROUP BY 1, 2, 3;

Star vs. Snowflake: When to Choose

Criterion Star Schema Snowflake Schema
Storage efficiency ❌ Redundant dimension data ✅ Normalized, compact
Query performance ✅ Fewer JOINs, faster ❌ More JOINs, slower
Maintenance burden ⚠️ Update redundant values ✅ Single source per attribute
BI tool compatibility ✅ Expected model ⚠️ Needs special config
Recommended for Modern OLAP systems Legacy or specialized cases

Modern consensus: Star schema is preferred 95% of the time.

  • Columnar databases (BigQuery, Snowflake, Databricks) compress redundancy efficiently
  • BI tools expect star schemas
  • Query performance is more important than storage redundancy

Surrogate vs. Natural Keys

Natural Keys (Business Keys)

D IM_PRODUCT (
  product_code VARCHAR PRIMARY KEY,  -- "SKU-12345"
  product_name VARCHAR,
  category VARCHAR
);

FACT_SALES (
  product_code VARCHAR REFERENCES dim_product(product_code),
  amount DECIMAL
);

Pros: Meaningful, readable, no extra mapping table
Cons: If business key changes (rare but happens), JOIN logic breaks

Surrogate Keys (Artificial Keys)

DIM_PRODUCT (
  product_dim_id INT PRIMARY KEY,    -- 1, 2, 3, ...
  product_code VARCHAR UNIQUE,       -- "SKU-12345"
  product_name VARCHAR,
  category VARCHAR
);

FACT_SALES (
  product_dim_id INT REFERENCES dim_product(product_dim_id),
  amount DECIMAL
);

Pros: Immutable, supports SCD Type 2 (history), compact integers, flexible
Cons: Extra lookup table needed, one level of indirection

Modern best practice: Use surrogate keys in dimensions → enables SCD Type 2.


Fact Table Types

Transaction Facts

Each row = one business transaction.

  • Granularity: One row per transaction
  • Frequency: Real-time or daily
  • Example: fact_sales (one row per order line item)

Periodic Snapshot Facts

Each row = aggregated state at a point in time.

  • Granularity: One row per account + time period
  • Frequency: Daily or monthly
  • Example: fact_account_balance (daily balance per account)

Cumulative Facts

Each row = lifetime cumulative measurements.

  • Granularity: One row per fact occurrence
  • Frequency: One-time at completion
  • Example: fact_order_fulfillment (one row per order, updated as it progresses)

Summary Table

Concept Star Snowflake Notes
Dimension normalization Denormalized Normalized Star is standard
JOINs needed Fewer (1 fact + 3-5 dims) More (1 fact + 5-10+ tables) More JOINs = slower
Redundancy High (intentional) Low Trade-off: redundancy vs. joins
BI tool support ✅ Native ⚠️ Manual config BI tools assume star
Recommended ✅ For modern OLAP ❌ Rarely needed Unless storage is critical

Deep-Dives

Deep Dive: Dimensional Modeling — Advanced dimensional techniques (conformed dimensions, bridge tables, SCD types).

Deep Dive: Schema Design — Practical schema design for different business domains.

Deep Dive: Data Warehouse Architecture — How medallion layers organize fact/dimension tables.


Key Takeaways

Dimensional modeling (star schema) is the dominant pattern for OLAP systems.

Facts are additive measurements; dimensions are descriptive attributes.

Star schema wins on query performance; snowflake schema wins on storage (rarely needed in modern systems).

Surrogate keys are essential for tracking dimension history (SCD Type 2).

Denormalization is intentional in data warehouses — trade redundancy for query speed.


Practice Questions

  1. Why does dimensional modeling work better than normalization for analytics?
  2. What's the difference between an additive and non-additive measure? Give examples.
  3. When would you choose snowflake schema over star? (Hint: rarely, and only if storage cost dominates.)
  4. How does a surrogate key enable tracking dimension history?
  5. What's the trade-off between query performance and storage efficiency in star vs. snowflake?