02 · Data Warehouse Architecture: Logical Design & Schema Patterns
Level: Intermediate Time to read: 25 min Pre-reading: 01 · Foundations After reading: You'll understand dimensional modeling, star/snowflake schemas, and design principles that make warehouses fast and maintainable.
From Operational to Analytical Schema
The Problem with Normalized Schemas in Analytics
Operational databases use normalization to avoid anomalies:
-- OLTP Normalized (3NF)
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
FOREIGN KEY (customer_id) REFERENCES customers(id)
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL,
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_id) REFERENCES products(id)
);
-- Analytical query needs 5+ JOINs
SELECT
c.country,
p.category,
SUM(oi.quantity * oi.unit_price) as revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.order_date >= '2024-01-01'
GROUP BY 1, 2;
The cost: Each JOIN requires disk I/O. With millions of rows, this is slow.
Solution: Denormalization via Dimensional Modeling
-- OLAP Denormalized (Dimensional Model)
CREATE TABLE fact_sales (
sale_id INT PRIMARY KEY,
customer_dim_id INT,
product_dim_id INT,
date_dim_id INT,
amount DECIMAL,
quantity INT,
FOREIGN KEY (customer_dim_id) REFERENCES dim_customer,
FOREIGN KEY (product_dim_id) REFERENCES dim_product,
FOREIGN KEY (date_dim_id) REFERENCES dim_date
);
-- Analytical query is simpler and faster
SELECT
dc.country,
dp.category,
SUM(fs.amount) as revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.customer_dim_id = dc.id
JOIN dim_product dp ON fs.product_dim_id = dp.id
JOIN dim_date dd ON fs.date_dim_id = dd.id
WHERE dd.date_year = 2024
GROUP BY 1, 2;
Core Concepts: Facts & Dimensions
Facts: The Measurements
A fact is a business event or measurement:
- Sales transaction (amount, quantity, cost)
- Click event (user, product, timestamp, source)
- Page load (duration, error code, region)
Characteristics:
- ✅ Additive (can be summed): quantity, amount, count
- ⚠️ Semi-additive (limited aggregation): balance (can sum by time dimension only)
- ❌ Non-additive (don't sum): price, average (derived)
- Large (fact tables have billions of rows)
- Sparse (columns are mostly populated)
CREATE TABLE fact_sales (
sale_id INT,
customer_dim_id INT,
product_dim_id INT,
store_dim_id INT,
date_dim_id INT,
-- Additive measures
amount DECIMAL, -- SUM(amount)
quantity INT, -- SUM(quantity)
cost DECIMAL, -- SUM(cost)
discount_amount DECIMAL,-- SUM(discount)
-- Non-additive
unit_price DECIMAL, -- Don't sum; use in joins only
PRIMARY KEY (sale_id)
);
Dimensions: The Descriptive Attributes
A dimension describes a fact from one perspective:
- Customer dimension: demographics, segment, loyalty tier
- Product dimension: category, brand, size, color
- Date dimension: day of week, month, quarter, fiscal period
- Store dimension: location, district, region, manager
Characteristics:
- ✅ Descriptive (text, categorical)
- Small to medium (thousands to millions of rows, not billions)
- ✅ Frequently filtered (WHERE, HAVING on dimensions)
- Relatively static (updated daily at most)
CREATE TABLE dim_customer (
customer_dim_id INT PRIMARY KEY,
customer_id INT UNIQUE, -- Surrogate vs natural key
-- Descriptive attributes
name VARCHAR,
email VARCHAR,
segment VARCHAR, -- 'Premium' | 'Standard' | 'At-Risk'
country VARCHAR,
age_group VARCHAR,
-- Surrogate key for time tracking
effective_date DATE, -- SCD Type 2: history tracking
end_date DATE,
is_current BOOLEAN
);
Schema Architectures
Star Schema ⭐️ (Most Common)
Structure: One central fact table surrounded by dimension tables.
graph TB
DT["dim_date<br/>year, quarter,<br/>month, day_of_week"]
DP["dim_product<br/>category, brand,<br/>color, size"]
DC["dim_customer<br/>segment, country,<br/>age_group"]
DS["dim_store<br/>region, district,<br/>manager"]
FS["FACT_SALES<br/>amount, qty,<br/>cost, discount<br/>(Central)"]
DT -.-> FS
DP -.-> FS
DC -.-> FS
DS -.-> FS
Advantages
| Advantage | Why It Matters |
|---|---|
| Simple queries | 1 fact table JOIN 3-5 dimensions vs. 7+ normalized tables |
| Fast aggregations | Columnar databases compress dimensions efficiently |
| Balanced | Fact demoralization without excessive redundancy |
| Standard business vocabulary | Clear separation of facts and descriptive attributes |
| BI tool friendly | Business intelligence tools assume star schema |
Example Query
-- Star schema: 1 fact JOIN 3 dimensions
SELECT
d.year,
dc.country,
dp.category,
SUM(fs.amount) as revenue,
SUM(fs.quantity) as units_sold,
AVG(fs.unit_price) FILTER (WHERE fs.unit_price > 0) as avg_price
FROM fact_sales fs
JOIN dim_date d ON fs.date_dim_id = d.id
JOIN dim_customer dc ON fs.customer_dim_id = dc.id
JOIN dim_product dp ON fs.product_dim_id = dp.id
WHERE d.year = 2024
AND dc.country = 'US'
GROUP BY 1, 2, 3
ORDER BY 4 DESC;
Snowflake Schema ❄️ (Normalized Dimensions)
Structure: Fact table with hierarchically normalized dimensions.
graph TB
DR["dim_region<br/>region_id, region"]
DD["dim_district<br/>district_id, district<br/>region_id"]
DS["dim_store<br/>store_id, store_name,<br/>district_id"]
DT["dim_date<br/>year, quarter,<br/>month, day"]
FS["FACT_SALES<br/>(Central)"]
DR --> DD
DD --> DS
DT -.-> FS
DS -.-> FS
Advantages
| Advantage | Detail |
|---|---|
| Less redundancy | Dimension attributes (e.g., region) stored once, not repeated |
| Easier maintenance | If region name changes, update once |
| Normalized consistency | Follows 3NF on dimension side |
Disadvantages
| Disadvantage | Impact | |--| | More JOINs required | 1 fact + 3 normalized dimension JOINs = slower | | Harder to optimize | Query optimizer struggles with deep hierarchies | | BI tools struggle | Not the expected schema structure |
Example Query
-- Snowflake: 1 fact + 3+ normalized dimension JOINs
SELECT
d.year,
r.region,
dp.category,
SUM(fs.amount) as revenue
FROM fact_sales fs
JOIN dim_date d ON fs.date_dim_id = d.id
JOIN dim_store ds ON fs.store_dim_id = ds.id
JOIN dim_district dd ON ds.district_id = dd.id
JOIN dim_region dr ON dd.region_id = dr.id -- Extra JOIN!
JOIN dim_product dp ON fs.product_dim_id = dp.id
WHERE d.year = 2024
GROUP BY 1, 2, 3;
Star vs. Snowflake: When to Choose
| Criterion | Star Schema | Snowflake Schema |
|---|---|---|
| Storage efficiency | ❌ Redundant dimension data | ✅ Normalized, compact |
| Query performance | ✅ Fewer JOINs, faster | ❌ More JOINs, slower |
| Maintenance burden | ⚠️ Update redundant values | ✅ Single source per attribute |
| BI tool compatibility | ✅ Expected model | ⚠️ Needs special config |
| Recommended for | Modern OLAP systems | Legacy or specialized cases |
Modern consensus: Star schema is preferred 95% of the time.
- Columnar databases (BigQuery, Snowflake, Databricks) compress redundancy efficiently
- BI tools expect star schemas
- Query performance is more important than storage redundancy
Surrogate vs. Natural Keys
Natural Keys (Business Keys)
D IM_PRODUCT (
product_code VARCHAR PRIMARY KEY, -- "SKU-12345"
product_name VARCHAR,
category VARCHAR
);
FACT_SALES (
product_code VARCHAR REFERENCES dim_product(product_code),
amount DECIMAL
);
Pros: Meaningful, readable, no extra mapping table
Cons: If business key changes (rare but happens), JOIN logic breaks
Surrogate Keys (Artificial Keys)
DIM_PRODUCT (
product_dim_id INT PRIMARY KEY, -- 1, 2, 3, ...
product_code VARCHAR UNIQUE, -- "SKU-12345"
product_name VARCHAR,
category VARCHAR
);
FACT_SALES (
product_dim_id INT REFERENCES dim_product(product_dim_id),
amount DECIMAL
);
Pros: Immutable, supports SCD Type 2 (history), compact integers, flexible
Cons: Extra lookup table needed, one level of indirection
Modern best practice: Use surrogate keys in dimensions → enables SCD Type 2.
Fact Table Types
Transaction Facts
Each row = one business transaction.
- Granularity: One row per transaction
- Frequency: Real-time or daily
- Example: fact_sales (one row per order line item)
Periodic Snapshot Facts
Each row = aggregated state at a point in time.
- Granularity: One row per account + time period
- Frequency: Daily or monthly
- Example: fact_account_balance (daily balance per account)
Cumulative Facts
Each row = lifetime cumulative measurements.
- Granularity: One row per fact occurrence
- Frequency: One-time at completion
- Example: fact_order_fulfillment (one row per order, updated as it progresses)
Summary Table
| Concept | Star | Snowflake | Notes |
|---|---|---|---|
| Dimension normalization | Denormalized | Normalized | Star is standard |
| JOINs needed | Fewer (1 fact + 3-5 dims) | More (1 fact + 5-10+ tables) | More JOINs = slower |
| Redundancy | High (intentional) | Low | Trade-off: redundancy vs. joins |
| BI tool support | ✅ Native | ⚠️ Manual config | BI tools assume star |
| Recommended | ✅ For modern OLAP | ❌ Rarely needed | Unless storage is critical |
Deep-Dives
→ Deep Dive: Dimensional Modeling — Advanced dimensional techniques (conformed dimensions, bridge tables, SCD types).
→ Deep Dive: Schema Design — Practical schema design for different business domains.
→ Deep Dive: Data Warehouse Architecture — How medallion layers organize fact/dimension tables.
Key Takeaways
✅ Dimensional modeling (star schema) is the dominant pattern for OLAP systems.
✅ Facts are additive measurements; dimensions are descriptive attributes.
✅ Star schema wins on query performance; snowflake schema wins on storage (rarely needed in modern systems).
✅ Surrogate keys are essential for tracking dimension history (SCD Type 2).
✅ Denormalization is intentional in data warehouses — trade redundancy for query speed.
Practice Questions
- Why does dimensional modeling work better than normalization for analytics?
- What's the difference between an additive and non-additive measure? Give examples.
- When would you choose snowflake schema over star? (Hint: rarely, and only if storage cost dominates.)
- How does a surrogate key enable tracking dimension history?
- What's the trade-off between query performance and storage efficiency in star vs. snowflake?