10 · Hands-On Labs: Building Data Warehouse Components
Level: Practical / All Levels
Time to read + do: 2-4 hours depending on labs chosen
Expected outcome: Hands-on experience with medallion architecture, dimensional modeling, ETL patterns, and optimization.
Lab Flow Chart
graph TD
A["Lab 1: Dimensional Modeling<br/>Design a star schema"]
B["Lab 2: Medallion Architecture<br/>Bronze → Silver → Gold"]
C["Lab 3: SCD Type 2 Implementation<br/>Track customer changes"]
D["Lab 4: Query Optimization<br/>Index & partition strategies"]
E["Lab 5: ETL/ELT Pipeline<br/>dbt or Airflow"]
F["Lab 6: Data Quality<br/>Testing framework"]
A --> B
B --> C
C --> D
D --> E
E --> F
Lab 1: Dimensional Modeling (Beginner, 30 min)
Scenario
You work for an e-commerce company. Design a star schema for:
- Sales transactions
- Customer demographics
- Product catalog
- Store locations
- Date/time information
Deliverable
-- Submit SQL DDL creating:
-- 1. fact_sales (with surrogate key)
-- 2. dim_customer (with surrogate key, SCD Type 2)
-- 3. dim_product
-- 4. dim_store
-- 5. dim_date
-- Requirements:
-- - Use surrogate keys (INT auto-increment)
-- - Mark additive measures
-- - Include effective_date / end_date / is_current tracking columns for SCD Type 2
-- - Foreign key constraints fact → dimensions
Evaluation
✅ Dimensions are denormalized (not snowflaked)
✅ Fact table uses surrogate keys
✅ Additive measures identified correctly
✅ SCD Type 2 columns present (effective_date, end_date, is_current)
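If you want a skeleton to start from, here is a minimal sketch of the customer dimension and the fact table only (generic SQL with hypothetical column names; adapt the identity/auto-increment and REFERENCES syntax to your warehouse):

CREATE TABLE dim_customer (
    customer_key    INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    customer_id     INT NOT NULL,        -- natural / business key
    customer_name   VARCHAR(200),
    email           VARCHAR(200),
    country         VARCHAR(50),
    effective_date  DATE NOT NULL,       -- SCD Type 2: when this version became active
    end_date        DATE,                -- NULL while the version is current
    is_current      BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE TABLE fact_sales (
    sales_key       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_key    INT NOT NULL REFERENCES dim_customer (customer_key),
    product_key     INT NOT NULL,        -- REFERENCES dim_product (product_key)
    store_key       INT NOT NULL,        -- REFERENCES dim_store (store_key)
    date_key        INT NOT NULL,        -- REFERENCES dim_date (date_key)
    quantity        INT,                 -- additive measure
    amount          DECIMAL(12, 2)       -- additive measure
);

dim_product, dim_store, and dim_date follow the same pattern; only dim_customer needs the SCD Type 2 columns here.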
Lab 2: Medallion Architecture (Intermediate, 1 hour)
Scenario
Raw customer data arrives daily in CSV. Build the medallion pipeline:
Bronze: Load raw CSV as-is
Silver: Clean, deduplicate, validate
Gold: Customer dimension table (SCD Type 2)
Data
customer_id,name,email,age,country
1,Alice Smith,alice@example.com,35,US
2,Bob Jones,bob@example.com,,UK
1,Alice Smith,alice@example.com,35,US # Duplicate!
3,Charlie Brown,charlie,45,US # Invalid email
Deliverable
-- Bronze Layer: Load as-is
CREATE TABLE bronze.customers (...);
INSERT INTO bronze.customers
SELECT * FROM your_csv;
-- Silver Layer: Clean & validate
CREATE TABLE silver.customers (...);
MERGE INTO silver.customers USING bronze.customers;
-- Deduplicate
-- Validate email format
-- Handle nulls
-- Gold Layer: Dimensional model (SCD Type 2)
CREATE TABLE gold.dim_customer (...);
MERGE INTO gold.dim_customer USING silver.customers;
-- Add surrogate key
-- Track history with effective_date
Evaluation
✅ Bronze is an append-only copy of the raw data
✅ Silver deduplicates and validates
✅ Gold uses SCD Type 2 with date tracking
✅ Proper data lineage (Bronze → Silver → Gold)
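For the Silver step, one possible shape of the dedup-and-validate query, sketched in generic SQL (the regex function and the load_timestamp / email_is_valid columns are assumptions; Snowflake uses REGEXP_LIKE, BigQuery uses REGEXP_CONTAINS):

-- Silver: keep the latest row per customer_id and flag invalid emails
INSERT INTO silver.customers (customer_id, name, email, age, country, email_is_valid)
SELECT
    customer_id,
    name,
    LOWER(email)                        AS email,
    age,                                -- may be NULL; decide a policy downstream
    country,
    REGEXP_LIKE(email, '^.+@.+\..+$')   AS email_is_valid
FROM (
    SELECT
        b.*,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY load_timestamp DESC   -- hypothetical Bronze audit column
        ) AS rn
    FROM bronze.customers AS b
) AS dedup
WHERE rn = 1;   -- deduplicate: one row per business key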
Lab 3: SCD Type 2 Implementation (Intermediate, 45 min)
Scenario
Customer "Alice" (ID=1) changes her segment from "Standard" to "Premium".
Implement SCD Type 2 to track this change.
Initial Data
| customer_id | segment | effective_date | end_date | is_current |
|---|---|---|---|---|
| 1 | Standard | 2024-01-01 | NULL | TRUE |
Change Arrives
| customer_id | segment | change_date |
|---|---|---|
| 1 | Premium | 2024-07-01 |
Deliverable
-- Implement MERGE statement that:
-- 1. Closes old record (is_current = FALSE, end_date = TODAY)
-- 2. Inserts new record (is_current = TRUE, end_date = NULL)
-- 3. Is idempotent (running twice = running once)
MERGE INTO gold.dim_customer t
USING (...) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND (t.segment != s.segment) THEN
UPDATE SET is_current = FALSE, end_date = ...
WHEN NOT MATCHED THEN
INSERT ...;
-- Query: Historical accuracy
SELECT customer_id, segment, effective_date, end_date
FROM gold.dim_customer
WHERE customer_id = 1
ORDER BY effective_date;
-- Expected output:
-- | customer_id | segment  | effective_date | end_date   |
-- | 1           | Standard | 2024-01-01     | 2024-07-01 |
-- | 1           | Premium  | 2024-07-01     | NULL       |
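Note that a single MERGE keyed on is_current = TRUE closes the old row but will not also insert the replacement row for a changed customer (WHEN NOT MATCHED only fires for brand-new business keys). One common pattern that satisfies all three requirements is a two-step flow, sketched here against a hypothetical staging.customer_changes table with a change_date column:

-- Step 1: close the current version when the tracked attribute changed
MERGE INTO gold.dim_customer t
USING staging.customer_changes s
    ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND t.segment != s.segment THEN
    UPDATE SET is_current = FALSE, end_date = s.change_date;

-- Step 2: insert a new current version for changed or brand-new customers
INSERT INTO gold.dim_customer (customer_id, segment, effective_date, end_date, is_current)
SELECT s.customer_id, s.segment, s.change_date, NULL, TRUE
FROM staging.customer_changes s
LEFT JOIN gold.dim_customer t
    ON t.customer_id = s.customer_id
    AND t.is_current = TRUE
    AND t.segment = s.segment
WHERE t.customer_id IS NULL;   -- an identical current row already exists => re-runs insert nothing

Re-running both statements against the same staging data changes nothing, which is the idempotency the lab asks for.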
Evaluation
✅ MERGE is correctly structured
✅ Old record closed with end_date
✅ New record inserted with effective_date
✅ Query returns both historical versions
✅ Idempotent (running again produces no duplicates)
Lab 4: Query Optimization (Intermediate, 1 hour)
Scenario
This query is slow (it scans millions of rows):
SELECT
customer_id,
SUM(amount) as total
FROM unoptimized_fact_sales
WHERE sale_date BETWEEN '2024-06-01' AND '2024-06-30'
GROUP BY customer_id
LIMIT 10;
Deliverable
Apply optimization strategies:
-- Step 1: Create optimized table with partitioning & clustering
CREATE TABLE optimized_fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_id
AS SELECT * FROM unoptimized_fact_sales;
-- Step 2: Add materialized view for common aggregation
CREATE MATERIALIZED VIEW mv_customer_revenue AS
SELECT
DATE_TRUNC(sale_date, MONTH) as month,
customer_id,
SUM(amount) as total,
COUNT(*) as num_transactions
FROM optimized_fact_sales
GROUP BY 1, 2;
-- Step 3: Rewrite query to use MV
SELECT customer_id, total
FROM mv_customer_revenue
WHERE month = '2024-06-01'
LIMIT 10;
-- Step 4: Measure improvement (record your own before/after numbers)
-- Example before: 50 seconds, 500 GB scanned
-- Example after: 0.2 seconds, 10 MB scanned
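To document the improvement, pull elapsed time and bytes scanned from the warehouse's query history instead of eyeballing it. A sketch for BigQuery (assumes the region-us INFORMATION_SCHEMA.JOBS_BY_PROJECT view; Snowflake and other warehouses expose similar query-history views):

SELECT
    creation_time,
    total_bytes_processed / POW(10, 9)                          AS gb_scanned,
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) / 1000.0  AS seconds_elapsed,
    query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY creation_time DESC;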
Evaluation
✅ Partition by DATE
✅ Cluster by frequently-filtered column
✅ Materialized view for aggregation
✅ Queries select only needed columns
✅ Performance improvement documented
Lab 5: Data Pipeline with dbt (Advanced, 1.5 hours)
Scenario
Build an ELT pipeline using dbt:
Raw data → Staging → Marts
Setup
Install dbt with the adapter for your warehouse (see Lab Prerequisites at the end of this section), run dbt init to scaffold a project, and add dbt_utils to packages.yml followed by dbt deps, since the tests below depend on it.
Deliverable: dbt Project Structure
models/
├─ staging/
│ ├─ stg_customers.sql
│ ├─ stg_orders.sql
│ └─ _stg_sources.yml
│
├─ marts/
│ ├─ dim_customer.sql
│ ├─ fact_sales.sql
│ └─ _mart_tests.yml
│
└─ _sources.yml # Source definitions
tests/
└─ generic_tests.sql
dbt_project.yml
dbt Model Examples
-- models/staging/stg_customers.sql
WITH source_data AS (
SELECT * FROM {{ source('raw', 'customers') }}
),
transformed AS (
SELECT
customer_id,
LOWER(email) as email,
CASE
WHEN age IS NULL THEN 'unknown'
WHEN age < 18 THEN 'teen'
WHEN age < 65 THEN 'adult'
ELSE 'senior'
END as age_group,
created_date,
CURRENT_TIMESTAMP() as dbt_loaded_at
FROM source_data
WHERE customer_id IS NOT NULL
)
SELECT * FROM transformed
-- models/marts/dim_customer.sql
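-- Note: the indexes config below is honored by adapters such as dbt-postgres;
-- warehouses without indexes (BigQuery, Snowflake) typically ignore it.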
{{ config(
materialized='table',
indexes=[
{'columns': ['customer_id'], 'unique': true}
]
) }}
WITH customers AS (
SELECT * FROM {{ ref('stg_customers') }}
)
SELECT
ROW_NUMBER() OVER (ORDER BY customer_id) as customer_key,
customer_id,
email,
age_group,
created_date,
CURRENT_TIMESTAMP() as dbt_loaded_at
FROM customers
dbt Tests
# models/_mart_tests.yml
version: 2
models:
- name: dim_customer
tests:
- dbt_utils.recency:
datepart: day
field: dbt_loaded_at
interval: 1
columns:
- name: customer_key
tests:
- unique
- not_null
- name: email
tests:
- unique
- regex_match:
expression: '^.+@.+\..+$'
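regex_match is not a dbt built-in; it is presumably the custom generic test hinted at by tests/generic_tests.sql in the project tree (custom generic tests live under tests/generic/ or macros/). A sketch of one way to define it, assuming BigQuery's REGEXP_CONTAINS (swap in your warehouse's regex function); alternatively, the dbt-expectations package ships expect_column_values_to_match_regex:

-- tests/generic/regex_match.sql (hypothetical custom generic test)
{% test regex_match(model, column_name, expression) %}
    -- dbt treats any returned rows as failures
    SELECT {{ column_name }}
    FROM {{ model }}
    WHERE {{ column_name }} IS NOT NULL
      AND NOT REGEXP_CONTAINS({{ column_name }}, r'{{ expression }}')
{% endtest %}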
Evaluation
✅ DAG shows correct dependencies (staging → marts)
✅ Transformations use {{ ref() }} for lineage
✅ Tests run and pass
✅ dbt docs generate produces documentation
✅ Lineage graph (dbt docs) shows upstream sources
Lab 6: Data Quality Framework (Advanced, 1.5 hours)
Scenario
Implement a data quality monitoring system for the warehouse.
Deliverable
# data_quality.py
import pandas as pd
from datetime import datetime
class DataQualityMonitor:
def __init__(self, warehouse_connection):
self.conn = warehouse_connection
def run_quality_checks(self, table_name):
metrics = {}
# Check 1: Row count
result = self.conn.execute(f"SELECT COUNT(*) as cnt FROM {table_name}")
metrics['row_count'] = result[0]['cnt']
# Check 2: Null check
result = self.conn.execute(
f"SELECT COUNT(*) as null_count FROM {table_name} WHERE key_col IS NULL"
)
metrics['null_percent'] = (result[0]['null_count'] / metrics['row_count']) * 100
# Check 3: Uniqueness
result = self.conn.execute(
f"SELECT COUNT(DISTINCT id) as unique_cnt FROM {table_name}"
)
metrics['uniqueness_percent'] = (result[0]['unique_cnt'] / metrics['row_count']) * 100
# Check 4: Freshness
result = self.conn.execute(
f"SELECT MAX(load_timestamp) as last_load FROM {table_name}"
)
hours_old = (datetime.now() - result[0]['last_load']).total_seconds() / 3600
metrics['freshness_hours'] = hours_old
return metrics
def generate_quality_report(self, table_name, thresholds):
metrics = self.run_quality_checks(table_name)
status = 'PASS'
issues = []
if metrics['null_percent'] > thresholds['null_percent']:
status = 'WARN'
issues.append(f"Nulls: {metrics['null_percent']:.1f}%")
if metrics['uniqueness_percent'] < thresholds['uniqueness_percent']:
status = 'FAIL'
issues.append(f"Duplicates detected")
if metrics['freshness_hours'] > thresholds['freshness_hours']:
status = 'WARN'
issues.append(f"Stale: {metrics['freshness_hours']:.1f} hours old")
return {
'table': table_name,
'timestamp': datetime.now(),
'status': status,
'metrics': metrics,
'issues': issues
}
# Usage
monitor = DataQualityMonitor(warehouse_conn)
report = monitor.generate_quality_report(
'gold.fact_sales',
thresholds={
'null_percent': 1.0,
'uniqueness_percent': 99.0,
'freshness_hours': 4.0
}
)
print(f"Status: {report['status']}")
print(f"Issues: {report['issues']}")
Evaluation
✅ Monitors row count
✅ Detects nulls and duplicates
✅ Tracks freshness (SLA violations)
✅ Generates alertable reports
✅ Extensible to new checks
Running the Labs
Lab Prerequisites
# Set up environments
pip install dbt-bigquery # OR dbt-snowflake, dbt-postgres
dbt --version
# Or use online playgrounds:
# - BigQuery: cloud.google.com/bigquery (free sandbox tier available)
# - Snowflake: www.snowflake.com/trial (free trial, 30 days)
Suggested Lab Order
- Start with Lab 1 (Dimensional Modeling) — foundational SQL
- Add Lab 2 (Medallion) — understand layering
- Lab 3 (SCD Type 2) — deepen modeling skills
- Lab 4 (Query Optimization) — pragmatic optimization
- Lab 5 (dbt Pipeline) — production-level tooling
- Lab 6 (Quality Framework) — enterprise concerns
Lab Submission & Evaluation
What to Submit
- SQL DDL (schema definitions)
- dbt project (if applicable)
- Performance metrics (before/after optimization)
- Data quality report
- Brief explanation of design decisions
Grading Rubric
| Criteria | Points |
|---|---|
| Functional correctness | 40% |
| Code quality (readable, documented) | 30% |
| Design choices (justified decisions) | 20% |
| Performance (optimizations applied) | 10% |
Key Takeaways from Labs
✅ Hands-on experience matters more than theory.
✅ Data quality isn't optional — build it from the start.
✅ SCD Type 2 is essential for historical accuracy.
✅ dbt is the modern standard for SQL transformations.
✅ Optimization is empirical — measure before and after.
✅ Monitoring catches problems before users notice.
Next Steps
After completing labs:
- Pick one tool (dbt, Airflow, Spark) and go deep
- Build a real project (even if hobby project)
- Review Section 11 (Interview Scenarios) for synthesis