10 · Hands-On Labs: Building Data Warehouse Components

Level: Practical / All Levels
Time to read + do: 2-4 hours, depending on labs chosen
Expected outcome: Hands-on experience with medallion architecture, dimensional modeling, ETL patterns, and optimization.


Lab Flow Chart

graph TD
    A["Lab 1: Dimensional Modeling<br/>Design a star schema"]
    B["Lab 2: Medallion Architecture<br/>Bronze → Silver → Gold"]
    C["Lab 3: SCD Type 2 Implementation<br/>Track customer changes"]
    D["Lab 4: Query Optimization<br/>Index & partition strategies"]
    E["Lab 5: ETL/ELT Pipeline<br/>dbt or Airflow"]
    F["Lab 6: Data Quality<br/>Testing framework"]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F

Lab 1: Dimensional Modeling (Beginner, 30 min)

Scenario

You work for an e-commerce company. Design a star schema for:

  • Sales transactions
  • Customer demographics
  • Product catalog
  • Store locations
  • Date/time information

Deliverable

-- Submit SQL DDL creating:
-- 1. fact_sales (with surrogate key)
-- 2. dim_customer (with surrogate key, SCD Type 2)
-- 3. dim_product
-- 4. dim_store
-- 5. dim_date

-- Requirements:
-- - Use surrogate keys (INT auto-increment)
-- - Mark additive measures
-- - Include date/effective/end tracking columns for SCD
-- - Foreign key constraints fact → dimensions
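If you want a starting point, here is a minimal sketch of two of the five tables. Column names are illustrative, and the identity/auto-increment syntax varies by warehouse:

-- Sketch only: adapt types and identity syntax to your warehouse
CREATE TABLE dim_customer (
  customer_key    INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
  customer_id     INT NOT NULL,                                  -- natural/business key
  customer_name   VARCHAR(200),
  segment         VARCHAR(50),
  effective_date  DATE NOT NULL,   -- SCD Type 2 tracking
  end_date        DATE,            -- NULL while the row is current
  is_current      BOOLEAN NOT NULL
);

CREATE TABLE fact_sales (
  sales_key     INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_key  INT NOT NULL REFERENCES dim_customer (customer_key),
  product_key   INT NOT NULL,   -- REFERENCES dim_product (product_key)
  store_key     INT NOT NULL,   -- REFERENCES dim_store (store_key)
  date_key      INT NOT NULL,   -- REFERENCES dim_date (date_key)
  quantity      INT,            -- additive measure
  amount        DECIMAL(12,2)   -- additive measure
);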

Evaluation

✅ Dimensions are denormalized (not snowflaked)
✅ Fact table uses surrogate keys
✅ Additive measures identified correctly
✅ SCD Type 2 columns present (effective_date, end_date, is_current)


Lab 2: Medallion Architecture (Intermediate, 1 hour)

Scenario

Raw customer data arrives daily in CSV. Build the medallion pipeline:

Bronze: Load raw CSV as-is
Silver: Clean, deduplicate, validate
Gold: Customer dimension table (SCD Type 2)

Data

customer_id,name,email,age,country
1,Alice Smith,alice@example.com,35,US
2,Bob Jones,bob@example.com,,UK
1,Alice Smith,alice@example.com,35,US  # Duplicate!
3,Charlie Brown,charlie,45,US  # Invalid email

Deliverable

-- Bronze Layer: Load as-is
CREATE TABLE bronze.customers (...);
INSERT INTO bronze.customers 
  SELECT * FROM your_csv;

-- Silver Layer: Clean & validate
CREATE TABLE silver.customers (...);
MERGE INTO silver.customers USING bronze.customers ON (...)
  -- Deduplicate
  -- Validate email format
  -- Handle nulls

-- Gold Layer: Dimensional model (SCD Type 2)
CREATE TABLE gold.dim_customer (...);
MERGE INTO gold.dim_customer USING silver.customers ON (...)
  -- Add surrogate key
  -- Track history with effective_date
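For the Silver cleaning step, one possible shape is shown below. It is written as a standalone CREATE TABLE AS for the first load (daily incremental runs would use the MERGE skeleton above instead), and the email rule and tie-breaking order are illustrative:

-- Silver sketch: keep one row per customer_id, null out invalid emails, default missing country
CREATE TABLE silver.customers AS
WITH ranked AS (
  SELECT
    customer_id, name, email, age, country,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY name) AS rn  -- order is arbitrary for exact duplicates
  FROM bronze.customers
)
SELECT
  customer_id,
  name,
  CASE WHEN email LIKE '%@%.%' THEN LOWER(email) ELSE NULL END AS email,  -- crude validity rule
  age,                                                                    -- NULL ages kept as-is
  COALESCE(country, 'UNKNOWN') AS country
FROM ranked
WHERE rn = 1;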

Evaluation

✅ Bronze is append-only copy
✅ Silver deduplicates and validates
✅ Gold uses SCD Type 2 with date tracking
✅ Proper data lineage (Bronze → Silver → Gold)


Lab 3: SCD Type 2 Implementation (Intermediate, 45 min)

Scenario

Customer "Alice" (ID=1) changes her segment from "Standard" to "Premium".
Implement SCD Type 2 to track this change.

Initial Data

INSERT INTO gold.dim_customer VALUES 
  (1, 1, 'Alice', 'Standard', '2024-01-01', NULL, TRUE);

Change Arrives

Field: segment changes from 'Standard' to 'Premium'
Timestamp: 2024-07-01

Deliverable

-- Implement MERGE statement that:
-- 1. Closes old record (is_current = FALSE, end_date = TODAY)
-- 2. Inserts new record (is_current = TRUE, end_date = NULL)
-- 3. Is idempotent (running twice = running once)

MERGE INTO gold.dim_customer t
USING (...) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND (t.segment != s.segment) THEN
  UPDATE SET is_current = FALSE, end_date = ...
WHEN NOT MATCHED THEN
  INSERT ...;

-- Query: Historical accuracy
SELECT customer_id, segment, effective_date, end_date
FROM gold.dim_customer
WHERE customer_id = 1
ORDER BY effective_date;

-- Expected output:
-- | customer_id | segment  | effective_date | end_date   |
-- | 1           | Standard | 2024-01-01     | 2024-07-01 |
-- | 1           | Premium  | 2024-07-01     | NULL       |
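One caveat: with the skeleton above, the changed source row matches the ON clause, so WHEN MATCHED closes the old version but WHEN NOT MATCHED never fires to insert the new one. A common workaround is two statements run back to back. The sketch below assumes the incoming change is staged in a table called stg_customer_changes and that customer_key is auto-generated; both are assumptions:

-- Step 1: close the current version for customers whose segment changed
UPDATE gold.dim_customer t
SET is_current = FALSE,
    end_date   = s.change_date
FROM stg_customer_changes s            -- hypothetical staging table
WHERE t.customer_id = s.customer_id
  AND t.is_current = TRUE
  AND t.segment <> s.segment;

-- Step 2: insert the new version; the NOT EXISTS guard keeps the pair idempotent
INSERT INTO gold.dim_customer (customer_id, name, segment, effective_date, end_date, is_current)
SELECT s.customer_id, s.name, s.segment, s.change_date, NULL, TRUE
FROM stg_customer_changes s
WHERE NOT EXISTS (
  SELECT 1
  FROM gold.dim_customer t
  WHERE t.customer_id = s.customer_id
    AND t.is_current = TRUE
    AND t.segment = s.segment
);

An alternative is a single MERGE whose source UNIONs each changed row twice: once so it matches and closes the old version, and once with a NULL join key so it falls into the NOT MATCHED branch and is inserted.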

Evaluation

✅ MERGE is correctly structured
✅ Old record closed with end_date
✅ New record inserted with effective_date
✅ Query returns both historical versions
✅ Idempotent (running again produces no duplicates)


Lab 4: Query Optimization (Intermediate, 1 hour)

Scenario

This query is slow (it scans millions of rows):

SELECT 
  customer_id, 
  SUM(amount) as total
FROM unoptimized_fact_sales
WHERE sale_date BETWEEN '2024-06-01' AND '2024-06-30'
GROUP BY customer_id
LIMIT 10;

Deliverable

Apply optimization strategies:

-- Step 1: Create optimized table with partitioning & clustering
CREATE TABLE optimized_fact_sales
PARTITION BY DATE(sale_date)
CLUSTER BY customer_id
AS SELECT * FROM unoptimized_fact_sales;

-- Step 2: Add materialized view for common aggregation
CREATE MATERIALIZED VIEW mv_customer_revenue AS
SELECT 
  DATE_TRUNC(sale_date, MONTH) as month,  -- BigQuery argument order; e.g. Snowflake uses DATE_TRUNC('month', sale_date)
  customer_id,
  SUM(amount) as total,
  COUNT(*) as num_transactions
FROM optimized_fact_sales
GROUP BY 1, 2;

-- Step 3: Rewrite query to use MV
SELECT customer_id, total
FROM mv_customer_revenue
WHERE month = '2024-06-01'
LIMIT 10;

-- Step 4: Measure improvement
-- Before: 50 seconds, 500 GB scanned
-- After: 0.2 seconds, 10 MB scanned
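To produce your own before/after numbers instead of the illustrative ones above, read them from the warehouse's query history. For example, in BigQuery (the region qualifier depends on your dataset location); Snowflake exposes the same information through its QUERY_HISTORY views:

-- Recent queries with bytes scanned and elapsed time (BigQuery)
SELECT
  creation_time,
  total_bytes_processed / POW(10, 9)                          AS gb_scanned,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) / 1000.0  AS seconds,
  SUBSTR(query, 1, 80)                                        AS query_snippet
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY creation_time DESC;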

Evaluation

✅ Partition by DATE
✅ Cluster by frequently-filtered column
✅ Materialized view for aggregation
✅ Queries select only needed columns
✅ Performance improvement documented


Lab 5: Data Pipeline with dbt (Advanced, 1.5 hours)

Scenario

Build an ELT pipeline using dbt:
Raw data → Staging → Marts

Setup

# Create dbt project
dbt init warehouse_project
cd warehouse_project

Deliverable: dbt Project Structure

models/
├─ staging/
│  ├─ stg_customers.sql
│  ├─ stg_orders.sql
│  └─ _stg_sources.yml
├─ marts/
│  ├─ dim_customer.sql
│  ├─ fact_sales.sql
│  └─ _mart_tests.yml
└─ _sources.yml  # Source definitions

tests/
└─ generic_tests.sql

dbt_project.yml

dbt Model Examples

-- models/staging/stg_customers.sql
WITH source_data AS (
    SELECT * FROM {{ source('raw', 'customers') }}
),

transformed AS (
    SELECT 
        customer_id,
        LOWER(email) as email,
        CASE 
            WHEN age < 18 THEN 'teen'
            WHEN age < 65 THEN 'adult'
            ELSE 'senior'
        END as age_group,
        created_date,
        CURRENT_TIMESTAMP() as dbt_loaded_at
    FROM source_data
    WHERE customer_id IS NOT NULL
)

SELECT * FROM transformed

-- models/marts/dim_customer.sql
{{ config(
    materialized='table',
    indexes=[
        {'columns': ['customer_id'], 'unique': true}
    ]
) }}

WITH customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
)

SELECT 
    ROW_NUMBER() OVER (ORDER BY customer_id) as customer_key,
    customer_id,
    email,
    age_group,
    created_date,
    CURRENT_TIMESTAMP() as dbt_loaded_at
FROM customers

dbt Tests

# models/marts/_mart_tests.yml
version: 2

models:
  - name: dim_customer
    tests:
      - dbt_utils.recency:
          datepart: day
          field: dbt_loaded_at
          interval: 1

    columns:
      - name: customer_key
        tests:
          - unique
          - not_null

      - name: email
        tests:
          - unique
          - dbt_expectations.expect_column_values_to_match_regex:  # requires the dbt_expectations package
              regex: '^.+@.+\..+$'

Evaluation

✅ DAG shows correct dependencies (staging → marts)
✅ Transformations use {{ ref() }} for lineage
✅ Tests run and pass
✅ dbt docs generate produces documentation
✅ Lineage graph (in dbt docs) shows upstream sources


Lab 6: Data Quality Framework (Advanced, 1.5 hours)

Scenario

Implement a data quality monitoring system for the warehouse.

Deliverable

# data_quality.py
from datetime import datetime

class DataQualityMonitor:
    def __init__(self, warehouse_connection):
        self.conn = warehouse_connection

    def run_quality_checks(self, table_name):
        metrics = {}

        # Check 1: Row count
        result = self.conn.execute(f"SELECT COUNT(*) as cnt FROM {table_name}")
        metrics['row_count'] = result[0]['cnt']

        # Check 2: Null check
        result = self.conn.execute(
            f"SELECT COUNT(*) as null_count FROM {table_name} WHERE key_col IS NULL"
        )
        metrics['null_percent'] = (result[0]['null_count'] / metrics['row_count']) * 100

        # Check 3: Uniqueness
        result = self.conn.execute(
            f"SELECT COUNT(DISTINCT id) as unique_cnt FROM {table_name}"
        )
        metrics['uniqueness_percent'] = (result[0]['unique_cnt'] / metrics['row_count']) * 100

        # Check 4: Freshness
        result = self.conn.execute(
            f"SELECT MAX(load_timestamp) as last_load FROM {table_name}"
        )
        hours_old = (datetime.now() - result[0]['last_load']).total_seconds() / 3600
        metrics['freshness_hours'] = hours_old

        return metrics

    def generate_quality_report(self, table_name, thresholds):
        metrics = self.run_quality_checks(table_name)

        status = 'PASS'
        issues = []

        if metrics['null_percent'] > thresholds['null_percent']:
            status = 'WARN'
            issues.append(f"Nulls: {metrics['null_percent']:.1f}%")

        if metrics['uniqueness_percent'] < thresholds['uniqueness_percent']:
            status = 'FAIL'
            issues.append(f"Duplicates detected")

        if metrics['freshness_hours'] > thresholds['freshness_hours']:
            status = 'WARN'
            issues.append(f"Stale: {metrics['freshness_hours']:.1f} hours old")

        return {
            'table': table_name,
            'timestamp': datetime.now(),
            'status': status,
            'metrics': metrics,
            'issues': issues
        }

# Usage
monitor = DataQualityMonitor(warehouse_conn)
report = monitor.generate_quality_report(
    'gold.fact_sales',
    thresholds={
        'null_percent': 1.0,
        'uniqueness_percent': 99.0,
        'freshness_hours': 4.0
    }
)

print(f"Status: {report['status']}")
print(f"Issues: {report['issues']}")

Evaluation

✅ Monitors row count
✅ Detects nulls and duplicates
✅ Tracks freshness (SLA violations)
✅ Generates alertable reports
✅ Extensible to new checks


Running the Labs

Lab Prerequisites

# Set up environments
pip install dbt-bigquery  # OR dbt-snowflake, dbt-postgres
dbt --version

# Or use online playgrounds:
# - BigQuery: cloud.google.com/bigquery (free sandbox tier with public datasets)
# - Snowflake: www.snowflake.com/trial (free trial, 30 days)

Suggested Lab Order

  1. Start with Lab 1 (Dimensional Modeling) — foundational SQL
  2. Add Lab 2 (Medallion) — understand layering
  3. Lab 3 (SCD Type 2) — deepen modeling skills
  4. Lab 4 (Query Optimization) — pragmatic optimization
  5. Lab 5 (dbt Pipeline) — production-level tooling
  6. Lab 6 (Quality Framework) — enterprise concerns

Lab Submission & Evaluation

What to Submit

  1. SQL DDL (schema definitions)
  2. dbt project (if applicable)
  3. Performance metrics (before/after optimization)
  4. Data quality report
  5. Brief explanation of design decisions

Grading Rubric

Criteria                               Weight
Functional correctness                 40%
Code quality (readable, documented)    30%
Design choices (justified decisions)   20%
Performance (optimizations applied)    10%

Key Takeaways from Labs

  • Hands-on experience matters more than theory.
  • Data quality isn't optional — build it from the start.
  • SCD Type 2 is essential for historical accuracy.
  • dbt is the modern standard for SQL transformations.
  • Optimization is empirical — measure before and after.
  • Monitoring catches problems before users notice.


Next Steps

After completing labs:

  1. Pick one tool (dbt, Airflow, Spark) and go deep
  2. Build a real project (even if hobby project)
  3. Review Section 11 (Interview Scenarios) for synthesis