03 · Medallion Architecture: Bronze, Silver, Gold Layers

Level: Intermediate
Time to read: 20 min
Pre-reading: 01 · Foundations · 02 · Warehouse Architecture
After reading: You'll understand the modern layered approach to organizing data warehouses that balances flexibility, governance, and performance.


The Problem with Monolithic Data Warehouses

Traditional Single-Layer Approach

graph LR
    A["Operational<br/>Systems"] -->|ETL| B["Data Warehouse<br/>(All-Purpose)"]
    B --> C["BI & Analytics"]
    B -->|Bug found| D["Manual Fixes & Reruns"]

Problems:

  1. No clear separation between raw, cleaned, and business-ready data
  2. Debugging is hard: when analytics break, the root cause is difficult to trace
  3. Data quality issues propagate to all downstream consumers
  4. Difficult to define ownership (raw data owner vs. business logic owner)
  5. Governance is all-or-nothing (no intermediate validation layer)

The Medallion Solution

graph LR
    A["Operational<br/>Systems"]
    B["🥉 Bronze<br/>(Raw)"]
    C["🥈 Silver<br/>(Cleaned)"]
    D["🥇 Gold<br/>(Business)"]
    E["BI & Analytics"]

    A -->|"ELT: Extract & Load<br/>(minimal transformation)"| B
    B -->|"Data quality,<br/>deduplication"| C
    C -->|"Business logic,<br/>conformed dims"| D
    D -->|"Validated, reliable"| E

Layer 1: Bronze Layer (Raw Data)

Purpose: Ingest data from operational systems in its rawest form — minimal transformation.

Characteristics

| Aspect | Detail |
|---|---|
| Data Quality | As-is from source (quality is not guaranteed) |
| Schema | Source schema + technical metadata (extraction date, source ID) |
| Duplication | Duplicate rows possible (not deduplicated yet) |
| Update Strategy | Append-only (full history retained; no updates or deletes) |
| Ownership | Platform/DataOps team |
| Consumer Grade | Not for end users; for technical teams and debugging |

Example: Customer Bronze Table

-- BRONZE: Raw dump from source system
CREATE TABLE IF NOT EXISTS bronze.customers (
  _elt_timestamp TIMESTAMP,         -- When loaded
  _elt_source_id STRING,            -- Which source system

  -- Exact columns from source
  customer_id INT,
  name STRING,
  email STRING,
  phone STRING,
  address STRING,
  age INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

-- Append-only: preserve all versions for audit
INSERT INTO bronze.customers
SELECT 
  current_timestamp() as _elt_timestamp,
  'source_system_1' as _elt_source_id,
  t.* FROM raw_source.customers t;

Design Principles

Minimize transformation

  • Extract exactly as-is from source
  • Add only technical metadata (timestamp, source ID)

Append-only pattern

  • Never UPDATE/DELETE raw data
  • Preserves history for debugging and compliance

Fault tolerance

  • If downstream fails, re-run from bronze without re-extracting source
  • Bronze acts as a replayable copy of the source system

Schema flexibility

  • Schema-on-read acceptable (messy schemas okay)
  • Technical team manually documents deviations
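The schema-on-read idea can be sketched as a query that extracts fields from a raw payload only at query time. The table and JSON paths below are hypothetical; `get_json_object` is Spark SQL, and other engines offer equivalents such as `JSON_VALUE`.

```sql
-- Schema-on-read sketch: Bronze stores the payload as an unparsed
-- string; fields are pulled out when queried (hypothetical table/paths)
SELECT
  _elt_timestamp,
  get_json_object(raw_payload, '$.customer.email')   AS email,
  get_json_object(raw_payload, '$.customer.country') AS country
FROM bronze.events_raw;
```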

What Bronze Saves

If analytical logic breaks:

1. Check Silver layer (was data quality the issue?)
2. Check Bronze layer (was source data wrong?)
→ No need to re-extract from production system
→ Can trace root cause to source or transformation
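As a sketch, assuming the tables from this chapter and a hypothetical load window, a replay means deleting and rebuilding the affected Silver rows straight from Bronze:

```sql
-- Replay a broken load window from Bronze; the production source is
-- never touched. The window boundary is a made-up example value.
DELETE FROM silver.customers
WHERE _elt_timestamp >= TIMESTAMP '2024-06-01 00:00:00';

-- Re-run the (fixed) Bronze → Silver transformation over that window;
-- only a few columns are shown here for brevity
INSERT INTO silver.customers (customer_id, customer_key, email, _elt_timestamp)
SELECT
  customer_id,
  CONCAT(name, '_', customer_id),
  LOWER(email),
  current_timestamp()
FROM bronze.customers
WHERE _elt_timestamp >= TIMESTAMP '2024-06-01 00:00:00';
```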


Layer 2: Silver Layer (Cleansed & Integrated)

Purpose: Standardize, deduplicate, and validate data for business use.

Characteristics

| Aspect | Detail |
|---|---|
| Data Quality | Validated (nulls handled, duplicates removed, constraints checked) |
| Schema | Business-standard schema (consistent naming, types) |
| Duplication | Deduplicated (business keys unique) |
| Update Strategy | Primarily SCD Type 1 (update in place) or Type 2 (track history) |
| Ownership | Data quality team + analytics team |
| Consumer Grade | Yes; internal tools and data scientists can use directly |

Example: Customer Silver Table

-- SILVER: Clean, deduplicated, business-ready
CREATE TABLE IF NOT EXISTS silver.customers (
  customer_id INT,
  customer_key STRING UNIQUE,      -- Deduplication key

  -- Clean, standardized columns
  first_name STRING,
  last_name STRING,
  email STRING,
  country_code STRING,             -- Validated against domain list
  age_group STRING,                -- Binned from age with null handling
  created_date DATE,               -- Parse to DATE, handle nulls

  -- Data quality flags
  is_valid BOOLEAN,                -- Passed all validations
  data_quality_issues STRING,      -- CSV of problems (if any)

  -- SCD Type 2: Track changes
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN,

  _elt_timestamp TIMESTAMP
);

-- Transformation: Bronze → Silver
MERGE INTO silver.customers t
USING (
  SELECT 
    customer_id,
    CONCAT(name, '_', customer_id) as customer_key,
    SPLIT_PART(name, ' ', 1) as first_name,
    SPLIT_PART(name, ' ', 2) as last_name,
    LOWER(email) as email,
    COALESCE(UPPER(SUBSTRING(address, -2)), 'XX') as country_code,

    CASE 
      WHEN age IS NULL THEN 'Unknown'
      WHEN age < 18 THEN 'Teen'
      WHEN age < 65 THEN 'Adult'
      ELSE 'Senior'
    END as age_group,

    CAST(created_at AS DATE) as created_date,

    -- Data quality check
    CASE 
      WHEN email IS NULL OR email NOT LIKE '%@%' THEN FALSE 
      WHEN customer_id IS NULL THEN FALSE 
      ELSE TRUE 
    END as is_valid,

    CASE 
      WHEN email IS NULL THEN 'missing_email'
      WHEN email NOT LIKE '%@%' THEN 'invalid_email'
      ELSE NULL 
    END as data_quality_issues,

    current_date() as effective_date,
    null as end_date,
    true as is_current,
    current_timestamp() as _elt_timestamp
  FROM bronze.customers b
) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND (
  t.email != s.email OR
  t.age_group != s.age_group
) THEN UPDATE SET
  is_current = FALSE,
  end_date = current_date()
WHEN NOT MATCHED THEN INSERT *;
-- Caveat: a single MERGE expires the old version of a changed customer
-- but cannot also insert its new version in the same pass; a second
-- INSERT (or the "NULL merge key" union trick) adds the fresh current rows.

Key Responsibilities

| Task | Owner | Frequency |
|---|---|---|
| Deduplication | Quality team | Daily (detect duplicates from source) |
| Type validation | Quality team | Daily (ensure DATE is DATE, not string) |
| Domain validation | Quality team | Daily (country codes valid, emails formatted) |
| Null handling | Quality team | Daily (replace with defaults or flags) |
| SCD Type 2 | Analytics team | Daily (track changes for historical analysis) |
| Integration | Analytics team | Weekly (join multiple source dimensions) |
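The deduplication task is commonly implemented with a window function: keep only the newest Bronze row per business key. A minimal sketch, assuming the Bronze table from earlier:

```sql
-- One row per customer_id: the most recently loaded version wins
SELECT * FROM (
  SELECT
    b.*,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY _elt_timestamp DESC
    ) AS rn
  FROM bronze.customers b
) latest
WHERE rn = 1;
```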

Layer 3: Gold Layer (Business-Ready)

Purpose: Model data for specific business domains using dimensional modeling (star schema).

Characteristics

| Aspect | Detail |
|---|---|
| Data Quality | Guaranteed (multiple validation layers below) |
| Schema | Dimensional (facts + conformed dimensions, star schema) |
| Organized By | Business use case (sales, customer, marketing) |
| Ownership | Business analytics team / domain experts |
| Consumer Grade | ✅ Yes; business intelligence, dashboards, data science |

Example: Sales Gold Models

-- GOLD: Fact table from clean Silver data
CREATE TABLE IF NOT EXISTS gold.fact_sales (
  sale_key STRING PRIMARY KEY,

  -- Dimensional references
  customer_key STRING,
  product_key STRING,
  date_key INT,  -- References dim_date

  -- Additive measures
  quantity INT,
  amount DECIMAL(18, 2),
  cost DECIMAL(18, 2),
  discount_amount DECIMAL(18, 2),

  -- Non-additive (for context)
  unit_price DECIMAL(18, 2),

  created_timestamp TIMESTAMP,
  updated_timestamp TIMESTAMP
);

-- GOLD: Conformed dimension (integrated from multiple sources)
CREATE TABLE IF NOT EXISTS gold.dim_customer (
  customer_key STRING PRIMARY KEY,

  -- Degenerate & descriptive attributes
  customer_id INT,
  name STRING,
  email STRING,
  country_code STRING,
  segment STRING,  -- Computed from Silver data
  lifetime_value DECIMAL(18, 2),  -- Computed metric

  -- SCD Type 2: Track changes
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN
);

-- Build Gold from Silver
INSERT INTO gold.fact_sales
SELECT 
  CONCAT(s.sale_id, '_', s.customer_id, '_', s.product_id) as sale_key,
  c.customer_key,
  p.product_key,
  d.date_key,
  s.quantity,
  s.amount,
  s.cost,
  s.discount_amount,
  s.unit_price,
  current_timestamp(),
  current_timestamp()
FROM silver.sales s
JOIN silver.customers c ON s.customer_id = c.customer_id
JOIN silver.products p ON s.product_id = p.product_id
JOIN silver.dates d ON DATE(s.sale_date) = d.date_value;

Conformed Dimensions

Gold layer uses conformed dimensions — shared, integrated dimensions across all fact tables.

gold.dim_date
├── Shared by: fact_sales, fact_inventory, fact_website_visits
├── Contains: date_key, date, month, quarter, year, fiscal_period
└── Ensures: All fact tables use the same time logic

gold.dim_customer
├── Shared by: fact_sales, fact_support_tickets, fact_marketing_campaign
├── Contains: customer_key, name, segment, lifetime_value
└── Ensures: Analytics across domains uses consistent customer definition
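Because every fact table keys into the same dimension, cross-domain questions reduce to joins through it. A sketch, assuming a `date_key` join column matching `fact_sales` and `year`/`quarter` attributes on the dimension:

```sql
-- Revenue by quarter through the shared date dimension; any other
-- fact (inventory, visits) would group on exactly the same buckets
SELECT d.year, d.quarter, SUM(f.amount) AS revenue
FROM gold.fact_sales f
JOIN gold.dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;
```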

Data Flow Visualization

graph TB
    A["Operational<br/>System"]

    B["🥉 Bronze Layer<br/>(Raw Data Store)"]
    B1["bronze.customers<br/>bronze.orders<br/>bronze.products"]

    C["Data Quality<br/>Checks"]

    D["🥈 Silver Layer<br/>(Clean & Standardized)"]
    D1["silver.customers<br/>silver.orders<br/>silver.products"]

    E["Business<br/>Logic"]

    F["🥇 Gold Layer<br/>(Dimensional Models)"]
    F1["fact_sales<br/>dim_customer<br/>dim_product"]
    F2["fact_inventory<br/>dim_store<br/>dim_date"]

    G["Analytics & BI"]

    A -->|"Dump<br/>(minimal transform)"| B
    B --> B1
    B1 -->|"Dedup,<br/>validate"| C
    C --> D
    D --> D1
    D1 -->|"Dimensional<br/>models"| E
    E --> F
    F --> F1
    F --> F2
    F1 --> G
    F2 --> G

Governance Model: Who Owns What?

| Layer | Owned By | Responsibility | SLA |
|---|---|---|---|
| Bronze | Data/Platform team | Raw ingest, schema preservation | 99.9% uptime |
| Silver | Data quality team | Validation, deduplication, standardization | Errors resolved within 4 hours |
| Gold | Analytics/Domain team | Dimensional models, business logic, documentation | Errors resolved within 2 hours |

Common Patterns

Pattern 1: Medallion + Lakehouse

Medallion layers can live in:

  • Data Lake (Bronze): Cheap object storage (S3, GCS, ADLS)
  • Data Warehouse (Silver/Gold): Managed SQL (BigQuery, Snowflake)
Bronze: ADLS (unstructured, Parquet files)
Silver: BigQuery (semi-structured, validation rules)
Gold: BigQuery (highly structured, star schema)
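One way this split shows up in practice: Bronze files stay in object storage and the warehouse reads them in place through an external table. A hypothetical BigQuery-over-GCS sketch (the bucket, dataset, and path names are made up):

```sql
-- Bronze stays as Parquet in cheap object storage; BigQuery queries
-- it in place via an external table definition
CREATE EXTERNAL TABLE bronze_ext.customers
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-bronze-bucket/customers/*.parquet']
);
```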

Pattern 2: Medallion + Data Vault

Some organizations combine medallion with data vault concepts:

  • Bronze: Raw tables (1:1 with source)
  • Silver: Hub/Link/Satellite tables (data vault structure)
  • Gold: Business-ready flattened dimensions & facts
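A minimal sketch of what Silver looks like under this hybrid, with hypothetical hash-keyed Data Vault tables:

```sql
-- Hub: one row per business key, hashed for stable joins
CREATE TABLE IF NOT EXISTS silver.hub_customer (
  customer_hk STRING,        -- hash of the business key
  customer_id INT,           -- the business key itself
  load_ts TIMESTAMP,
  record_source STRING
);

-- Satellite: descriptive attributes, versioned by load timestamp
CREATE TABLE IF NOT EXISTS silver.sat_customer_details (
  customer_hk STRING,        -- references hub_customer
  load_ts TIMESTAMP,
  email STRING,
  country_code STRING,
  hash_diff STRING           -- hash of attributes, for change detection
);
```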

Key Decision: SCD Strategy per Layer

| Layer | SCD Type | Rationale |
|---|---|---|
| Bronze | None (append-only) | Preserve all historical data for debugging |
| Silver | Type 1 or 2 | Type 2 if historical changes matter for analytics |
| Gold | Type 2 | Track dimension changes for time-series accuracy |
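The payoff of Type 2 in Gold is point-in-time lookups. A sketch against `gold.dim_customer` as defined earlier (the date is a made-up example):

```sql
-- Which version of each customer was current on 2024-01-15?
SELECT customer_key, name, segment
FROM gold.dim_customer
WHERE effective_date <= DATE '2024-01-15'
  AND (end_date IS NULL OR end_date > DATE '2024-01-15');
```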

Deep-Dives

Deep Dive: Bronze Layer — Schema-on-read, schema inference, dealing with format changes.

Deep Dive: Silver Layer — Data quality frameworks, deduplication strategies, SCD Type 2 implementation.

Deep Dive: Gold Layer — Dimensional modeling for Gold, conformed dimensions, optimization.


Key Takeaways

Medallion architecture separates concerns: raw ingestion (Bronze) → validation (Silver) → business models (Gold).

Bronze layer is append-only, preserving all data for debugging without re-extraction.

Silver layer enforces data quality and standardization before business consumption.

Gold layer uses dimensional modeling (star schema) for business-ready analytics.

Clear ownership at each layer ensures accountability and scalability.

Governance by layer allows different SLAs and update strategies (append vs. merge).


Practice Questions

  1. Why is Bronze layer append-only?
  2. What's the difference between Silver and Gold layer data quality guarantees?
  3. What's a "conformed dimension" and why does it matter for Gold layer?
  4. How does medallion architecture help with debugging data quality issues?
  5. When would you choose SCD Type 2 in Silver vs. Gold layer?