03 · Medallion Architecture: Bronze, Silver, Gold Layers

Level: Intermediate
Time to read: 20 min
Pre-reading: 01 · Foundations · 02 · Warehouse Architecture
After reading: You'll understand the modern layered approach to organizing data warehouses that balances flexibility, governance, and performance.


The Problem with Monolithic Data Warehouses

Traditional Single-Layer Approach

graph LR
    A["Operational<br/>Systems"] -->|ETL| B["Data Warehouse<br/>(All-Purpose)"]
    B --> C["BI & Analytics"]
    B -->|Bug found| D["Manual Fixes & Reruns"]

Problems:

  1. No clear separation between raw, cleaned, and business-ready data
  2. Debugging is hard: when analytics break, the root cause is difficult to trace
  3. Data quality issues propagate to all downstream consumers
  4. Difficult to define ownership (raw data owner vs. business logic owner)
  5. Governance is all-or-nothing (no intermediate validation layer)

The Medallion Solution

graph LR
    A["Operational<br/>Systems"]
    B["🥉 Bronze<br/>(Raw)"]
    C["🥈 Silver<br/>(Cleaned)"]
    D["🥇 Gold<br/>(Business)"]
    E["BI & Analytics"]

    A -->|"ELT: Extract & Load<br/>(minimal transformation)"| B
    B -->|"Data quality,<br/>deduplication"| C
    C -->|"Business logic,<br/>conformed dims"| D
    D -->|"Validated, reliable"| E

Layer 1: Bronze Layer (Raw Data)

Purpose: Ingest data from operational systems in its rawest form — minimal transformation.

Characteristics

| Aspect | Detail |
|---|---|
| Data Quality | As-is from source (quality is not guaranteed) |
| Schema | Source schema + technical metadata (extraction date, source ID) |
| Duplication | Duplicate rows possible (not deduplicated yet) |
| Update Strategy | Append-only (full history retained; no updates or deletes) |
| Ownership | Platform/DataOps team |
| Consumer Grade | Not for end users; for technical teams and debugging |

Example: Customer Bronze Table

-- BRONZE: Raw dump from source system
CREATE TABLE IF NOT EXISTS bronze.customers (
  _elt_timestamp TIMESTAMP,         -- When loaded
  _elt_source_id STRING,            -- Which source system

  -- Exact columns from source
  customer_id INT,
  name STRING,
  email STRING,
  phone STRING,
  address STRING,
  age INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

-- Append-only: preserve all versions for audit
INSERT INTO bronze.customers
SELECT 
  current_timestamp() as _elt_timestamp,
  'source_system_1' as _elt_source_id,
  t.* FROM raw_source.customers t;

Design Principles

Minimize transformation

  • Extract exactly as-is from source
  • Add only technical metadata (timestamp, source ID)

Append-only pattern

  • Never UPDATE/DELETE raw data
  • Preserves history for debugging and compliance

Fault tolerance

  • If downstream fails, re-run from bronze without re-extracting source
  • Bronze acts as a replayable copy of the source system

Schema flexibility

  • Schema-on-read acceptable (messy schemas okay)
  • Technical team manually documents deviations
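The schema-on-read idea can be sketched as a query that extracts fields from a raw payload only at query time. The table and JSON paths below are hypothetical; `get_json_object` is Spark SQL, and other engines offer equivalents such as `JSON_VALUE`.

```sql
-- Schema-on-read sketch: Bronze stores the payload as an unparsed
-- string; fields are pulled out when queried (hypothetical table/paths)
SELECT
  _elt_timestamp,
  get_json_object(raw_payload, '$.customer.email')   AS email,
  get_json_object(raw_payload, '$.customer.country') AS country
FROM bronze.events_raw;
```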

What Bronze Saves

If analytical logic breaks:

1. Check Silver layer (was data quality the issue?)
2. Check Bronze layer (was source data wrong?)
→ No need to re-extract from production system
→ Can trace root cause to source or transformation
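As a sketch, assuming the tables from this chapter and a hypothetical load window, a replay means deleting and rebuilding the affected Silver rows straight from Bronze:

```sql
-- Replay a broken load window from Bronze; the production source is
-- never touched. The window boundary is a made-up example value.
DELETE FROM silver.customers
WHERE _elt_timestamp >= TIMESTAMP '2024-06-01 00:00:00';

-- Re-run the (fixed) Bronze → Silver transformation over that window;
-- only a few columns are shown here for brevity
INSERT INTO silver.customers (customer_id, customer_key, email, _elt_timestamp)
SELECT
  customer_id,
  CONCAT(name, '_', customer_id),
  LOWER(email),
  current_timestamp()
FROM bronze.customers
WHERE _elt_timestamp >= TIMESTAMP '2024-06-01 00:00:00';
```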


Layer 2: Silver Layer (Cleansed & Integrated)

Purpose: Standardize, deduplicate, and validate data for business use.

Characteristics

| Aspect | Detail |
|---|---|
| Data Quality | Validated (nulls handled, duplicates removed, constraints checked) |
| Schema | Business-standard schema (consistent naming, types) |
| Duplication | Deduplicated (business keys unique) |
| Update Strategy | Primarily SCD Type 1 (update in place) or Type 2 (track history) |
| Ownership | Data quality team + analytics team |
| Consumer Grade | Yes; internal tools and data scientists can use directly |

Example: Customer Silver Table

-- SILVER: Clean, deduplicated, business-ready
CREATE TABLE IF NOT EXISTS silver.customers (
  customer_id INT,
  customer_key STRING UNIQUE,      -- Deduplication key

  -- Clean, standardized columns
  first_name STRING,
  last_name STRING,
  email STRING,
  country_code STRING,             -- Validated against domain list
  age_group STRING,                -- Binned from age with null handling
  created_date DATE,               -- Parse to DATE, handle nulls

  -- Data quality flags
  is_valid BOOLEAN,                -- Passed all validations
  data_quality_issues STRING,      -- CSV of problems (if any)

  -- SCD Type 2: Track changes
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN,

  _elt_timestamp TIMESTAMP
);

-- Transformation: Bronze → Silver
MERGE INTO silver.customers t
USING (
  SELECT 
    customer_id,
    CONCAT(name, '_', customer_id) as customer_key,
    SPLIT_PART(name, ' ', 1) as first_name,
    SPLIT_PART(name, ' ', 2) as last_name,
    LOWER(email) as email,
    COALESCE(UPPER(SUBSTRING(address, -2)), 'XX') as country_code,

    CASE 
      WHEN age IS NULL THEN 'Unknown'
      WHEN age < 18 THEN 'Teen'
      WHEN age < 65 THEN 'Adult'
      ELSE 'Senior'
    END as age_group,

    CAST(created_at AS DATE) as created_date,

    -- Data quality check
    CASE 
      WHEN email IS NULL OR email NOT LIKE '%@%' THEN FALSE 
      WHEN customer_id IS NULL THEN FALSE 
      ELSE TRUE 
    END as is_valid,

    CASE 
      WHEN email IS NULL THEN 'missing_email'
      WHEN email NOT LIKE '%@%' THEN 'invalid_email'
      ELSE NULL 
    END as data_quality_issues,

    current_date() as effective_date,
    null as end_date,
    true as is_current,
    current_timestamp() as _elt_timestamp
  FROM bronze.customers b
) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND (
  t.email != s.email OR
  t.age_group != s.age_group
) THEN UPDATE SET
  is_current = FALSE,
  end_date = current_date()
WHEN NOT MATCHED THEN INSERT *;
-- Caveat: a single MERGE expires the old version of a changed customer
-- but cannot also insert its new version in the same pass; a second
-- INSERT (or the "NULL merge key" union trick) adds the fresh current rows.

Key Responsibilities

| Task | Owner | Frequency |
|---|---|---|
| Deduplication | Quality team | Daily (detect duplicates from source) |
| Type validation | Quality team | Daily (ensure DATE is DATE, not string) |
| Domain validation | Quality team | Daily (country codes valid, emails formatted) |
| Null handling | Quality team | Daily (replace with defaults or flags) |
| SCD Type 2 | Analytics team | Daily (track changes for historical analysis) |
| Integration | Analytics team | Weekly (join multiple source dimensions) |
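The deduplication task is commonly implemented with a window function: keep only the newest Bronze row per business key. A minimal sketch, assuming the Bronze table from earlier:

```sql
-- One row per customer_id: the most recently loaded version wins
SELECT * FROM (
  SELECT
    b.*,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY _elt_timestamp DESC
    ) AS rn
  FROM bronze.customers b
) latest
WHERE rn = 1;
```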

Layer 3: Gold Layer (Business-Ready)

Purpose: Model data for specific business domains using dimensional modeling (star schema).

Characteristics

| Aspect | Detail |
|---|---|
| Data Quality | Guaranteed (multiple validation layers below) |
| Schema | Dimensional (facts + conformed dimensions, star schema) |
| Organized By | Business use case (sales, customer, marketing) |
| Ownership | Business analytics team / domain experts |
| Consumer Grade | ✅ Yes; business intelligence, dashboards, data science |

Example: Sales Gold Models

-- GOLD: Fact table from clean Silver data
CREATE TABLE IF NOT EXISTS gold.fact_sales (
  sale_key STRING PRIMARY KEY,

  -- Dimensional references
  customer_key STRING,
  product_key STRING,
  date_key INT,  -- References dim_date

  -- Additive measures
  quantity INT,
  amount DECIMAL(18, 2),
  cost DECIMAL(18, 2),
  discount_amount DECIMAL(18, 2),

  -- Non-additive (for context)
  unit_price DECIMAL(18, 2),

  created_timestamp TIMESTAMP,
  updated_timestamp TIMESTAMP
);

-- GOLD: Conformed dimension (integrated from multiple sources)
CREATE TABLE IF NOT EXISTS gold.dim_customer (
  customer_key STRING PRIMARY KEY,

  -- Degenerate & descriptive attributes
  customer_id INT,
  name STRING,
  email STRING,
  country_code STRING,
  segment STRING,  -- Computed from Silver data
  lifetime_value DECIMAL(18, 2),  -- Computed metric

  -- SCD Type 2: Track changes
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN
);

-- Build Gold from Silver
INSERT INTO gold.fact_sales
SELECT 
  CONCAT(s.sale_id, '_', s.customer_id, '_', s.product_id) as sale_key,
  c.customer_key,
  p.product_key,
  d.date_key,
  s.quantity,
  s.amount,
  s.cost,
  s.discount_amount,
  s.unit_price,
  current_timestamp(),
  current_timestamp()
FROM silver.sales s
JOIN silver.customers c ON s.customer_id = c.customer_id
JOIN silver.products p ON s.product_id = p.product_id
JOIN silver.dates d ON DATE(s.sale_date) = d.date_value;

Conformed Dimensions

Gold layer uses conformed dimensions — shared, integrated dimensions across all fact tables.

gold.dim_date
├── Shared by: fact_sales, fact_inventory, fact_website_visits
├── Contains: date_key, date, month, quarter, year, fiscal_period
└── Ensures: All fact tables use the same time logic

gold.dim_customer
├── Shared by: fact_sales, fact_support_tickets, fact_marketing_campaign
├── Contains: customer_key, name, segment, lifetime_value
└── Ensures: Analytics across domains uses consistent customer definition
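Because every fact table keys into the same dimension, cross-domain questions reduce to joins through it. A sketch, assuming a `date_key` join column matching `fact_sales` and `year`/`quarter` attributes on the dimension:

```sql
-- Revenue by quarter through the shared date dimension; any other
-- fact (inventory, visits) would group on exactly the same buckets
SELECT d.year, d.quarter, SUM(f.amount) AS revenue
FROM gold.fact_sales f
JOIN gold.dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;
```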

Data Flow Visualization

graph TB
    A["Operational<br/>System"]

    B["🥉 Bronze Layer<br/>(Raw Data Store)"]
    B1["bronze.customers<br/>bronze.orders<br/>bronze.products"]

    C["Data Quality<br/>Checks"]

    D["🥈 Silver Layer<br/>(Clean & Standardized)"]
    D1["silver.customers<br/>silver.orders<br/>silver.products"]

    E["Business<br/>Logic"]

    F["🥇 Gold Layer<br/>(Dimensional Models)"]
    F1["fact_sales<br/>dim_customer<br/>dim_product"]
    F2["fact_inventory<br/>dim_store<br/>dim_date"]

    G["Analytics & BI"]

    A -->|"Dump<br/>(minimal transform)"| B
    B --> B1
    B1 -->|"Dedup,<br/>validate"| C
    C --> D
    D --> D1
    D1 -->|"Dimensional<br/>models"| E
    E --> F
    F --> F1
    F --> F2
    F1 --> G
    F2 --> G

Governance Model: Who Owns What?

| Layer | Owned By | Responsibility | SLA |
|---|---|---|---|
| Bronze | Data/Platform team | Raw ingest, schema preservation | 99.9% uptime |
| Silver | Data quality team | Validation, deduplication, standardization | Errors resolved within 4 hours |
| Gold | Analytics/Domain team | Dimensional models, business logic, documentation | Errors resolved within 2 hours |

Common Patterns

Pattern 1: Medallion + Lakehouse

Medallion layers can live in:

  • Data Lake (Bronze): Cheap object storage (S3, GCS, ADLS)
  • Data Warehouse (Silver/Gold): Managed SQL (BigQuery, Snowflake)
Bronze: ADLS (unstructured, Parquet files)
Silver: BigQuery (semi-structured, validation rules)
Gold: BigQuery (highly structured, star schema)
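One way this split shows up in practice: Bronze files stay in object storage and the warehouse reads them in place through an external table. A hypothetical BigQuery-over-GCS sketch (the bucket, dataset, and path names are made up):

```sql
-- Bronze stays as Parquet in cheap object storage; BigQuery queries
-- it in place via an external table definition
CREATE EXTERNAL TABLE bronze_ext.customers
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://example-bronze-bucket/customers/*.parquet']
);
```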

Pattern 2: Medallion + Data Vault

Some organizations combine medallion with data vault concepts:

  • Bronze: Raw tables (1:1 with source)
  • Silver: Hub/Link/Satellite tables (data vault structure)
  • Gold: Business-ready flattened dimensions & facts
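A minimal sketch of what Silver looks like under this hybrid, with hypothetical hash-keyed Data Vault tables:

```sql
-- Hub: one row per business key, hashed for stable joins
CREATE TABLE IF NOT EXISTS silver.hub_customer (
  customer_hk STRING,        -- hash of the business key
  customer_id INT,           -- the business key itself
  load_ts TIMESTAMP,
  record_source STRING
);

-- Satellite: descriptive attributes, versioned by load timestamp
CREATE TABLE IF NOT EXISTS silver.sat_customer_details (
  customer_hk STRING,        -- references hub_customer
  load_ts TIMESTAMP,
  email STRING,
  country_code STRING,
  hash_diff STRING           -- hash of attributes, for change detection
);
```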

Key Decision: SCD Strategy per Layer

| Layer | SCD Type | Rationale |
|---|---|---|
| Bronze | None (append-only) | Preserve all historical data for debugging |
| Silver | Type 1 or 2 | Type 2 if historical changes matter for analytics |
| Gold | Type 2 | Track dimension changes for time-series accuracy |
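The payoff of Type 2 in Gold is point-in-time lookups. A sketch against `gold.dim_customer` as defined earlier (the date is a made-up example):

```sql
-- Which version of each customer was current on 2024-01-15?
SELECT customer_key, name, segment
FROM gold.dim_customer
WHERE effective_date <= DATE '2024-01-15'
  AND (end_date IS NULL OR end_date > DATE '2024-01-15');
```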

Deep-Dives

Deep Dive: Bronze Layer — Schema-on-read, schema inference, dealing with format changes.

Deep Dive: Silver Layer — Data quality frameworks, deduplication strategies, SCD Type 2 implementation.

Deep Dive: Gold Layer — Dimensional modeling for Gold, conformed dimensions, optimization.


Key Takeaways

Medallion architecture separates concerns: raw ingestion (Bronze) → validation (Silver) → business models (Gold).

Bronze layer is append-only, preserving all data for debugging without re-extraction.

Silver layer enforces data quality and standardization before business consumption.

Gold layer uses dimensional modeling (star schema) for business-ready analytics.

Clear ownership at each layer ensures accountability and scalability.

Governance by layer allows different SLAs and update strategies (append vs. merge).


Practice Questions

  1. Why is Bronze layer append-only?
  2. What's the difference between Silver and Gold layer data quality guarantees?
  3. What's a "conformed dimension" and why does it matter for Gold layer?
  4. How does medallion architecture help with debugging data quality issues?
  5. When would you choose SCD Type 2 in Silver vs. Gold layer?