03 · Medallion Architecture: Bronze, Silver, Gold Layers
Level: Intermediate Time to read: 20 min Pre-reading: 01 · Foundations · 02 · Warehouse Architecture After reading: You'll understand the modern layered approach to organizing data warehouses that balances flexibility, governance, and performance.
The Problem with Monolithic Data Warehouses
Traditional Single-Layer Approach
graph LR
A["Operational<br/>Systems"] -->|ETL| B["Data Warehouse<br/>(All-Purpose)"]
B --> C["BI & Analytics"]
B -->|Bug found| D["Manual Fixes & Reruns"]
Problems:
- No clear separation between raw, cleaned, and business-ready data
- Debugging is hard — if analytics break, hard to trace root cause
- Data quality issues propagate to all downstream consumers
- Difficult to define ownership (raw data owner vs. business logic owner)
- Governance is all-or-nothing (no intermediate validation layer)
The Medallion Solution
graph LR
A["Operational<br/>Systems"]
B["🥉 Bronze<br/>(Raw)"]
C["🥈 Silver<br/>(Cleaned)"]
D["🥇 Gold<br/>(Business)"]
E["BI & Analytics"]
A -->|"ELT: Extract & Load<br/>(minimal transformation)"| B
B -->|"Data quality,<br/>deduplication"| C
C -->|"Business logic,<br/>conformed dims"| D
D -->|"Validated, reliable"| E
Layer 1: Bronze Layer (Raw Data)
Purpose: Ingest data from operational systems in its rawest form — minimal transformation.
Characteristics
| Aspect | Detail |
|---|---|
| Data Quality | As-is from source (quality is not guaranteed) |
| Schema | Source schema + technical metadata (extraction date, source ID) |
| Duplication | Duplicate rows possible (not deduplicated yet) |
| Update Strategy | Append-only (SCD Type 6: keep all versions) |
| Ownership | Platform/DataOps team |
| Consumer Grade | Not for end-users; for technical teams debugging |
Example: Customer Bronze Table
-- BRONZE: Raw dump from source system
CREATE TABLE IF NOT EXISTS bronze.customers (
_elt_timestamp TIMESTAMP, -- When loaded
_elt_source_id STRING, -- Which source system
-- Exact columns from source
customer_id INT,
name STRING,
email STRING,
phone STRING,
address STRING,
age INT,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- Append-only: preserve all versions for audit
INSERT INTO bronze.customers
SELECT
current_timestamp() as _elt_timestamp,
'source_system_1' as _elt_source_id,
t.* FROM raw_source.customers t;
Design Principles
✅ Minimize transformation
- Extract exactly as-is from source
- Add only technical metadata (timestamp, source ID)
✅ Append-only pattern
- Never UPDATE/DELETE raw data
- Preserves history for debugging and compliance
✅ Fault tolerance
- If downstream fails, re-run from bronze without re-extracting source
- Bronze is replayed version of source
✅ Schema flexibility
- Schema-on-read acceptable (messy schemas okay)
- Technical team manually documents deviations
What Bronze Saves
If analytical logic breaks:
1. Check Silver layer (was data quality the issue?)
2. Check Bronze layer (was source data wrong?)
→ No need to re-extract from production system
→ Can trace root cause to source or transformation
Layer 2: Silver Layer (Cleansed & Integrated)
Purpose: Standardize, deduplicate, and validate data for business use.
Characteristics
| Aspect | Detail |
|---|---|
| Data Quality | Validated (nulls handled, duplicates removed, constraints checked) |
| Schema | Business-standard schema (consistent naming, types) |
| Duplication | Deduplicated (business keys unique) |
| Update Strategy | Primarily SCD Type 1 (update in-place) or Type 2 (track history) |
| Ownership | Data quality team + analytics team |
| Consumer Grade | Yes; internal tools and data scientists can use directly |
Example: Customer Silver Table
-- SILVER: Clean, deduplicated, business-ready
CREATE TABLE IF NOT EXISTS silver.customers (
customer_id INT,
customer_key STRING UNIQUE, -- Deduplication key
-- Clean, standardized columns
first_name STRING,
last_name STRING,
email STRING,
country_code STRING, -- Validated against domain list
age_group STRING, -- Binned from age with null handling
created_date DATE, -- Parse to DATE, handle nulls
-- Data quality flags
is_valid BOOLEAN, -- Passed all validations
data_quality_issues STRING, -- CSV of problems (if any)
-- SCD Type 2: Track changes
effective_date DATE,
end_date DATE,
is_current BOOLEAN,
_elt_timestamp TIMESTAMP
);
-- Transformation: Bronze → Silver
MERGE INTO silver.customers t
USING (
SELECT
customer_id,
CONCAT(name, '_', customer_id) as customer_key,
SPLIT_PART(name, ' ', 1) as first_name,
SPLIT_PART(name, ' ', 2) as last_name,
LOWER(email) as email,
COALESCE(UPPER(SUBSTRING(address, -2)), 'XX') as country_code,
CASE
WHEN age IS NULL THEN 'Unknown'
WHEN age < 18 THEN 'Teen'
WHEN age < 65 THEN 'Adult'
ELSE 'Senior'
END as age_group,
CAST(created_at AS DATE) as created_date,
-- Data quality check
CASE
WHEN email IS NULL OR email NOT LIKE '%@%' THEN FALSE
WHEN customer_id IS NULL THEN FALSE
ELSE TRUE
END as is_valid,
CASE
WHEN email IS NULL THEN 'missing_email'
WHEN email NOT LIKE '%@%' THEN 'invalid_email'
ELSE NULL
END as data_quality_issues,
current_date() as effective_date,
null as end_date,
true as is_current,
current_timestamp() as _elt_timestamp
FROM bronze.customers b
) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND (
t.email != s.email OR
t.age_group != s.age_group
) THEN UPDATE SET
is_current = FALSE,
end_date = current_date()
WHEN NOT MATCHED THEN INSERT *;
Key Responsibilities
| Task | Owner | Frequency |
|---|---|---|
| Deduplication | Quality team | Daily (detect dups from source) |
| Type validation | Quality team | Daily (ensure DATE is DATE, not string) |
| Domain validation | Quality team | Daily (country codes valid, emails formatted) |
| Null handling | Quality team | Daily (replace with defaults or flags) |
| SCD Type 2 | Analytics team | Daily (track changes for historical analysis) |
| Integration | Analytics team | Weekly (join multiple source dimensions) |
Layer 3: Gold Layer (Business-Ready)
Purpose: Model data for specific business domains using dimensional modeling (star schema).
Characteristics
| Aspect | Detail |
|---|---|
| Data Quality | Guaranteed (multiple validation layers below) |
| Schema | Dimensional (facts + conformed dimensions, star schema) |
| Organized By | Business use case (sales, customer, marketing) |
| Ownership | Business analytics team / Domain experts |
| Consumer Grade | ✅ Yes; business intelligence, dashboards, data science |
Example: Sales Gold Models
-- GOLD: Fact table from clean Silver data
CREATE TABLE IF NOT EXISTS gold.fact_sales (
sale_key STRING PRIMARY KEY,
-- Dimensional references
customer_key STRING,
product_key STRING,
date_key INT, -- References dim_date
-- Additive measures
quantity INT,
amount DECIMAL(18, 2),
cost DECIMAL(18, 2),
discount_amount DECIMAL(18, 2),
-- Non-additive (for context)
unit_price DECIMAL(18, 2),
created_timestamp TIMESTAMP,
updated_timestamp TIMESTAMP
);
-- GOLD: Conformed dimension (integrated from multiple sources)
CREATE TABLE IF NOT EXISTS gold.dim_customer (
customer_key STRING PRIMARY KEY,
-- Degenerate & descriptive attributes
customer_id INT,
name STRING,
email STRING,
country_code STRING,
segment STRING, -- Computed from Silver data
lifetime_value DECIMAL, -- Computed metric
-- SCD Type 2: Track changes
effective_date DATE,
end_date DATE,
is_current BOOLEAN
);
-- Build Gold from Silver
INSERT INTO gold.fact_sales
SELECT
CONCAT(s.sale_id, '_', s.customer_id, '_', s.product_id) as sale_key,
c.customer_key,
p.product_key,
d.date_key,
s.quantity,
s.amount,
s.cost,
s.discount_amount,
s.unit_price,
current_timestamp(),
current_timestamp()
FROM silver.sales s
JOIN silver.customers c ON s.customer_id = c.customer_id
JOIN silver.products p ON s.product_id = p.product_id
JOIN silver.dates d ON DATE(s.sale_date) = d.date_value;
Conformed Dimensions
Gold layer uses conformed dimensions — shared, integrated dimensions across all fact tables.
gold.dim_date
├── Shared by: fact_sales, fact_inventory, fact_website_visits
├── Contains: day_key, date, month, quarter, year, fiscal_period
└── Ensures: All fact tables use the same time logic
gold.dim_customer
├── Shared by: fact_sales, fact_support_tickets, fact_marketing_campaign
├── Contains: customer_key, name, segment, lifetime_value
└── Ensures: Analytics across domains uses consistent customer definition
Data Flow Visualization
graph TB
A["Operational<br/>System"]
B["🥉 Bronze Layer<br/>(Raw Data Store)"]
B1["bronze.customers<br/>bronze.orders<br/>bronze.products"]
C["Data Quality<br/>Checks"]
D["🥈 Silver Layer<br/>(Clean & Standardized)"]
D1["silver.customers<br/>silver.orders<br/>silver.products"]
E["Business<br/>Logic"]
F["🥇 Gold Layer<br/>(Dimensional Models)"]
F1["fact_sales<br/>dim_customer<br/>dim_product"]
F2["fact_inventory<br/>dim_store<br/>dim_date"]
G["Analytics & BI"]
A -->|"Dump<br/>(minimal transform)"| B
B --> B1
B1 -->|"Dedup,<br/>validate"| C
C --> D
D --> D1
D1 -->|"Dimensional<br/>models"| E
E --> F
F --> F1
F --> F2
F1 --> G
F2 --> G
Governance Model: Who Owns What?
| Layer | Owned By | Responsibility | SLA |
|---|---|---|---|
| Bronze | Data/Platform team | Raw ingest, schema preservation | 99.9% uptime |
| Silver | Data quality team | Validation, deduplication, standardization | 4-hour SLA errors |
| Gold | Analytics/Domain team | Dimensional models, business logic, documentation | 2-hour SLA errors |
Common Patterns
Pattern 1: Medallion + Lakehouse
Medallion layers can live in:
- Data Lake (Bronze): Cheap object storage (S3, GCS, ADLS)
- Data Warehouse (Silver/Gold): Managed SQL (BigQuery, Snowflake)
Bronze: ADLS (unstructured, Parquet files)
↓
Silver: BigQuery (semi-structured, validation rules)
↓
Gold: BigQuery (highly structured, star schema)
Pattern 2: Medallion + Data Vault
Some organizations combine medallion with data vault concepts:
- Bronze: Raw tables (1:1 with source)
- Silver: Hub/Link/Satellite tables (data vault structure)
- Gold: Business-ready flattened dimensions & facts
Key Decision: SCD Strategy per Layer
| Layer | SCD Type | Rationale |
|---|---|---|
| Bronze | None (append-only) | Preserve all historical data for debugging |
| Silver | Type 1 or 2 | Type 2 if historical changes matter for analytics |
| Gold | Type 2 | Track dimension changes for time-series accuracy |
Deep-Dives
→ Deep Dive: Bronze Layer — Schema-on-read, schema inference, dealing with format changes.
→ Deep Dive: Silver Layer — Data quality frameworks, deduplication strategies, SCD Type 2 implementation.
→ Deep Dive: Gold Layer — Dimensional modeling for Gold, conformed dimensions, optimization.
Key Takeaways
✅ Medallion architecture separates concerns: raw ingestion (Bronze) → validation (Silver) → business models (Gold).
✅ Bronze layer is append-only, preserving all data for debugging without re-extraction.
✅ Silver layer enforces data quality and standardization before business consumption.
✅ Gold layer uses dimensional modeling (star schema) for business-ready analytics.
✅ Clear ownership at each layer ensures accountability and scalability.
✅ Governance by layer allows different SLAs and update strategies (append vs. merge).
Practice Questions
- Why is Bronze layer append-only?
- What's the difference between Silver and Gold layer data quality guarantees?
- What's a "conformed dimension" and why does it matter for Gold layer?
- How does medallion architecture help with debugging data quality issues?
- When would you choose SCD Type 2 in Silver vs. Gold layer?