Deep Dive: Bronze Layer - Raw Data Ingestion & Schema-on-Read
Level: Intermediate · Pre-reading: 03 · Medallion Architecture · Time to read: 15 min
Bronze Layer: The Raw Data Store
Purpose: Preserve Original, Transform Later
The Bronze layer is your data audit trail. It asks: "What did the source system send us?"
Source System (PostgreSQL):
CREATE TABLE customers (
id INT,
name VARCHAR,
email VARCHAR,
created_at TIMESTAMP
);
Bronze Layer (Data Lake):
CREATE TABLE bronze.customers (
_elt_timestamp TIMESTAMP, -- When we loaded it
_elt_source_id STRING, -- Which system sent it
-- Exact replica
id INT,
name VARCHAR,
email VARCHAR,
created_at TIMESTAMP
);
Design Principles
Principle 1: Append-Only, Never Modify
-- ✅ GOOD: Append only
INSERT INTO bronze.customers
SELECT
CURRENT_TIMESTAMP() as _elt_timestamp,
'postgres_prod' as _elt_source_id,
t.*
FROM source.customers t;
-- ❌ BAD: Overwrite existing data
TRUNCATE TABLE bronze.customers;
INSERT INTO bronze.customers ...;
-- (Loses audit trail!)
-- ❌ BAD: Update individual records
UPDATE bronze.customers SET email = 'new@example.com' WHERE id = 5;
-- (Can't debug root cause anymore)
Why? If data quality breaks downstream, you can trace the problem back to the source (see the sketch below):
- Bronze still holds the original data
- Re-run quality checks directly against Bronze
- No need to re-extract from production
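For example, if a downstream quality gate fails, you can replay the check against the preserved Bronze rows for the suspect load. A minimal sketch (the specific check is illustrative):
-- Replay a quality check against Bronze for a given load,
-- without touching the production source system
SELECT
  _elt_source_id,
  COUNT(*) as rows_loaded,
  COUNT(CASE WHEN email IS NULL THEN 1 END) as null_emails
FROM bronze.customers
WHERE DATE(_elt_timestamp) = '2024-01-15' -- the suspect load
GROUP BY 1;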
Principle 2: Use Surrogate Partition Column
-- ✅ GOOD: Partition by extraction date
CREATE TABLE bronze.customers
USING PARQUET
PARTITIONED BY (extraction_date) -- in CTAS, the column type comes from the SELECT
AS SELECT ...;
-- Partitions: 2024-01-01/, 2024-01-02/, ...
-- Fast to find specific day's data
-- Daily load
INSERT INTO bronze.customers
PARTITION (extraction_date = '2024-01-15')
SELECT * FROM source.customers;
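With partitions in place, day-scoped reads prune to a single directory instead of scanning the whole table. A minimal example:
-- Scans only the 2024-01-15 partition
SELECT COUNT(*) as rows_loaded
FROM bronze.customers
WHERE extraction_date = '2024-01-15';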
Principle 3: Technical Metadata, Not Transformation
-- ✅ GOOD: Only technical metadata
INSERT INTO bronze.customers (
_elt_timestamp,
_elt_source_id,
_elt_row_number, -- Row order
_elt_source_file, -- Which file/export
-- Raw columns, unchanged
id, name, email, created_at
) VALUES ...;
-- ❌ BAD: Transform in Bronze
INSERT INTO bronze.customers (
_elt_timestamp,
customer_key, -- ← Don't do this!
name_upper, -- ← Don't do this!
parsed_email_domain, -- ← Don't do this!
...
);
-- (Belongs in Silver, not Bronze!)
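Those derivations belong one layer down. A hedged sketch of the same columns computed in Silver (Spark-style functions assumed; the target table is omitted for brevity):
-- Silver: derive business columns from untouched Bronze rows
SELECT
  md5(CAST(id AS STRING)) as customer_key,
  UPPER(name) as name_upper,
  split(email, '@')[1] as parsed_email_domain, -- domain part after '@'
  created_at
FROM bronze.customers
WHERE extraction_date = CURRENT_DATE;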
Schema-on-Read: Handling Messy Sources
Problem: Heterogeneous Source Schemas
API v1 sends:
{customer_id, name, email}
API v2 sends:
{customer_id, first_name, last_name, email, phone}
API v3 sends:
{customer_id, full_name, email_addresses: []}
Bronze layer receives all:
└─ How do you load them into a single table?
Solution: Schema-on-Read Pattern
Store the raw JSON as a string in a schema-flexible (Parquet-backed) table, then parse in Silver:
-- Bronze: Flexible Parquet (schema-less)
CREATE TABLE bronze.customers_raw (
_elt_timestamp TIMESTAMP,
_elt_source STRING,
_elt_source_file STRING,
raw_json STRING -- Store entire JSON as string
);
-- Insert all variants
INSERT INTO bronze.customers_raw VALUES
(CURRENT_TIMESTAMP(), 'api_v1', 'customers_2024-01-15.json', '{"customer_id":1,"name":"Alice"}'),
(CURRENT_TIMESTAMP(), 'api_v2', 'customers_2024-01-15.json', '{"customer_id":2,"first_name":"Bob","last_name":"Jones"}'),
(CURRENT_TIMESTAMP(), 'api_v3', 'customers_2024-01-15.json', '{"customer_id":3,"full_name":"Charlie"}');
-- Silver: Parse and standardize
WITH parsed AS (
SELECT
_elt_timestamp,
JSON_EXTRACT_SCALAR(raw_json, '$.customer_id') as customer_id,
COALESCE(
JSON_EXTRACT_SCALAR(raw_json, '$.name'),
JSON_EXTRACT_SCALAR(raw_json, '$.full_name'),
CONCAT(
JSON_EXTRACT_SCALAR(raw_json, '$.first_name'), ' ',
JSON_EXTRACT_SCALAR(raw_json, '$.last_name')
)
) as name
FROM bronze.customers_raw
)
INSERT INTO silver.customers
SELECT
customer_id,
name,
_elt_timestamp
FROM parsed;
Benefit: Bronze remains a pure audit log (raw JSON), Silver handles parsing variations.
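The same pattern extends to email, which api_v3 wraps in an array. A sketch that coalesces the scalar and array variants, using the same BigQuery-style JSON functions as above (the '$.email_addresses[0]' path assumes array-index support):
-- Standardize email across scalar (v1/v2) and array (v3) payloads
SELECT
  JSON_EXTRACT_SCALAR(raw_json, '$.customer_id') as customer_id,
  COALESCE(
    JSON_EXTRACT_SCALAR(raw_json, '$.email'),             -- v1, v2
    JSON_EXTRACT_SCALAR(raw_json, '$.email_addresses[0]') -- v3: first array entry
  ) as email
FROM bronze.customers_raw;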
Handling Format Changes
Scenario: Source adds new field
Week 1: Source API sends {id, name, email}
Week 2: Source API adds {phone} to the payload
Bronze handling:
SCHEMA INFERENCE (dynamic schema):
└─ New column automatically detected
└─ Older records: phone = NULL
└─ No pipeline change needed
With static schema:
└─ Pipeline fails (unexpected column)
└─ Manual fix required
Recommendation: Use dynamic/inferred schema in Bronze
# PySpark: schema is inferred automatically for JSON sources
df = spark.read.json("s3://raw/customers/*.json")
df.write.mode("append").parquet("s3://bronze/customers")
# New fields show up as extra columns; older rows read them as NULL
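If you do keep a static schema in Bronze, the "manual fix" usually amounts to evolving the table before replaying the failed batch. A sketch in Spark-style SQL:
-- Static-schema alternative: evolve the Bronze table by hand,
-- then re-run the load for the failed batch
ALTER TABLE bronze.customers ADD COLUMNS (phone STRING);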
Deduplication Analysis (Bronze → Silver)
Start by profiling the duplicate distribution in Bronze:
-- Bronze: Raw data with potential duplicates
SELECT
id,
COUNT(*) as num_rows,
COUNT(DISTINCT name) as distinct_names,
COUNT(DISTINCT email) as distinct_emails
FROM bronze.customers
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY 2 DESC;
-- Output:
-- | 5 | 3 | 2 | 1 | ← id=5 appears 3x, with 2 different names
-- | 7 | 2 | 1 | 2 | ← id=7 appears 2x, with 2 different emails
Root cause analysis (the sketch after this list helps separate the first two):
- Duplicates in source system? (bug in source app)
- Duplicate extraction? (extraction logic re-ran)
- Late-arriving data? (the same event delivered again later)
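Byte-identical rows point to re-extraction, while same-key rows with differing payloads point to a source-side bug. A hedged sketch that separates the two:
-- Exact duplicates (all columns identical) → likely re-extraction
SELECT id, name, email, created_at, COUNT(*) as copies
FROM bronze.customers
GROUP BY id, name, email, created_at
HAVING COUNT(*) > 1;
-- Same key, different payloads → likely a source-system bug
SELECT id, COUNT(DISTINCT CONCAT(name, '|', email)) as variants
FROM bronze.customers
GROUP BY id
HAVING COUNT(DISTINCT CONCAT(name, '|', email)) > 1;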
Solution in Silver:
-- Deduplication strategy
INSERT INTO silver.customers
SELECT
id as customer_id,
MAX(name) as name, -- or FIRST_VALUE(name) by timestamp
MAX(email) as email,
MAX(created_at) as created_at
FROM bronze.customers
WHERE extraction_date = CURRENT_DATE
GROUP BY 1; -- Deduplicate by business key
-- Log duplicates for root cause analysis
INSERT INTO data_quality.duplicate_report
SELECT
'customers' as table_name,
id as customer_id,
COUNT(*) as duplicate_count
FROM bronze.customers
WHERE extraction_date = CURRENT_DATE
GROUP BY 1, 2
HAVING COUNT(*) > 1;
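Note that the MAX() aggregates can stitch together values from different duplicate rows. A row-level alternative keeps the entire most-recent record per key; a sketch using a window function:
-- Keep the latest complete row per business key
INSERT INTO silver.customers
SELECT id as customer_id, name, email, created_at
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY _elt_timestamp DESC
    ) as rn
  FROM bronze.customers
  WHERE extraction_date = CURRENT_DATE
) t
WHERE rn = 1;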
Handling Late-Arriving Data in Bronze
-- Bronze: Preserve arrival order
INSERT INTO bronze.orders (
_elt_timestamp, -- 2024-01-20 (loaded today)
_elt_source_date, -- 2024-01-01 (order date)
order_id,
amount
) VALUES
(CURRENT_TIMESTAMP(), '2024-01-01', 12345, 99.99), -- Timely
(CURRENT_TIMESTAMP(), '2023-12-30', 12340, 149.99); -- Late (20 days)
-- Silver: Identify late arrivals (order date far behind the load date)
SELECT
DATE(order_date) as order_date,
COUNT(*) as num_orders,
COUNT(CASE WHEN DATEDIFF(day, order_date, _elt_timestamp) > 7 THEN 1 END) as late_arriving, -- 7 days is an illustrative threshold
ROUND(100.0 * COUNT(CASE WHEN DATEDIFF(day, order_date, _elt_timestamp) > 7 THEN 1 END) / COUNT(*), 2) as pct_late
FROM silver.orders
WHERE DATE(_elt_timestamp) = CURRENT_DATE
GROUP BY 1;
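When a late order lands in an already-processed period, the affected Silver rows can be upserted rather than fully rebuilt. A hedged sketch using MERGE (the UPDATE SET * / INSERT * shorthand assumes a Delta-style engine and matching column lists):
-- Upsert today's Bronze arrivals, including late ones, keyed on order_id
MERGE INTO silver.orders s
USING (
  SELECT order_id, _elt_source_date as order_date, amount, _elt_timestamp
  FROM bronze.orders
  WHERE DATE(_elt_timestamp) = CURRENT_DATE
) b
ON s.order_id = b.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;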
Compliance & Retention
PII in Bronze?
-- Bronze stores raw data including PII
CREATE TABLE bronze.customers (
_elt_timestamp TIMESTAMP,
customer_id INT,
name VARCHAR, -- PII
email VARCHAR, -- PII
phone VARCHAR, -- PII
ssn VARCHAR -- PII
);
-- Access control: Who can read Bronze?
-- - Data engineers (need to see raw for debugging)
-- - Data quality team (need to see raw for audits)
-- - NOT analysts (apply masking in Silver/Gold)
-- Retention: How long to keep Bronze?
-- - Typical: 90 days (cost of raw storage)
-- - Some: 1 year (compliance requirement)
-- - Archive: Older data → cloud storage (cheaper)
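A minimal sketch of that access split (role names, GRANT syntax, and masking functions vary by warehouse; everything here is illustrative):
-- Raw Bronze: engineers and the DQ team only
GRANT SELECT ON SCHEMA bronze TO ROLE data_engineer;
GRANT SELECT ON SCHEMA bronze TO ROLE data_quality;
-- Analysts read a masked Silver view instead
CREATE VIEW silver.customers_masked AS
SELECT
  customer_id,
  name,
  SHA2(email, 256) as email_hash, -- mask PII
  NULL as ssn                     -- drop entirely
FROM silver.customers;
GRANT SELECT ON silver.customers_masked TO ROLE analyst;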
Archival Example
-- Monthly archival process (BigQuery-style EXPORT DATA; the uri takes a single * wildcard)
EXPORT DATA OPTIONS(
uri='gs://archive-bucket/bronze/customers/2023-12/*.parquet',
format='PARQUET'
) AS
SELECT * FROM bronze.customers
WHERE DATE(_elt_timestamp) BETWEEN '2023-12-01' AND '2023-12-31';
-- Delete from hot storage to save cost (date-diff syntax varies by engine)
DELETE FROM bronze.customers
WHERE DATEDIFF(day, _elt_timestamp, CURRENT_DATE) > 90;
Bronze Layer Monitoring
Run these checks daily to catch quality issues at the source:
-- Data Quality: Bronze Layer SLA
-- (bronze.tables stands for a view unioning all Bronze tables with a table_name column)
SELECT
table_name,
DATE(_elt_timestamp) as load_date,
COUNT(*) as row_count,
COUNT(DISTINCT raw_json) as distinct_rows,
ROUND(100.0 * (1 - 1.0 * COUNT(DISTINCT raw_json) / COUNT(*)), 2) as duplicate_pct,
MIN(_elt_timestamp) as first_load,
MAX(_elt_timestamp) as last_load
FROM bronze.tables
WHERE DATE(_elt_timestamp) = CURRENT_DATE
GROUP BY 1, 2;
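A simple volume check on top of this compares today's load against the trailing week. A sketch (the 50% threshold and DATE_SUB syntax are illustrative):
-- Alert if today's row count falls below half the 7-day average
WITH daily AS (
  SELECT DATE(_elt_timestamp) as load_date, COUNT(*) as row_count
  FROM bronze.customers
  WHERE DATE(_elt_timestamp) >= DATE_SUB(CURRENT_DATE, 7)
  GROUP BY 1
)
SELECT today.load_date, today.row_count, AVG(hist.row_count) as trailing_avg
FROM daily today
JOIN daily hist ON hist.load_date < today.load_date
WHERE today.load_date = CURRENT_DATE
GROUP BY 1, 2
HAVING today.row_count < 0.5 * AVG(hist.row_count);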
Bronze → Silver → Gold Data Flow Summary
┌─────────────────────────────────────────────┐
│ Bronze → Analyze │
│ "What did source send?" │
├─────────────────────────────────────────────┤
│ ✅ If good → proceed to Silver │
│ ❌ If bad → alert data owner │
│ └─ Re-run quality in Silver │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Silver → Transform │
│ "Clean and integrate" │
├─────────────────────────────────────────────┤
│ Deduplication, type casting, null handling │
│ ✅ Pass quality gate │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ Gold → Serve │
│ "Business-ready models" │
├─────────────────────────────────────────────┤
│ Dimensional models, BI-ready │
│ ✅ Analytics & dashboards │
└─────────────────────────────────────────────┘
Key Takeaways
✅ Bronze is append-only — never modify, only add.
✅ Schema-on-read handles heterogeneous sources.
✅ Technical metadata only (_elt_timestamp, _elt_source).
✅ Deduplication happens in Silver — not Bronze.
✅ Preserve history — trace bugs back to source.
✅ Retention & archival — 90 days hot, older → cold storage.
Practice
- Design a Bronze layer for 3 different source systems (API, JSON, CSV).
- Detect duplicates in Bronze and trace root cause.
- Implement late-arriving data tracking.
- Set up PII access controls for Bronze.