Deep Dive: Bronze Layer - Raw Data Ingestion & Schema-on-Read

Level: Intermediate | Pre-reading: 03 · Medallion Architecture | Time to read: 15 min


Bronze Layer: The Raw Data Store

Purpose: Preserve Original, Transform Later

The Bronze layer is your data audit trail. It answers one question: "What did the source system send us?"

Source System (PostgreSQL):
  CREATE TABLE customers (
    id INT,
    name VARCHAR,
    email VARCHAR,
    created_at TIMESTAMP
  );

Bronze Layer (Data Lake):
  CREATE TABLE bronze.customers (
    _elt_timestamp TIMESTAMP,    -- When we loaded it
    _elt_source_id STRING,       -- Which system sent it

    -- Exact replica
    id INT,
    name VARCHAR,
    email VARCHAR,
    created_at TIMESTAMP
  );
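A minimal PySpark sketch of this landing step, assuming a JDBC-reachable Postgres and an S3 bronze path (hosts, credentials, and paths below are placeholders, not a prescribed setup):

# Pull the source table over JDBC and land it in Bronze unchanged,
# plus the two technical metadata columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://prod-db:5432/app")  # hypothetical host
    .option("dbtable", "customers")
    .option("user", "elt_reader")                         # hypothetical user
    .option("password", "...")
    .load()
)

bronze_df = (
    source_df
    .withColumn("_elt_timestamp", F.current_timestamp())
    .withColumn("_elt_source_id", F.lit("postgres_prod"))
)

# Append-only: every run adds rows, nothing is overwritten
bronze_df.write.mode("append").parquet("s3://bronze/customers")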

Design Principles

Principle 1: Append-Only, Never Modify

-- ✅ GOOD: Append only
INSERT INTO bronze.customers
SELECT 
  CURRENT_TIMESTAMP() as _elt_timestamp,
  'postgres_prod' as _elt_source_id,
  t.* 
FROM source.customers t;

-- ❌ BAD: Overwrite existing data
TRUNCATE TABLE bronze.customers;
INSERT INTO bronze.customers ...;
-- (Loses audit trail!)

-- ❌ BAD: Update individual records
UPDATE bronze.customers SET email = 'new@example.com' WHERE id = 5;
-- (Can't debug root cause anymore)

Why? When data quality breaks downstream, you can trace the problem back to its origin:

  • Bronze still has original data
  • Re-run quality checks from Bronze
  • No need to re-extract from production
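For instance, a failed check can be replayed against a past load window straight from Bronze. A sketch, reusing the session from above; the check itself is illustrative:

# Isolate the January 15th load exactly as it arrived and re-check it,
# without re-extracting anything from production.
from pyspark.sql import functions as F

bronze = spark.read.parquet("s3://bronze/customers")

jan15 = bronze.filter(F.to_date("_elt_timestamp") == "2024-01-15")

null_emails = jan15.filter(F.col("email").isNull()).count()
print(f"null emails in the 2024-01-15 load: {null_emails}")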

Principle 2: Use a Surrogate Partition Column

-- ✅ GOOD: Partition by extraction date
CREATE TABLE bronze.customers
USING PARQUET
PARTITIONED BY (extraction_date)  -- in a CTAS, the type comes from the SELECT
AS SELECT ...;

-- Partitions: 2024-01-01/, 2024-01-02/, ...
-- Fast to find specific day's data

-- Daily load
INSERT INTO bronze.customers
PARTITION (extraction_date = '2024-01-15')
SELECT * FROM source.customers;
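The same partitioning in PySpark, reusing bronze_df from the earlier sketch:

# One directory per extraction date makes backfills and replays cheap to target
from pyspark.sql import functions as F

(bronze_df
 .withColumn("extraction_date", F.current_date())
 .write
 .partitionBy("extraction_date")
 .mode("append")
 .parquet("s3://bronze/customers"))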

Principle 3: Technical Metadata, Not Transformation

-- ✅ GOOD: Technical metadata only
INSERT INTO bronze.customers (
  _elt_timestamp,
  _elt_source_id,
  _elt_row_number,    -- Row order
  _elt_source_file,   -- Which file/export

  -- Raw columns, unchanged
  id, name, email, created_at
) VALUES ...;

-- ❌ BAD: Transform in Bronze
INSERT INTO bronze.customers (
  _elt_timestamp,
  customer_key,        -- ← Don't do this!
  name_upper,          -- ← Don't do this!
  parsed_email_domain, -- ← Don't do this!
  ...
);
-- (Belongs in Silver, not Bronze!)
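In PySpark, attaching the technical metadata might look like the sketch below (source name and paths are illustrative). Note that monotonically_increasing_id() yields unique values per row but not a strict arrival order:

# Attach technical metadata only; source columns pass through untouched
from pyspark.sql import functions as F

raw = spark.read.json("s3://raw/customers/*.json")

bronze_df = (
    raw
    .withColumn("_elt_timestamp", F.current_timestamp())
    .withColumn("_elt_source_id", F.lit("customers_api"))
    .withColumn("_elt_source_file", F.input_file_name())  # which file each row came from
    # unique per row, but not a strict ordering:
    .withColumn("_elt_row_number", F.monotonically_increasing_id())
)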

Schema-on-Read: Handling Messy Sources

Problem: Heterogeneous Source Schemas

API v1 sends:
  {customer_id, name, email}

API v2 sends:
  {customer_id, first_name, last_name, email, phone}

API v3 sends:
  {customer_id, full_name, email_addresses: []}

Bronze layer receives all three:
  └─ How do you load them into a single table?

Solution: Schema-on-Read Pattern

Store the raw payload as an opaque string (flexible schema), parse in Silver:

-- Bronze: fixed envelope, flexible JSON payload
CREATE TABLE bronze.customers_raw (
  _elt_timestamp TIMESTAMP,
  _elt_source STRING,
  _elt_source_file STRING,
  raw_json STRING  -- Store entire JSON as string
);

-- Insert all variants
INSERT INTO bronze.customers_raw VALUES
  (CURRENT_TIMESTAMP(), 'api_v1', 'customers_2024-01-15.json', '{"customer_id":1,"name":"Alice"}'),
  (CURRENT_TIMESTAMP(), 'api_v2', 'customers_2024-01-15.json', '{"customer_id":2,"first_name":"Bob","last_name":"Jones"}'),
  (CURRENT_TIMESTAMP(), 'api_v3', 'customers_2024-01-15.json', '{"customer_id":3,"full_name":"Charlie"}');

-- Silver: Parse and standardize
-- (the WITH clause must follow INSERT INTO; a standalone WITH cannot
--  precede an INSERT statement in most dialects)
INSERT INTO silver.customers
WITH parsed AS (
  SELECT 
    _elt_timestamp,
    JSON_EXTRACT_SCALAR(raw_json, '$.customer_id') as customer_id,
    COALESCE(
      JSON_EXTRACT_SCALAR(raw_json, '$.name'),
      JSON_EXTRACT_SCALAR(raw_json, '$.full_name'),
      CONCAT(
        JSON_EXTRACT_SCALAR(raw_json, '$.first_name'), ' ',
        JSON_EXTRACT_SCALAR(raw_json, '$.last_name')
      )
    ) as name
  FROM bronze.customers_raw
)
SELECT 
  customer_id,
  name,
  _elt_timestamp
FROM parsed;

Benefit: Bronze remains a pure audit log (raw JSON), Silver handles parsing variations.
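A sketch of the same pattern in PySpark, assuming newline-delimited JSON files (paths are illustrative):

# Bronze: read each line as opaque text; no schema imposed at load time
from pyspark.sql import functions as F

raw = (
    spark.read.text("s3://raw/customers/*.json")  # one JSON document per line
    .withColumnRenamed("value", "raw_json")
    .withColumn("_elt_timestamp", F.current_timestamp())
)
raw.write.mode("append").parquet("s3://bronze/customers_raw")

# Silver: parse the variants, same COALESCE logic as the SQL above
bronze = spark.read.parquet("s3://bronze/customers_raw")
silver = bronze.select(
    F.get_json_object("raw_json", "$.customer_id").alias("customer_id"),
    F.coalesce(
        F.get_json_object("raw_json", "$.name"),
        F.get_json_object("raw_json", "$.full_name"),
        F.concat(  # NULL if either part is missing, mirroring SQL CONCAT
            F.get_json_object("raw_json", "$.first_name"),
            F.lit(" "),
            F.get_json_object("raw_json", "$.last_name"),
        ),
    ).alias("name"),
    F.col("_elt_timestamp"),
)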


Handling Format Changes

Scenario: Source adds new field

Week 1: Source API sends {id, name, email}
Week 2: Source API adds {phone} to the payload

Bronze handling:
  With schema inference (dynamic schema):
    └─ New column automatically detected
    └─ Older records: phone = NULL
    └─ No pipeline change needed

  With a static schema:
    └─ Pipeline fails (unexpected column)
    └─ Manual fix required

Recommendation: Use dynamic/inferred schema in Bronze

# PySpark: the JSON reader infers the schema from the data by default
df = spark.read.json("s3://raw/customers/*.json")
df.write.mode("append").parquet("s3://bronze/customers")

# New fields simply become new columns in the inferred schema
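When reading the accumulated Parquet back, the mergeSchema option reconciles files written before and after the new field appeared:

# Week-1 files lack 'phone'; mergeSchema unifies old and new file footers,
# and week-1 rows read back with phone = NULL
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("s3://bronze/customers")
)
df.printSchema()  # now includes 'phone'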

Deduplication Analysis (Bronze → Silver)

Bronze layer detects duplicate distribution:

-- Bronze: Raw data with potential duplicates
SELECT 
  customer_id,
  COUNT(*) as row_count,
  COUNT(DISTINCT name) as distinct_names,
  COUNT(DISTINCT email) as distinct_emails
FROM bronze.customers
GROUP BY 1
HAVING COUNT(*) > 1   -- aliases aren't usable in HAVING in most dialects
ORDER BY 2 DESC;

-- Output:
-- | customer_id | row_count | distinct_names | distinct_emails |
-- |           5 |         3 |              2 |               1 |  ← appears 3x, 2 different names
-- |           7 |         2 |              1 |               2 |  ← appears 2x, 2 different emails

Root cause analysis:

  • Duplicates in source system? (bug in source app)
  • Duplicate extraction? (extraction logic re-ran)
  • Late-arriving data? (event arrived twice)

Solution in Silver:

-- Deduplication strategy
INSERT INTO silver.customers
SELECT 
  customer_id,
  MAX(name) as name,           -- or FIRST_VALUE(name) by timestamp
  MAX(email) as email,
  MAX(created_at) as created_at
FROM bronze.customers
WHERE extraction_date = CURRENT_DATE
GROUP BY 1;  -- Deduplicate by business key

-- Logs duplicates for root cause analysis
INSERT INTO data_quality.duplicate_report
SELECT 
  'customers' as table_name,
  customer_id,
  COUNT(*) as duplicate_count
FROM bronze.customers
WHERE extraction_date = CURRENT_DATE
GROUP BY 1, 2
HAVING COUNT(*) > 1;
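Note that MAX() per column can stitch together values from different duplicate rows. A window-based sketch that instead keeps the most recently loaded row whole:

# Keep the latest-loaded row per business key intact
from pyspark.sql import Window, functions as F

w = Window.partitionBy("customer_id").orderBy(F.col("_elt_timestamp").desc())

deduped = (
    spark.read.parquet("s3://bronze/customers")
    .withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)   # latest load wins
    .drop("_rn")
)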


Handling Late-Arriving Data in Bronze

-- Bronze: Preserve arrival order
INSERT INTO bronze.orders (
  _elt_timestamp,       -- 2024-01-20 (loaded today)
  _elt_source_date,     -- 2024-01-01 (order date)
  order_id,
  amount
) VALUES
  (CURRENT_TIMESTAMP(), '2024-01-01', 12345, 99.99),  -- Timely
  (CURRENT_TIMESTAMP(), '2023-12-30', 12340, 149.99); -- Late (21 days)

-- Silver: Identify late arrivals in today's load
-- ('2024-01-15' is the cutoff for this run: anything older counts as late)
SELECT 
  DATE(order_date) as order_date,
  COUNT(*) as num_orders,
  COUNT(CASE WHEN order_date < '2024-01-15' THEN 1 END) as late_arriving,
  ROUND(100.0 * COUNT(CASE WHEN order_date < '2024-01-15' THEN 1 END) / COUNT(*), 2) as pct_late
FROM silver.orders
WHERE DATE(_elt_timestamp) = CURRENT_DATE
GROUP BY 1;
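A per-row variant in PySpark, measuring lateness as load date minus order date rather than against a fixed cutoff (the path is illustrative):

# Lateness per row: load date minus order date
from pyspark.sql import functions as F

orders = spark.read.parquet("s3://silver/orders")

late = orders.withColumn(
    "days_late",
    F.datediff(F.to_date("_elt_timestamp"), F.to_date("order_date")),
).filter(F.col("days_late") > 1)

late.groupBy("order_date").count().show()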

Compliance & Retention

PII in Bronze?

-- Bronze stores raw data including PII
CREATE TABLE bronze.customers (
  _elt_timestamp TIMESTAMP,
  customer_id INT,
  name VARCHAR,        -- PII
  email VARCHAR,       -- PII
  phone VARCHAR,       -- PII
  ssn VARCHAR          -- PII
);

-- Access control: Who can read Bronze?
-- - Data engineers (need to see raw for debugging)
-- - Data quality team (need to see raw for audits)
-- - NOT analysts (apply masking in Silver/Gold)

-- Retention: How long to keep Bronze?
-- - Typical: 90 days (cost of raw storage)
-- - Some: 1 year (compliance requirement)
-- - Archive: Older data → cloud storage (cheaper)
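One common approach, sketched below: Bronze keeps raw PII under restricted access, while Silver exposes hashed stand-ins. (Unsalted hashing is not true anonymization for low-entropy fields like SSNs; treat this as a starting point, not a compliance guarantee.)

# Analysts can still join and count on the hashes without seeing raw values
from pyspark.sql import functions as F

bronze = spark.read.parquet("s3://bronze/customers")

silver = (
    bronze
    .withColumn("email_hash", F.sha2(F.col("email"), 256))
    .withColumn("ssn_hash", F.sha2(F.col("ssn"), 256))
    .drop("email", "ssn", "phone")  # raw PII never leaves Bronze
)
silver.write.mode("append").parquet("s3://silver/customers")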

Archival Example

-- Monthly archival process (BigQuery syntax; EXPORT DATA allows one wildcard)
EXPORT DATA OPTIONS(
  uri='gs://archive-bucket/bronze/customers/2023-12/*.parquet',
  format='PARQUET'
) AS
SELECT * FROM bronze.customers
WHERE DATE(_elt_timestamp) BETWEEN '2023-12-01' AND '2023-12-31';

-- Delete from hot storage (save cost)
DELETE FROM bronze.customers 
WHERE DATE(_elt_timestamp) < DATE_SUB(CURRENT_DATE, INTERVAL 90 DAY);

Bronze Layer Monitoring

Check daily for quality issues at source:

-- Data Quality: Bronze layer SLA (run once per Bronze table)
SELECT 
  'customers_raw' as table_name,
  DATE(_elt_timestamp) as load_date,
  COUNT(*) as row_count,
  COUNT(DISTINCT raw_json) as distinct_rows,
  ROUND(100.0 * (1 - COUNT(DISTINCT raw_json) / COUNT(*)), 2) as duplicate_pct,
  MIN(_elt_timestamp) as first_load,
  MAX(_elt_timestamp) as last_load
FROM bronze.customers_raw
WHERE DATE(_elt_timestamp) = CURRENT_DATE
GROUP BY 1, 2;

Bronze → Silver → Gold Data Flow Summary

┌─────────────────────────────────────────────┐
│ Bronze → Analyze                            │
│ "What did source send?"                     │
├─────────────────────────────────────────────┤
│ ✅ If good  → proceed to Silver             │
│ ❌ If bad   → alert data owner              │
│    └─ Fix at source, re-load, re-check      │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Silver → Transform                          │
│ "Clean and integrate"                       │
├─────────────────────────────────────────────┤
│ Deduplication, type casting, null handling  │
│ ✅ Pass quality gate                        │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Gold → Serve                                │
│ "Business-ready models"                     │
├─────────────────────────────────────────────┤
│ Dimensional models, BI-ready                │
│ ✅ Analytics & dashboards                   │
└─────────────────────────────────────────────┘

Key Takeaways

Bronze is append-only — never modify, only add.

Schema-on-read handles heterogeneous sources.

Technical metadata only (_elt_timestamp, _elt_source).

Deduplication happens in Silver — not Bronze.

Preserve history — trace bugs back to source.

Retention & archival — 90 days hot, older → cold storage.


Practice

  1. Design a Bronze layer for three different sources (a REST API, JSON file drops, and CSV exports).
  2. Detect duplicates in Bronze and trace root cause.
  3. Implement late-arriving data tracking.
  4. Set up PII access controls for Bronze.