3.02 ยท Deep Dive: Silver Layer - Data Cleaning & Validation

Level: Intermediate to Advanced Time to read: 18 min Pre-reading: 03 ยท Medallion Architecture After reading: You'll understand the Silver layer's role in data quality, deduplication, and creating a single source of truth for downstream analytics.


Silver Layer: Purpose & Design

The Silver layer transforms raw Bronze data into validated, deduplicated, business-ready data.

Characteristics

Aspect Detail
Data Quality Certified (passed validation checks)
Schema Cleaned, deduplicated, conformed to business standards
Duplication Removed via deduplication logic
Update Strategy SCD Type 2 (track all versions)
Ownership Data engineering team
Consumer Grade For analytics engineers and data scientists

Silver Layer Transformation Pipeline

graph LR
    B["๐Ÿฅ‰ Bronze<br/>(Raw, As-Is)"]
    SC["Data Quality<br/>Checks"]
    DD["Deduplication<br/>Logic"]
    TR["Type<br/>Conversions"]
    JN["Join &<br/>Enrich"]
    S["๐Ÿฅˆ Silver<br/>(Validated,<br/>Deduplicated)"]

    B -->|Detect anomalies| SC
    SC -->|Remove duplicates| DD
    DD -->|Cast to business types| TR
    TR -->|Join reference data| JN
    JN -->|Load| S

Transformation 1: Data Quality Checks

What to Validate

-- Silver layer: Validated customers table
CREATE TABLE silver.customers (
  customer_id INT,
  email STRING,
  phone STRING,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  _load_date DATE,
  _is_valid BOOLEAN,
  _quality_issues ARRAY<STRING>
);

-- Validation logic
INSERT INTO silver.customers
WITH validated AS (
  SELECT
    customer_id,
    email,
    phone,
    created_at,
    updated_at,
    CURRENT_DATE() as _load_date,
    CASE
      WHEN customer_id IS NULL THEN FALSE
      WHEN email NOT LIKE '%@%.%' THEN FALSE
      WHEN phone NOT LIKE '+%' AND phone IS NOT NULL THEN FALSE
      WHEN created_at > CURRENT_TIMESTAMP() THEN FALSE
      ELSE TRUE
    END as _is_valid,
    CASE
      WHEN customer_id IS NULL THEN ['missing_customer_id']
      WHEN email NOT LIKE '%@%.%' THEN ['invalid_email']
      WHEN created_at > CURRENT_TIMESTAMP() THEN ['future_created_date']
      ELSE []
    END as _quality_issues
  FROM bronze.customers
)
SELECT * FROM validated;

Quality Dimensions (DAMA Framework)

Dimension Definition Example Check
Completeness Required fields are not null email IS NOT NULL
Uniqueness Business keys have no duplicates COUNT(customer_id) = COUNT(DISTINCT customer_id)
Timeliness Data is fresh enough created_at < CURRENT_TIMESTAMP()
Accuracy Values match business rules email LIKE '%@%.%'
Consistency Data matches across sources customer_id in gold.customers matches silver.customers

Transformation 2: Deduplication

Pattern: Remove Exact Duplicates

-- Deduplication using ROW_NUMBER()
WITH deduped AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id, email
      ORDER BY _elt_timestamp DESC
    ) as rn
  FROM bronze.customers
)
SELECT * EXCEPT (rn)
FROM deduped
WHERE rn = 1;  -- Keep only the latest version

Pattern: Handle Incremental Duplicates

-- When the same customer appears twice in one load
-- (e.g., bug in source system produces duplicate rows)

WITH deduped AS (
  SELECT
    customer_id,
    email,
    phone,
    -- Keep the row with most recent update
    CASE 
      WHEN ROW_NUMBER() OVER (
        PARTITION BY customer_id 
        ORDER BY updated_at DESC, _elt_timestamp DESC
      ) = 1 THEN TRUE 
      ELSE FALSE 
    END as is_current
  FROM bronze.customers
)
SELECT *
FROM deduped
WHERE is_current = TRUE;

Transformation 3: Type Conversions & Normalization

Standardize Data Types

-- Convert to business-friendly types
CREATE OR REPLACE TABLE silver.customers AS
SELECT
  CAST(customer_id AS INT64) as customer_id,
  LOWER(TRIM(email)) as email,  -- Normalize email (lowercase, trim)
  REGEXP_REPLACE(phone, '[^0-9+]', '') as phone,  -- Remove formatting
  CAST(created_at AS TIMESTAMP) as created_at,
  CAST(updated_at AS TIMESTAMP) as updated_at,

  -- Add technical metadata
  CURRENT_DATE() as _load_date,
  1 as _dbt_version  -- Track transformation version
FROM bronze.customers
WHERE _is_valid = TRUE;  -- Only keep validated rows

Common Conversions

Type From โ†’ To Example
Email Raw โ†’ Lowercase, trimmed "JOHN@EXAMPLE.COM " โ†’ "john@example.com"
Phone Raw โ†’ Standardized "(555) 123-4567" โ†’ "+15551234567"
Date String โ†’ DATE/TIMESTAMP "2024-01-15" โ†’ 2024-01-15
Currency String โ†’ Decimal "$1,234.56" โ†’ 1234.56
Boolean String/Int โ†’ Boolean "Y"/"N" โ†’ TRUE/FALSE, 1/0 โ†’ TRUE/FALSE

Transformation 4: Joining Reference Data

Enrich with Business Context

-- Silver layer: Add business context from dimension tables
CREATE OR REPLACE TABLE silver.customers_enriched AS
SELECT
  c.customer_id,
  c.email,
  c.phone,
  c.created_at,

  -- Enrich with geography reference
  g.country_name,
  g.region_id,

  -- Enrich with segment reference
  s.customer_segment,
  s.lifetime_value_bucket,

  c._load_date
FROM silver.customers c
LEFT JOIN silver.dim_geography g
  ON c.country_code = g.country_code
LEFT JOIN silver.customer_segments s
  ON c.customer_id = s.customer_id
WHERE c._load_date = CURRENT_DATE();

Silver Layer Materialization Strategy

-- SCD Type 6: Keep all versions with effective dates
CREATE TABLE silver.customers_scd (
  customer_key INT64,  -- Surrogate key
  customer_id INT64,   -- Business key
  email STRING,
  phone STRING,

  -- SCD Type 2 Tracking
  effective_date DATE,
  end_date DATE,
  is_current BOOLEAN,

  _load_date DATE,
  _version INT
);

-- Insert new versions
INSERT INTO silver.customers_scd
WITH latest_bronze AS (
  SELECT DISTINCT
    customer_id,
    email,
    phone,
    CURRENT_DATE() as effective_date,
    NULL as end_date,
    TRUE as is_current
  FROM bronze.customers
  WHERE _is_valid = TRUE
)
SELECT
  GENERATE_UNIQUE_ID() as customer_key,
  customer_id,
  email,
  phone,
  effective_date,
  end_date,
  is_current,
  CURRENT_DATE() as _load_date,
  1 as _version
FROM latest_bronze;

Option 2: Merge Pattern (Idempotent)

-- MERGE: Update existing, insert new
MERGE INTO silver.customers_scd t
USING (
  SELECT DISTINCT
    customer_id,
    email,
    phone,
    CURRENT_DATE() as effective_date
  FROM bronze.customers
  WHERE _is_valid = TRUE
) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND t.email != s.email THEN
  UPDATE SET
    end_date = CURRENT_DATE() - 1,
    is_current = FALSE
WHEN NOT MATCHED THEN
  INSERT (customer_key, customer_id, email, phone, effective_date, end_date, is_current, _load_date)
  VALUES (GENERATE_UNIQUE_ID(), s.customer_id, s.email, s.phone, s.effective_date, NULL, TRUE, CURRENT_DATE());

Silver Layer Testing

dbt Tests

# dbt/models/silver/silver_customers.yml
version: 2

models:
  - name: silver_customers
    description: "Validated and deduplicated customer dimension"
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
          - relationships:
              to: ref('bronze_customers')
              field: customer_id

      - name: email
        tests:
          - not_null
          - custom_regex:
              pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

      - name: created_at
        tests:
          - not_null
          - custom:
              expression: '{{ column_name }} < current_timestamp()'

Silver vs. Gold Layer

Aspect Silver Gold
Purpose Validated, deduplicated source of truth Business-ready analytics tables
Transformations Quality checks, deduplication, type conversion Business logic, aggregations, conformed dimensions
Update Strategy SCD Type 2 (track all versions) SCD Type 1 or 2 depending on use case
Consumer Analytics engineers, data scientists Business analysts, BI tools
Freshness Near real-time (hours) Batch (daily)
Schema Technical (source-aligned) Business (domain-aligned)

Common Silver Layer Patterns

Pattern 1: Fan-Out (One Source to Many Outputs)

bronze.customers
    โ”œโ”€โ†’ silver.customers (SCD Type 2)
    โ”œโ”€โ†’ silver.customer_emails (deduplicated)
    โ””โ”€โ†’ silver.customer_addresses (deduplicated, joined with reference)

Pattern 2: Complex Deduplication

When source has duplicates from multiple reasons:

WITH scored_dupes AS (
  SELECT
    *,
    CASE
      WHEN _data_quality_score > 0.9 THEN 1  -- High quality = keep first
      ELSE 2  -- Lower quality = keep latest update
    END as priority
  FROM bronze.customers
)
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY priority ASC, updated_at DESC
    ) as rn
  FROM scored_dupes
)
WHERE rn = 1;

Best Practices

โœ… Always validate before deduplication

  • Catch quality issues early
  • Document why records are invalid

โœ… Keep historical data

  • SCD Type 2 for audit trail
  • Support "as of" analysis

โœ… Test aggressively

  • Uniqueness checks on business keys
  • Referential integrity tests

โœ… Document transformations

  • Why deduplication logic chosen
  • When rules changed
  • Impact on downstream consumers

โŒ Don't lose information

  • Never overwrite Bronze layer
  • Always SCD Type 2 for significant changes
  • Keep all versions with effective dates

Key Takeaways

  1. Silver layer validates and deduplicates raw Bronze data
  2. Quality is multi-dimensional: completeness, uniqueness, timeliness, accuracy, consistency
  3. SCD Type 2 pattern preserves history for debugging and compliance
  4. Deduplication is logic-specific: understand why records are duplicates
  5. Silver is single source of truth โ€” all downstream models should use Silver, not Bronze