3.02 ยท Deep Dive: Silver Layer - Data Cleaning & Validation
Level: Intermediate to Advanced Time to read: 18 min Pre-reading: 03 ยท Medallion Architecture After reading: You'll understand the Silver layer's role in data quality, deduplication, and creating a single source of truth for downstream analytics.
Silver Layer: Purpose & Design
The Silver layer transforms raw Bronze data into validated, deduplicated, business-ready data.
Characteristics
| Aspect | Detail |
|---|---|
| Data Quality | Certified (passed validation checks) |
| Schema | Cleaned, deduplicated, conformed to business standards |
| Duplication | Removed via deduplication logic |
| Update Strategy | SCD Type 2 (track all versions) |
| Ownership | Data engineering team |
| Consumer Grade | For analytics engineers and data scientists |
Silver Layer Transformation Pipeline
graph LR
B["๐ฅ Bronze<br/>(Raw, As-Is)"]
SC["Data Quality<br/>Checks"]
DD["Deduplication<br/>Logic"]
TR["Type<br/>Conversions"]
JN["Join &<br/>Enrich"]
S["๐ฅ Silver<br/>(Validated,<br/>Deduplicated)"]
B -->|Detect anomalies| SC
SC -->|Remove duplicates| DD
DD -->|Cast to business types| TR
TR -->|Join reference data| JN
JN -->|Load| S
Transformation 1: Data Quality Checks
What to Validate
-- Silver layer: Validated customers table
CREATE TABLE silver.customers (
customer_id INT,
email STRING,
phone STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP,
_load_date DATE,
_is_valid BOOLEAN,
_quality_issues ARRAY<STRING>
);
-- Validation logic
INSERT INTO silver.customers
WITH validated AS (
SELECT
customer_id,
email,
phone,
created_at,
updated_at,
CURRENT_DATE() as _load_date,
CASE
WHEN customer_id IS NULL THEN FALSE
WHEN email NOT LIKE '%@%.%' THEN FALSE
WHEN phone NOT LIKE '+%' AND phone IS NOT NULL THEN FALSE
WHEN created_at > CURRENT_TIMESTAMP() THEN FALSE
ELSE TRUE
END as _is_valid,
CASE
WHEN customer_id IS NULL THEN ['missing_customer_id']
WHEN email NOT LIKE '%@%.%' THEN ['invalid_email']
WHEN created_at > CURRENT_TIMESTAMP() THEN ['future_created_date']
ELSE []
END as _quality_issues
FROM bronze.customers
)
SELECT * FROM validated;
Quality Dimensions (DAMA Framework)
| Dimension | Definition | Example Check |
|---|---|---|
| Completeness | Required fields are not null | email IS NOT NULL |
| Uniqueness | Business keys have no duplicates | COUNT(customer_id) = COUNT(DISTINCT customer_id) |
| Timeliness | Data is fresh enough | created_at < CURRENT_TIMESTAMP() |
| Accuracy | Values match business rules | email LIKE '%@%.%' |
| Consistency | Data matches across sources | customer_id in gold.customers matches silver.customers |
Transformation 2: Deduplication
Pattern: Remove Exact Duplicates
-- Deduplication using ROW_NUMBER()
WITH deduped AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id, email
ORDER BY _elt_timestamp DESC
) as rn
FROM bronze.customers
)
SELECT * EXCEPT (rn)
FROM deduped
WHERE rn = 1; -- Keep only the latest version
Pattern: Handle Incremental Duplicates
-- When the same customer appears twice in one load
-- (e.g., bug in source system produces duplicate rows)
WITH deduped AS (
SELECT
customer_id,
email,
phone,
-- Keep the row with most recent update
CASE
WHEN ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC, _elt_timestamp DESC
) = 1 THEN TRUE
ELSE FALSE
END as is_current
FROM bronze.customers
)
SELECT *
FROM deduped
WHERE is_current = TRUE;
Transformation 3: Type Conversions & Normalization
Standardize Data Types
-- Convert to business-friendly types
CREATE OR REPLACE TABLE silver.customers AS
SELECT
CAST(customer_id AS INT64) as customer_id,
LOWER(TRIM(email)) as email, -- Normalize email (lowercase, trim)
REGEXP_REPLACE(phone, '[^0-9+]', '') as phone, -- Remove formatting
CAST(created_at AS TIMESTAMP) as created_at,
CAST(updated_at AS TIMESTAMP) as updated_at,
-- Add technical metadata
CURRENT_DATE() as _load_date,
1 as _dbt_version -- Track transformation version
FROM bronze.customers
WHERE _is_valid = TRUE; -- Only keep validated rows
Common Conversions
| Type | From โ To | Example |
|---|---|---|
| Raw โ Lowercase, trimmed | "JOHN@EXAMPLE.COM " โ "john@example.com" | |
| Phone | Raw โ Standardized | "(555) 123-4567" โ "+15551234567" |
| Date | String โ DATE/TIMESTAMP | "2024-01-15" โ 2024-01-15 |
| Currency | String โ Decimal | "$1,234.56" โ 1234.56 |
| Boolean | String/Int โ Boolean | "Y"/"N" โ TRUE/FALSE, 1/0 โ TRUE/FALSE |
Transformation 4: Joining Reference Data
Enrich with Business Context
-- Silver layer: Add business context from dimension tables
CREATE OR REPLACE TABLE silver.customers_enriched AS
SELECT
c.customer_id,
c.email,
c.phone,
c.created_at,
-- Enrich with geography reference
g.country_name,
g.region_id,
-- Enrich with segment reference
s.customer_segment,
s.lifetime_value_bucket,
c._load_date
FROM silver.customers c
LEFT JOIN silver.dim_geography g
ON c.country_code = g.country_code
LEFT JOIN silver.customer_segments s
ON c.customer_id = s.customer_id
WHERE c._load_date = CURRENT_DATE();
Silver Layer Materialization Strategy
Option 1: Append-Only Tables (Recommended)
-- SCD Type 6: Keep all versions with effective dates
CREATE TABLE silver.customers_scd (
customer_key INT64, -- Surrogate key
customer_id INT64, -- Business key
email STRING,
phone STRING,
-- SCD Type 2 Tracking
effective_date DATE,
end_date DATE,
is_current BOOLEAN,
_load_date DATE,
_version INT
);
-- Insert new versions
INSERT INTO silver.customers_scd
WITH latest_bronze AS (
SELECT DISTINCT
customer_id,
email,
phone,
CURRENT_DATE() as effective_date,
NULL as end_date,
TRUE as is_current
FROM bronze.customers
WHERE _is_valid = TRUE
)
SELECT
GENERATE_UNIQUE_ID() as customer_key,
customer_id,
email,
phone,
effective_date,
end_date,
is_current,
CURRENT_DATE() as _load_date,
1 as _version
FROM latest_bronze;
Option 2: Merge Pattern (Idempotent)
-- MERGE: Update existing, insert new
MERGE INTO silver.customers_scd t
USING (
SELECT DISTINCT
customer_id,
email,
phone,
CURRENT_DATE() as effective_date
FROM bronze.customers
WHERE _is_valid = TRUE
) s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND t.email != s.email THEN
UPDATE SET
end_date = CURRENT_DATE() - 1,
is_current = FALSE
WHEN NOT MATCHED THEN
INSERT (customer_key, customer_id, email, phone, effective_date, end_date, is_current, _load_date)
VALUES (GENERATE_UNIQUE_ID(), s.customer_id, s.email, s.phone, s.effective_date, NULL, TRUE, CURRENT_DATE());
Silver Layer Testing
dbt Tests
# dbt/models/silver/silver_customers.yml
version: 2
models:
- name: silver_customers
description: "Validated and deduplicated customer dimension"
columns:
- name: customer_id
tests:
- not_null
- unique
- relationships:
to: ref('bronze_customers')
field: customer_id
- name: email
tests:
- not_null
- custom_regex:
pattern: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
- name: created_at
tests:
- not_null
- custom:
expression: '{{ column_name }} < current_timestamp()'
Silver vs. Gold Layer
| Aspect | Silver | Gold |
|---|---|---|
| Purpose | Validated, deduplicated source of truth | Business-ready analytics tables |
| Transformations | Quality checks, deduplication, type conversion | Business logic, aggregations, conformed dimensions |
| Update Strategy | SCD Type 2 (track all versions) | SCD Type 1 or 2 depending on use case |
| Consumer | Analytics engineers, data scientists | Business analysts, BI tools |
| Freshness | Near real-time (hours) | Batch (daily) |
| Schema | Technical (source-aligned) | Business (domain-aligned) |
Common Silver Layer Patterns
Pattern 1: Fan-Out (One Source to Many Outputs)
bronze.customers
โโโ silver.customers (SCD Type 2)
โโโ silver.customer_emails (deduplicated)
โโโ silver.customer_addresses (deduplicated, joined with reference)
Pattern 2: Complex Deduplication
When source has duplicates from multiple reasons:
WITH scored_dupes AS (
SELECT
*,
CASE
WHEN _data_quality_score > 0.9 THEN 1 -- High quality = keep first
ELSE 2 -- Lower quality = keep latest update
END as priority
FROM bronze.customers
)
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY priority ASC, updated_at DESC
) as rn
FROM scored_dupes
)
WHERE rn = 1;
Best Practices
โ Always validate before deduplication
- Catch quality issues early
- Document why records are invalid
โ Keep historical data
- SCD Type 2 for audit trail
- Support "as of" analysis
โ Test aggressively
- Uniqueness checks on business keys
- Referential integrity tests
โ Document transformations
- Why deduplication logic chosen
- When rules changed
- Impact on downstream consumers
โ Don't lose information
- Never overwrite Bronze layer
- Always SCD Type 2 for significant changes
- Keep all versions with effective dates
Key Takeaways
- Silver layer validates and deduplicates raw Bronze data
- Quality is multi-dimensional: completeness, uniqueness, timeliness, accuracy, consistency
- SCD Type 2 pattern preserves history for debugging and compliance
- Deduplication is logic-specific: understand why records are duplicates
- Silver is single source of truth โ all downstream models should use Silver, not Bronze