9.03 · Deep Dive: Data Lineage & Impact Analysis
Level: Advanced · Time to read: 15 min · Pre-reading: 09 · Enterprise Data Management
After reading: You'll understand lineage tracking, impact analysis, compliance automation, and how to trace data dependencies.
Data Lineage: The Complete Picture
Data Lineage = Complete record of how data flows through your systems
Source System → Extract → Transform → Load → BI Tool
      ↓                                        ↓
 Operational       (lineage tracks         Dashboard
  Database           every step)
Questions lineage answers:
✅ Where did this data come from?
✅ How many transformations happened?
✅ What happens if I delete this?
✅ Who can see this data?
✅ When was it last updated?
Types of Lineage
Upstream Lineage (What feeds into this?)
gold_revenue_by_product
↑
├─ fact_sales
│ ↑
│ ├─ silver_orders
│ │ ↑
│ │ └─ bronze_orders
│ │ ↑
│ │ └─ source_ecommerce.orders
│ │
│ └─ silver_customers
│ ↑
│ └─ bronze_customers
│ ↑
│ └─ source_salesforce.accounts
│
└─ dim_product
↑
└─ bronze_products
↑
└─ source_ecommerce.products
Depth: 3 transformation layers (bronze → silver → gold), 3 source tables
Downstream Lineage (What depends on this?)
bronze_customers
↓
├─ silver_customers
│ ↓
│ ├─ gold_dim_customer
│ │ ↓
│ │ ├─ Dashboard: Customer Overview
│ │ ├─ Report: Customer Segmentation
│ │ └─ Fact table: fact_sales
│ │ ↓
│ │ ├─ Dashboard: Sales Performance
│ │ └─ ML Model: Churn Prediction
│ │
│ └─ Metadata catalog entry
│
└─ Data quality dashboard
Impact: Changes to bronze_customers ripple to 9 downstream assets in this graph alone
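Both lineage directions are the same graph walk, just with the edges reversed. A minimal sketch, assuming a hypothetical edge list hand-copied from the diagrams above (a real catalog would supply these edges through its API):

```python
from collections import deque

# Hypothetical (upstream, downstream) edges taken from the diagrams above.
EDGES = [
    ("source_salesforce.accounts", "bronze_customers"),
    ("bronze_customers", "silver_customers"),
    ("silver_customers", "gold_dim_customer"),
    ("gold_dim_customer", "dashboard_customer_overview"),
    ("gold_dim_customer", "fact_sales"),
    ("fact_sales", "dashboard_sales_performance"),
    ("fact_sales", "ml_churn_prediction"),
]

def trace(asset, direction="downstream"):
    """BFS over the lineage graph in either direction."""
    adj = {}
    for up, down in EDGES:
        src, dst = (up, down) if direction == "downstream" else (down, up)
        adj.setdefault(src, []).append(dst)
    seen, queue = set(), deque([asset])
    while queue:
        for nxt in adj.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

trace("bronze_customers")               # all downstream dependents (6 assets)
trace("gold_dim_customer", "upstream")  # all upstream ancestors (3 assets)
```

Impact analysis is just the downstream call; the upstream call on gold_dim_customer walks its ancestry back to source_salesforce.accounts.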
Lineage Capture Methods
Method 1: Manual Documentation (Not Recommended)
# docs/data_lineage.yml
tables:
  gold_dim_customer:
    sources:
      - silver_customers (primary)
      - silver_customer_segments (enrichment)
    transformations:
      - Type conversions (email to lowercase)
      - Joins (geography dimension)
      - SCD Type 2 tracking
    downstream:
      - "Dashboard: Customer Overview"
      - "Dashboard: Segment Analysis"
      - "Fact table: fact_sales"
    owner: VP Sales
    certified: true
    last_updated: 2024-04-23

# Problems:
# ❌ Manually maintained (goes stale)
# ❌ No automation (error-prone)
# ❌ Disconnected from the code (what actually changed?)
Method 2: Automated from dbt (Recommended)
-- dbt automatically tracks lineage through ref() and source()
-- models/gold/gold_dim_customer.sql
{{
  config(
    materialized='table',
    tags=['daily', 'critical']
  )
}}

SELECT
  c.customer_id,
  c.email,
  s.segment,
  g.country
FROM {{ ref('silver_customers') }} c          -- lineage: silver_customers → this model
LEFT JOIN {{ ref('silver_customer_segments') }} s
  ON c.customer_id = s.customer_id
LEFT JOIN {{ ref('dim_geography') }} g
  ON c.country_code = g.country_code

-- dbt generates:
--   Lineage graph:
--     source.ecommerce.customers → silver_customers → gold_dim_customer
--     silver_customer_segments → gold_dim_customer
--     dim_geography → gold_dim_customer
--   Dependencies:
--     upstream: 3 sources/models
--     downstream: all models that ref() this one
--   Visualization:
--     dbt docs shows the DAG automatically
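Because dbt's lineage lives in its compiled artifacts, you can also read it programmatically: after `dbt compile` or `dbt run`, `target/manifest.json` records each node's upstream dependencies. A sketch using an inlined, simplified manifest (real code would `json.load` the file; the `model.shop.*` names are hypothetical):

```python
# Simplified stand-in for target/manifest.json: each node's
# depends_on.nodes lists its upstream models and sources.
manifest = {
    "nodes": {
        "model.shop.gold_dim_customer": {
            "depends_on": {"nodes": [
                "model.shop.silver_customers",
                "model.shop.silver_customer_segments",
                "model.shop.dim_geography",
            ]},
        },
        "model.shop.silver_customers": {
            "depends_on": {"nodes": ["source.shop.ecommerce.customers"]},
        },
    },
}

def lineage_edges(manifest):
    """Yield (upstream, downstream) pairs from a dbt manifest dict."""
    for name, node in manifest["nodes"].items():
        for parent in node.get("depends_on", {}).get("nodes", []):
            yield parent, name

for up, down in lineage_edges(manifest):
    print(f"{up} -> {down}")
```

These edges are exactly what a catalog or the traversal sketches in this chapter consume; dbt docs renders the same graph visually.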
Method 3: OpenLineage (Platform Agnostic)
# OpenLineage: open standard for lineage events across tools
# Works with Airflow, Spark, dbt, Fivetran, etc.
# (sketch: import paths vary slightly between client versions)
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

# When the Spark job runs, emit a run event:
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime="2024-04-23T10:00:00Z",
    # run facets (e.g. a parent-run facet linking to the orchestrating
    # DAG run) can be attached via Run(runId=..., facets={...})
    run=Run(runId="job-123-run-001"),
    job=Job(namespace="spark://cluster", name="transform_orders"),
    inputs=[Dataset(namespace="snowflake://prod", name="source_orders")],
    outputs=[Dataset(namespace="snowflake://prod", name="silver_orders")],
    producer="https://openlineage.io/producers/spark",
)

# Send to the data catalog / lineage backend
client = OpenLineageClient(url="http://lineage-backend:5000")
client.emit(event)

# Result: automatic lineage tracking across all tools
Impact Analysis: "What if I change X?"
Scenario: Column Rename
Q: What if I rename customer_email → email?
Impact Analysis:
├─ Direct dependents:
│ ├─ Dashboard: Customer Overview (uses customer_email)
│ ├─ Report: Email Marketing (uses customer_email)
│ ├─ Fact table: fact_sales (joins using customer_email)
│ ├─ ML Model: Churn Predictor (feature: customer_email)
│ └─ 3 other models
│
├─ Indirect dependents:
│ ├─ Executive Dashboard (depends on fact_sales)
│ ├─ ML API (serves churn model)
│ └─ 12 other downstream assets
│
├─ Risk Assessment:
│  ├─ HIGH: Production dashboards will break
│  ├─ MEDIUM: ML model needs retraining
│  └─ LOW: Internal reports need updating
│
└─ Recommendation:
   ├─ Create alias: customer_email = email (backward compatible)
   ├─ 3-month deprecation period (notify users)
   ├─ Complete cutover only after all downstream assets are updated
   └─ Avoid breaking changes in production
Automated Impact Queries
-- Query: what's affected by this change?
CREATE PROCEDURE impact_analysis(
  table_name STRING,
  column_name STRING
)
LANGUAGE PYTHON
AS $$
# Step 1: find direct dependents
direct = catalog.get_references(table_name, column_name)
# e.g. 8 models, 5 dashboards, 2 reports

# Step 2: find indirect dependents (transitive)
indirect = set()
for dependent in direct:
    indirect.update(catalog.get_references(dependent.name))
# e.g. 15 additional downstream assets

# Step 3: classify by risk
# CRITICAL: production dashboards, ML models
# HIGH: reports
# MEDIUM: data quality checks
# LOW: metadata only
risks = {asset: classify_risk(asset) for asset in list(direct) + list(indirect)}
high_risk = [a for a, r in risks.items() if r in ('CRITICAL', 'HIGH')]

# Step 4: generate report
print(f"Impact of {table_name}.{column_name}:")
print(f"  Direct: {len(direct)} assets")
print(f"  Indirect: {len(indirect)} assets")
print(f"  High risk: {len(high_risk)} assets need urgent updates")
$$
-- Run impact analysis before making changes
CALL impact_analysis('dim_customer', 'email');
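The stored procedure is a sketch that assumes a `catalog` service; the same steps can be run as plain Python. A self-contained version in which the catalog contents and the prefix-based risk rules are entirely hypothetical:

```python
# Hypothetical catalog: asset -> direct dependents.
CATALOG = {
    "dim_customer.email": ["gold_dim_customer", "report_email_marketing"],
    "gold_dim_customer": ["dashboard_customer_overview", "fact_sales"],
    "fact_sales": ["dashboard_sales_performance", "ml_churn_model"],
}
# Toy risk rules keyed on asset-name prefix (real rules would use metadata).
RISK = {"dashboard": "CRITICAL", "ml": "CRITICAL", "report": "HIGH"}

def impact_analysis(asset):
    direct = CATALOG.get(asset, [])
    # Transitive closure over the dependency graph (depth-first).
    indirect, stack = set(), list(direct)
    while stack:
        node = stack.pop()
        for dep in CATALOG.get(node, []):
            if dep not in indirect and dep not in direct:
                indirect.add(dep)
                stack.append(dep)

    def risk(name):
        return RISK.get(name.split("_")[0], "MEDIUM")

    report = {a: risk(a) for a in list(direct) + sorted(indirect)}
    return direct, indirect, report

direct, indirect, report = impact_analysis("dim_customer.email")
print(f"Direct: {len(direct)}, Indirect: {len(indirect)}")  # Direct: 2, Indirect: 4
```

The returned report maps each affected asset to a risk level, which is exactly what you'd review before approving a schema change.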
Lineage in Action: Incident Response
Scenario: Bad Data in Gold Layer
Incident: gold_dim_customer has 10K duplicate records
Step 1: Identify source
└─ Lineage question: "Where did these come from?"
├─ gold_dim_customer ← silver_customers ← bronze_customers ← source_salesforce
└─ Traced to: bronze_customers (duplicates from source system)
Step 2: Assess impact
└─ Lineage question: "What's affected?"
├─ Direct: 5 dashboards using gold_dim_customer
├─ Indirect: 15 reports depending on dashboards
├─ Critical: fact_sales (used by 20+ queries)
└─ Total: 40 downstream assets affected
Step 3: Prioritize fix
└─ Severity by criticality:
├─ CRITICAL: fact_sales (used by executives)
├─ HIGH: gold_dim_customer (data warehouse core)
├─ MEDIUM: silver_customers (internal)
└─ Source: Fix is at bronze level (add deduplication)
Step 4: Execute fix
└─ Change bronze deduplication logic
└─ Rerun silver/gold transformations
└─ Notify downstream consumers:
├─ Dashboard owners: "Data updated, refresh cache"
├─ Report users: "Results may differ, rerun queries"
└─ ML engineers: "Model may need retraining"
Step 5: Verification
└─ Confirm: All 40 downstream assets show correct results
Compliance & Lineage
GDPR Right to Be Forgotten (Lineage Required)
Q: User requests deletion (right to be forgotten)
Step 1: Find user data
└─ Query: "Find all PII for user_id = 12345"
├─ dim_customer.email
├─ fact_sales.customer_id
├─ ML model features (derived from customer_id)
└─ Backup tables (7-day retention)
Step 2: Find downstream data
└─ Lineage question: "Where else is this user's data?"
├─ Aggregations: "Revenue by customer" (includes customer 12345)
├─ ML Models: "Churn model trained on customer 12345"
├─ Reports: "Top 100 customers (may include 12345)"
└─ Archives: Backups created during 7-day window
Step 3: Delete consistently
└─ Delete from all locations:
├─ Source system (Salesforce)
├─ Bronze layer (append-only, add deletion flag)
├─ Silver layer (recompute without user 12345)
├─ Gold layer (recompute aggregations)
├─ ML models (retrain or flag as stale)
├─ Backups (overwrite older versions)
└─ Data quality logs (remove from audit)
Step 4: Verify deletion
└─ Query all systems: "SELECT * WHERE user_id = 12345"
Result: No rows (except immutable audit logs with "DELETED" flag)
Without lineage:
❌ Don't know all locations where data exists
❌ Leave PII in backups/models
❌ Non-compliance with GDPR (potential fines)
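The deletion sweep in Steps 3-4 can be driven directly by the lineage walk. A minimal sketch: `delete_fn` stands in for each system's real deletion API, and every name here is hypothetical:

```python
def forget_user(user_id, locations, delete_fn):
    """Delete (or flag) a user's data at every lineage-discovered location,
    then report anything that could not be purged."""
    results = {loc: delete_fn(loc, user_id) for loc in locations}
    remaining = [loc for loc, ok in results.items() if not ok]
    return {"deleted": len(results) - len(remaining), "remaining": remaining}

# Locations as discovered by the lineage walk (hypothetical names).
locations = ["salesforce", "bronze_customers", "silver_customers",
             "gold_dim_customer", "backups", "audit_logs"]

def fake_delete(loc, uid):
    return loc != "audit_logs"  # audit logs are immutable: flag, don't delete

print(forget_user(12345, locations, fake_delete))
# {'deleted': 5, 'remaining': ['audit_logs']}
```

Anything left in `remaining` (here, the immutable audit log) is exactly what Step 4's verification query must explain with a "DELETED" flag.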
Lineage Tools
| Tool | Type | Strengths |
|---|---|---|
| dbt docs | Open source | Automatic from dbt, free, integrated |
| Collibra | Enterprise | UI, governance, manual curation |
| Apache Atlas | Open source | Big Data ecosystems, Hadoop/Spark |
| Alation | Enterprise | AI-driven discovery, crowd-sourced |
| OpenMetadata | Open source | Multi-tool lineage, emerging standard |
| Cloud native | Cloud | BigQuery lineage, Snowflake ACCESS_HISTORY |
Lineage Best Practices
✅ Auto-capture where possible (dbt, Airflow, cloud-native)
✅ Start with dbt (free, built-in, covers 80% of use cases)
✅ Maintain manual docs for non-dbt pipelines
✅ Regular audits (quarterly: is lineage still accurate?)
✅ Impact analysis before changes (prevent breaking changes)
✅ Archive lineage (keep historical versions for compliance)
❌ Don't rely on memory (lineage changes, documentation goes stale)
❌ Don't ignore indirect dependencies (cascading failures)
❌ Don't delete lineage data (needed for compliance/audits)
Key Takeaways
- Lineage tracks data flow from source to BI/ML
- Upstream lineage answers "where did it come from?"
- Downstream lineage answers "what depends on this?"
- dbt is easiest way to capture lineage (free, automatic)
- Impact analysis prevents breaking changes
- Compliance requires lineage (GDPR, CCPA, audits)
- OpenLineage is emerging standard for cross-tool lineage