9.03 · Deep Dive: Data Lineage & Impact Analysis

Level: Advanced · Time to read: 15 min · Pre-reading: 09 · Enterprise Data Management. After reading: you'll understand lineage tracking, impact analysis, compliance automation, and how to trace data dependencies.


Data Lineage: The Complete Picture

Data Lineage = Complete record of how data flows through your systems

Source System → Extract → Transform → Load → BI Tool
     ↓                                          ↓
Operational      (lineage tracks every step)  Dashboard
Database

Questions lineage answers:
✅ Where did this data come from?
✅ How many transformations happened?
✅ What happens if I delete this?
✅ Who can see this data?
✅ When was it last updated?

Types of Lineage

Upstream Lineage (What feeds into this?)

gold_revenue_by_product
    ├─ fact_sales
    │   ↑
    │   ├─ silver_orders
    │   │   ↑
    │   │   └─ bronze_orders
    │   │       ↑
    │   │       └─ source_ecommerce.orders
    │   │
    │   └─ silver_customers
    │       ↑
    │       └─ bronze_customers
    │           ↑
    │           └─ source_salesforce.accounts
    └─ dim_product
        └─ bronze_products
            └─ source_ecommerce.products

Depth: 3 transformation layers (bronze → silver → gold), 3 source systems

Downstream Lineage (What depends on this?)

bronze_customers
    ├─ silver_customers
    │   ↓
    │   ├─ gold_dim_customer
    │   │   ↓
    │   │   ├─ Dashboard: Customer Overview
    │   │   ├─ Report: Customer Segmentation
    │   │   └─ Fact table: fact_sales
    │   │       ↓
    │   │       ├─ Dashboard: Sales Performance
    │   │       └─ ML Model: Churn Prediction
    │   │
    │   └─ Metadata catalog entry
    └─ Data quality dashboard

Impact: a change to bronze_customers cascades to every asset in the tree above — dashboards, reports, a fact table, and an ML model
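Both directions are just graph traversals. A minimal sketch in Python, assuming lineage is stored as a flat edge list (table names are borrowed from the diagrams above, trimmed to one path per layer):

```python
from collections import deque

# Lineage stored as (upstream, downstream) edges; names follow the
# diagrams above, trimmed to a single path per layer.
EDGES = [
    ("source_salesforce.accounts", "bronze_customers"),
    ("bronze_customers", "silver_customers"),
    ("silver_customers", "gold_dim_customer"),
    ("gold_dim_customer", "fact_sales"),
    ("fact_sales", "ml_churn_prediction"),
]

def traverse(start, edges, direction="downstream"):
    """Breadth-first walk of the lineage graph in one direction."""
    graph = {}
    for up, down in edges:
        src, dst = (up, down) if direction == "downstream" else (down, up)
        graph.setdefault(src, []).append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(traverse("bronze_customers", EDGES))               # downstream impact
print(traverse("gold_dim_customer", EDGES, "upstream"))  # upstream sources
```

The same traversal answers both questions; only the edge direction flips.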

Lineage Capture Methods

Method 1: Manual Documentation (YAML)

# docs/data_lineage.yml
tables:
  gold_dim_customer:
    sources:
      - silver_customers (primary)
      - silver_customer_segments (enrichment)
    transformations:
      - Type conversions (email to lowercase)
      - Joins (geography dimension)
      - SCD Type 2 tracking
    downstream:
      - Dashboard: Customer Overview
      - Dashboard: Segment Analysis
      - Fact table: fact_sales
    owner: VP Sales
    certified: true
    last_updated: 2024-04-23

# Problems:
# ❌ Manually updated (stale)
# ❌ No automation (errors)
# ❌ No source control (what changed?)
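The staleness problem can at least be detected automatically. A sketch, assuming the YAML above has been loaded into a dict (the `manifest` literal and the 90-day threshold are illustrative):

```python
from datetime import date, timedelta

# Hypothetical in-memory copy of docs/data_lineage.yml (normally loaded
# with a YAML parser such as PyYAML).
manifest = {
    "gold_dim_customer": {
        "owner": "VP Sales",
        "certified": True,
        "last_updated": date(2024, 4, 23),
    },
}

def stale_entries(manifest, today, max_age_days=90):
    """Flag manually maintained lineage docs older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [name for name, meta in manifest.items()
            if meta["last_updated"] < cutoff]

print(stale_entries(manifest, today=date(2024, 9, 1)))  # → ['gold_dim_customer']
```

A check like this can run in CI, turning "manually updated (stale)" from a silent failure into a visible one.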
Method 2: dbt (Automatic)

-- dbt automatically tracks lineage through ref() and source()

-- models/gold/gold_dim_customer.sql
{{
  config(
    materialized='table',
    tags=['daily', 'critical']
  )
}}

SELECT
  c.customer_id,
  c.email,
  s.segment,
  g.country
FROM {{ ref('silver_customers') }} c  -- Lineage: silver_customers → this model
LEFT JOIN {{ ref('silver_customer_segments') }} s
  ON c.customer_id = s.customer_id
LEFT JOIN {{ ref('dim_geography') }} g
  ON c.country_code = g.country_code

-- dbt generates:
--
-- Lineage graph:
--   source.ecommerce.customers → silver_customers → gold_dim_customer
--   silver_customer_segments → gold_dim_customer
--   dim_geography → gold_dim_customer
--
-- Dependencies:
--   upstream: 3 sources/models
--   downstream: all models that ref() this model
--
-- Visualization:
--   dbt docs shows the DAG automatically
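Beyond `dbt docs`, the lineage dbt captures is machine-readable: `dbt compile` writes it to `target/manifest.json`, where each node lists its parents under `depends_on.nodes`. A sketch against a trimmed, hand-written stand-in for that file:

```python
import json

# Trimmed stand-in for target/manifest.json; in a real project:
#   manifest = json.load(open("target/manifest.json"))
manifest = {
    "nodes": {
        "model.proj.gold_dim_customer": {
            "depends_on": {"nodes": [
                "model.proj.silver_customers",
                "model.proj.silver_customer_segments",
                "model.proj.dim_geography",
            ]},
        },
    },
}

def upstream_of(manifest, model_id):
    """Direct parents recorded by dbt's ref()/source() calls."""
    return manifest["nodes"][model_id]["depends_on"]["nodes"]

print(upstream_of(manifest, "model.proj.gold_dim_customer"))
```

From the command line, dbt's selector syntax answers the same questions: `dbt ls --select +gold_dim_customer` lists everything upstream, and `dbt ls --select gold_dim_customer+` everything downstream.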

Method 3: OpenLineage (Platform Agnostic)

# OpenLineage: Standard for lineage across tools
# Works with Airflow, Spark, dbt, Fivetran, etc.

from openlineage.client import OpenLineageClient
from openlineage.client.run import (
    Run, Job, RunEvent, RunState, InputDataset, OutputDataset,
)

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a Marquez backend

# When the Spark job runs, emit an event describing its inputs and outputs
# (run IDs are UUIDs in practice; facets such as parent-run links omitted):
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime="2024-04-23T10:00:00Z",
    run=Run(runId="job-123-run-001"),
    job=Job(
        namespace="spark://cluster",
        name="transform_orders"
    ),
    inputs=[
        InputDataset(
            namespace="snowflake://prod",
            name="source_orders"
        )
    ],
    outputs=[
        OutputDataset(
            namespace="snowflake://prod",
            name="silver_orders"
        )
    ],
    producer="https://openlineage.io/producers/spark"
)

# Send to the lineage backend / data catalog
client.emit(event)

# Result: Automatic lineage tracking across all tools

Impact Analysis: "What if I change X?"

Scenario: Column Rename

Q: What if I rename customer_email → email?

Impact Analysis:
├─ Direct dependents:
│  ├─ Dashboard: Customer Overview (uses customer_email)
│  ├─ Report: Email Marketing (uses customer_email)
│  ├─ Fact table: fact_sales (joins using customer_email)
│  ├─ ML Model: Churn Predictor (feature: customer_email)
│  └─ 3 other models
├─ Indirect dependents:
│  ├─ Executive Dashboard (depends on fact_sales)
│  ├─ ML API (serves churn model)
│  └─ 12 other downstream assets
├─ Risk Assessment:
│  ├─ HIGH: Production dashboards will break
│  ├─ MEDIUM: ML model needs retraining
│  └─ LOW: Internal reports need updates
└─ Recommendation:
   ├─ Create alias: customer_email = email (backward compatible)
   ├─ 3-month deprecation period (notify users)
   ├─ Cut over completely only after all downstream consumers update
   └─ Avoid breaking changes to production

Automated Impact Queries

-- Query: what's affected by this change?
-- (Pseudocode: assumes a `catalog` API exposing lineage references.)
CREATE PROCEDURE impact_analysis(
  table_name STRING,
  column_name STRING
)
LANGUAGE PYTHON
AS $$
  # Step 1: Find direct dependents
  direct = catalog.get_references(table_name, column_name)
  # e.g. 8 models, 5 dashboards, 2 reports

  # Step 2: Find indirect dependents (transitive)
  indirect = set()
  for dependent in direct:
    indirect.update(catalog.get_references(dependent.name))

  # Step 3: Classify by risk
  high_risk = []
  for asset in list(direct) + list(indirect):
    risk = classify_risk(asset)
    # CRITICAL: production dashboards, ML models
    # HIGH: reports
    # MEDIUM: data quality checks
    # LOW: metadata only
    if risk in ("CRITICAL", "HIGH"):
      high_risk.append(asset)

  # Step 4: Generate report
  print(f"Impact of {table_name}.{column_name}:")
  print(f"  Direct: {len(direct)} assets")
  print(f"  Indirect: {len(indirect)} assets")
  print(f"  High risk: {len(high_risk)} assets need urgent updates")
$$

-- Run impact analysis before making changes
CALL impact_analysis('dim_customer', 'email');
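The classify_risk step in the procedure can be sketched as a plain lookup from asset type to risk tier (the tiers follow the comments in the procedure; the asset names and types here are hypothetical):

```python
# Hypothetical mapping from asset type to risk tier, per the tiers above.
RISK_BY_TYPE = {
    "dashboard": "CRITICAL",
    "ml_model": "CRITICAL",
    "report": "HIGH",
    "dq_check": "MEDIUM",
    "metadata": "LOW",
}

def classify_risk(asset_type):
    """Map an asset type to a risk tier; unknown types default to LOW."""
    return RISK_BY_TYPE.get(asset_type, "LOW")

# Group the affected assets by tier for the final report.
affected = [("Customer Overview", "dashboard"), ("Email Marketing", "report"),
            ("Churn Predictor", "ml_model"), ("null-check: email", "dq_check")]
summary = {}
for name, kind in affected:
    summary.setdefault(classify_risk(kind), []).append(name)
print(summary)
```

Keeping the mapping in one place means the impact report and any alerting share the same definition of "critical".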

Lineage in Action: Incident Response

Scenario: Bad Data in Gold Layer

Incident: gold_dim_customer has 10K duplicate records

Step 1: Identify source
  └─ Lineage question: "Where did these come from?"
     ├─ gold_dim_customer ← silver_customers ← bronze_customers ← source_salesforce
     └─ Traced to: bronze_customers (duplicates from source system)

Step 2: Assess impact
  └─ Lineage question: "What's affected?"
     ├─ Direct: 5 dashboards using gold_dim_customer
     ├─ Indirect: 15 reports depending on dashboards
     ├─ Critical: fact_sales (used by 20+ queries)
     └─ Total: 40 downstream assets affected

Step 3: Prioritize fix
  └─ Severity by criticality:
     ├─ CRITICAL: fact_sales (used by executives)
     ├─ HIGH: gold_dim_customer (data warehouse core)
     ├─ MEDIUM: silver_customers (internal)
     └─ Source: Fix is at bronze level (add deduplication)

Step 4: Execute fix
  ├─ Change bronze deduplication logic
  ├─ Rerun silver/gold transformations
  └─ Notify downstream consumers:
     ├─ Dashboard owners: "Data updated, refresh cache"
     ├─ Report users: "Results may differ, rerun queries"
     └─ ML engineers: "Model may need retraining"

Step 5: Verification
  └─ Confirm: All 40 downstream assets show correct results
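Step 3's prioritization is just a sort of the affected assets by criticality. A minimal sketch, with a hypothetical asset register:

```python
# Hypothetical asset register: criticality drives fix/verification order.
AFFECTED = [
    {"name": "silver_customers", "criticality": "MEDIUM"},
    {"name": "fact_sales", "criticality": "CRITICAL"},
    {"name": "gold_dim_customer", "criticality": "HIGH"},
]
ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def fix_queue(assets):
    """Order assets so the most critical are fixed and verified first."""
    return [a["name"] for a in sorted(assets, key=lambda a: ORDER[a["criticality"]])]

print(fix_queue(AFFECTED))  # → ['fact_sales', 'gold_dim_customer', 'silver_customers']
```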

Compliance & Lineage

GDPR Right to Be Forgotten (Lineage Required)

Q: User requests deletion (right to be forgotten)

Step 1: Find user data
  └─ Query: "Find all PII for user_id = 12345"
     ├─ dim_customer.email
     ├─ fact_sales.customer_id
     ├─ ML model features (derived from customer_id)
     └─ Backup tables (7-day retention)

Step 2: Find downstream data
  └─ Lineage question: "Where else is this user's data?"
     ├─ Aggregations: "Revenue by customer" (includes customer 12345)
     ├─ ML Models: "Churn model trained on customer 12345"
     ├─ Reports: "Top 100 customers (may include 12345)"
     └─ Archives: Backups created during 7-day window

Step 3: Delete consistently
  └─ Delete from all locations:
     ├─ Source system (Salesforce)
     ├─ Bronze layer (append-only, add deletion flag)
     ├─ Silver layer (recompute without user 12345)
     ├─ Gold layer (recompute aggregations)
     ├─ ML models (retrain or flag as stale)
     ├─ Backups (overwrite older versions)
     └─ Data quality logs (remove from audit)

Step 4: Verify deletion
  └─ Query all systems: "SELECT * WHERE user_id = 12345"
     Result: No rows (except immutable audit logs with "DELETED" flag)
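Step 4 can be automated once lineage supplies the full list of storage locations. A sketch, assuming each location exposes a row-count callable for a given user_id (all names here are illustrative):

```python
# Hypothetical: one callable per storage location, each returning the row
# count for a user_id; lineage supplies the list of locations to check.
def verify_erasure(locations, user_id):
    """Return the locations that still hold rows for user_id after deletion."""
    return [name for name, count_rows in locations.items()
            if count_rows(user_id) > 0]

locations = {
    "dim_customer": lambda uid: 0,
    "fact_sales": lambda uid: 0,
    "backup_2024_04": lambda uid: 1,  # stale backup still holds PII
}
print(verify_erasure(locations, user_id=12345))  # → ['backup_2024_04']
```

An empty result is the compliance evidence; a non-empty one is exactly the "PII left in backups" failure described below.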

Without lineage:
  ❌ Don't know all locations where data exists
  ❌ Leave PII in backups/models
  ❌ Non-compliance with GDPR (potential fines)

Lineage Tools

Tool         | Type        | Strengths
------------ | ----------- | --------------------------------------------
dbt docs     | Open source | Automatic from dbt, free, integrated
Collibra     | Enterprise  | UI, governance, manual curation
Apache Atlas | Open source | Big-data ecosystems, Hadoop/Spark
Alation      | Enterprise  | AI-driven discovery, crowd-sourced
OpenMetadata | Open source | Multi-tool lineage, emerging standard
Cloud native | Cloud       | BigQuery lineage, Snowflake ACCESS_HISTORY

Lineage Best Practices

✅ Auto-capture where possible (dbt, Airflow, cloud-native)
✅ Start with dbt (free, built-in, covers 80% of use cases)
✅ Maintain manual docs for non-dbt pipelines
✅ Audit regularly (quarterly: is lineage still accurate?)
✅ Run impact analysis before changes (prevent breaking changes)
✅ Archive lineage (keep historical versions for compliance)

❌ Don't rely on memory (lineage changes, documentation goes stale)
❌ Don't ignore indirect dependencies (cascading failures)
❌ Don't delete lineage data (needed for compliance/audits)


Key Takeaways

  1. Lineage tracks data flow from source to BI/ML
  2. Upstream lineage answers "where did it come from?"
  3. Downstream lineage answers "what depends on this?"
  4. dbt is easiest way to capture lineage (free, automatic)
  5. Impact analysis prevents breaking changes
  6. Compliance requires lineage (GDPR, CCPA, audits)
  7. OpenLineage is emerging standard for cross-tool lineage