09 · Enterprise Data Management: Governance, Lineage, Quality, Security

Level: Advanced · Time to read: 22 min
After reading: You'll understand how to manage data warehouses at organizational scale: governance frameworks, data quality, lineage tracking, and security controls.


The Enterprise Problem: Data As An Asset

Startup Phase (Year 0):
└─ 1 analyst, 1 data warehouse, "it works!"

Growth Phase (Year 2):
├─ 50 analysts
├─ 5 warehouses (BigQuery, Snowflake, Postgres)
├─ 100+ data models
├─ Questions:
│  ├─ Who owns this data?
│  ├─ Is this data quality certified?
│  ├─ Does PII need masking?
│  ├─ Which models depend on this table?
│  └─ Why did my query cost $50K this month?
└─ No answers → chaos!

Solution: Enterprise Data Management framework.


Data Governance: Who Decides What?

Governance Framework: RACI Model

Component      Responsible         Accountable      Consulted          Informed
Data quality   Data quality team   Data owner       Technical lead     All users
Data access    Security team       Data owner       Privacy officer    Requesters
Metadata       Data catalog team   Business owner   Analytics lead     All teams
Cost control   Finance             Platform lead    Data consumers     Executives
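
To make the matrix enforceable rather than a slide, it can be encoded as data that tooling (access-request bots, escalation routing) queries. A minimal Python sketch; the role strings mirror the table above, and the helper name is illustrative:

# Python: RACI lookup (role strings taken from the table above)
RACI = {
    "data_quality": {"R": "Data quality team", "A": "Data owner",
                     "C": "Technical lead", "I": "All users"},
    "data_access":  {"R": "Security team", "A": "Data owner",
                     "C": "Privacy officer", "I": "Requesters"},
    "metadata":     {"R": "Data catalog team", "A": "Business owner",
                     "C": "Analytics lead", "I": "All teams"},
    "cost_control": {"R": "Finance", "A": "Platform lead",
                     "C": "Data consumers", "I": "Executives"},
}

def who_is_accountable(component):
    """Exactly one party is accountable per governance component."""
    return RACI[component]["A"]

print(who_is_accountable("data_access"))  # -> Data owner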

Role Definitions

Data Owner (Business/Analytics Lead)

  • Defines business rules
  • Certifies data quality
  • Approves access requests
  • Example: "VP Sales owns the customers dimension"

Data Steward (Data Team)

  • Implements owner's policies
  • Produces and maintains data
  • Responds to data quality issues
  • Example: "Data engineer implements customer dimension"

Data Consumer (Business Users)

  • Uses data for analysis
  • Reports data quality issues
  • Complies with access policies
  • Example: "Marketing analyst uses customer dimension"

Data Quality Framework

Three Levels of Quality

Level 1: Bronze Layer (Raw)
  └─ Quality: Unknown (as-is from source)

Level 2: Silver Layer (Validated)
  └─ Quality: Certified (passed checks)
    ├─ Uniqueness: customer_id is unique
    ├─ Completeness: 99.9%+ non-null for required fields
    ├─ Accuracy: Spot checks against source
    └─ Timeliness: Loaded within 4 hours

Level 3: Gold Layer (Business-Ready)
  └─ Quality: SLA guaranteed
    ├─ Accuracy: 99.95%+ (tested nightly)
    ├─ Freshness: Loaded by 6 AM daily
    ├─ Completeness: 100% for business-critical fields
    └─ Consistency: Matches source within tolerance
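
The promotion from one layer to the next can be gated in code. A minimal sketch of a Bronze-to-Silver quality gate using the Silver thresholds above; the function and threshold names are illustrative:

# Python: Bronze-to-Silver promotion gate (thresholds from the Silver level above)
SILVER_THRESHOLDS = {
    "uniqueness_pct": 99.9,    # key column must be (near-)unique
    "completeness_pct": 99.9,  # non-null rate for required fields
    "max_freshness_hours": 4,  # loaded within 4 hours
}

def passes_silver(measured):
    """measured: dict of observed values for one table."""
    return (measured["uniqueness_pct"] >= SILVER_THRESHOLDS["uniqueness_pct"]
            and measured["completeness_pct"] >= SILVER_THRESHOLDS["completeness_pct"]
            and measured["freshness_hours"] <= SILVER_THRESHOLDS["max_freshness_hours"])

print(passes_silver({"uniqueness_pct": 100.0,
                     "completeness_pct": 99.95,
                     "freshness_hours": 2}))  # -> True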

Implementing Quality Checks

# dbt: Automated quality tests
version: 2

models:
  - name: gold_fact_sales
    description: "Daily sales facts (SLA: 99.95% accuracy)"

    columns:
      - name: sale_id
        tests:
          - unique
          - not_null

      - name: customer_id
        tests:
          - not_null
          - relationships:  # Foreign key check
              to: ref('gold_dim_customer')
              field: customer_id

      - name: amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000  # Sanity check

      - name: sale_date
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:  # date range check
              min_value: "'2023-01-01'"
              max_value: "'2024-12-31'"

    tests:
      - dbt_utils.recency:
          datepart: day
          field: sale_date
          interval: 1  # Data loaded within 1 day

Quality Metrics: The Dashboard

-- Data Quality Scorecard (daily)
SELECT
  table_name,
  row_count,
  null_percent,
  duplicate_percent,
  freshness_hours,
  quality_score,
  quality_status
FROM data_quality_metrics
ORDER BY quality_score ASC;  -- Worst tables first; alert on anything below 95

-- Output:
-- | silver_customers | 10M  | 5.0% | 2.0% | 8hrs | 88.5% | ⚠️ WARN |
-- | gold_dim_store   | 5K   | 0.2% | 0.0% | 3hrs | 99.7% | ✅ PASS |
-- | gold_fact_sales  | 125M | 0.1% | 0.0% | 2hrs | 99.8% | ✅ PASS |
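
Where do these metrics come from? A short Python sketch of how the null rate, duplicate rate, freshness, and a composite score could be derived for one table; the scoring formula and the 95-point cutoff are illustrative assumptions, not a standard:

# Python: Derive scorecard metrics for one table (toy scoring formula)
from datetime import datetime, timezone

def quality_metrics(rows, key, loaded_at):
    """rows: list of dicts; key: unique-key column; loaded_at: last load (UTC)."""
    total = len(rows) or 1
    nulls = sum(1 for r in rows if any(v is None for v in r.values()))
    dupes = len(rows) - len({r[key] for r in rows})
    freshness_h = (datetime.now(timezone.utc) - loaded_at).total_seconds() / 3600
    null_pct = 100 * nulls / total
    dupe_pct = 100 * dupes / total
    # Penalize nulls, duplicates, and staleness beyond a 4-hour budget
    score = 100 - null_pct - dupe_pct - max(0.0, freshness_h - 4)
    return {"null_percent": null_pct, "duplicate_percent": dupe_pct,
            "freshness_hours": freshness_h, "quality_score": score,
            "quality_status": "PASS" if score >= 95 else "WARN"}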

Data Lineage: Tracking Origins

Why Lineage Matters

Bug found: Customer revenue is 10% too high in reports

Investigation:
  1. Check Gold layer (fact_sales) ✓ data looks correct
  2. Check Silver layer (joined dimensions) ✓ dimensions correct
  3. Check Bronze layer (extracted data) ✗ Found double-counting!
    → Source system has duplicate orders since 2024-01-15

Without lineage:
  └─ Cannot trace issue back to source (hours of debugging)

With lineage:
  └─ Click "show lineage" → immediately see Bronze has duplicates

Building Lineage: Data Catalog

dbt (Data Build Tool) provides lineage automatically:

┌─────────────────┐    ┌─────────────────┐
│  raw.customers  │    │   raw.orders    │
└────────┬────────┘    └────────┬────────┘
         │                      │
┌────────▼──────────┐  ┌────────▼──────────┐
│ stg_customers     │  │ stg_orders        │  ← Staging layer
│ (clean, validate) │  │ (clean, validate) │
└────────┬──────────┘  └────────┬──────────┘
         └──────────┬───────────┘
                    │
           ┌────────▼──────────┐
           │ fact_sales        │  ← Gold fact table
           │ (denormalized)    │
           └────────┬──────────┘
             ┌──────┴──────┐
             │             │
       ┌─────▼────┐  ┌─────▼───────┐
       │ Looker   │  │ Jupyter NB  │
       └──────────┘  └─────────────┘

dbt produces:

  • Lineage graph (visual DAG)
  • Impact analysis ("What breaks if I change this table?")
  • Documentation with column-level lineage
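
Impact analysis is graph traversal over the lineage DAG. A minimal Python sketch using a hypothetical edge list modeled on the diagram above (dbt derives the real graph from ref() calls):

# Python: "What breaks if I change this table?" over a lineage DAG
LINEAGE = {
    "raw.customers": ["stg_customers"],
    "raw.orders":    ["stg_orders"],
    "stg_customers": ["fact_sales"],
    "stg_orders":    ["fact_sales"],
    "fact_sales":    ["Looker", "Jupyter NB"],
}

def downstream(node, graph=LINEAGE):
    """Everything affected if `node` changes (transitive closure)."""
    hit = set()
    stack = [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in hit:
                hit.add(child)
                stack.append(child)
    return hit

print(downstream("stg_customers"))
# -> {'fact_sales', 'Looker', 'Jupyter NB'}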

Metadata Management: The Single Source of Truth

What is Metadata?

Data = "Alice, 123 Main St, Alice@example.com"
Metadata = "Alice is a String (name), St is Address (location), email..."

Metadata Catalog (Collibra, DataHub, Apache Atlas)

Table: gold_dim_customer
├─ Owner: VP Sales
├─ Freshness: Daily 2 AM
├─ Accuracy: 99.95%
├─ PII: Yes (email, address)
├─ Sensitivity: Internal only
├─ Documentation: "Customer master dimension, SCD Type 2"
├─ Columns:
│  ├─ customer_id (INT, NOT NULL) — Business key
│  ├─ customer_name (VARCHAR 500) — Full name
│  ├─ email (VARCHAR) — PII, encrypt at rest
│  └─ segment (VARCHAR) — 'Premium' | 'Standard' | 'At-Risk'
├─ Upstream:
│  ├─ silver.customers
│  ├─ silver.support_tiers
│  └─ silver.marketing_segments
└─ Downstream:
   ├─ fact_sales (joined for revenue reports)
   ├─ fact_support (joined for NPS analysis)
   └─ Looker dashboard "Sales by Segment"
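
A catalog entry like this is just structured data, which means policies can be enforced programmatically. A sketch of the entry as typed records; the class and field names are illustrative, not a catalog API:

# Python: A catalog entry as typed records (illustrative, not a catalog API)
from dataclasses import dataclass, field

@dataclass
class ColumnMeta:
    name: str
    dtype: str
    pii: bool = False

@dataclass
class TableMeta:
    name: str
    owner: str
    sensitivity: str
    columns: list = field(default_factory=list)

    def pii_columns(self):
        """Columns that must be masked/encrypted per policy."""
        return [c.name for c in self.columns if c.pii]

dim_customer = TableMeta(
    name="gold_dim_customer", owner="VP Sales", sensitivity="internal",
    columns=[ColumnMeta("customer_id", "INT"),
             ColumnMeta("email", "VARCHAR", pii=True)])
print(dim_customer.pii_columns())  # -> ['email']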

Benefits

Self-service: Users can find data without asking
Trust: See owner, quality, freshness in one place
Governance: Enforce PII masking, retention policies
Impact analysis: Know what breaks if you change something


Data Security & Privacy: Protecting PII

Classification: What Data Needs Protection?

PII (Personally Identifiable Information):
  └─ Examples: name, email, phone, SSN, address, IP address
  └─ Rules: Encrypt at rest, mask in non-prod environments, audit all access

PHI (Protected Health Info):
  └─ Examples: medical records, diagnosis, treatment
  └─ Rules: HIPAA compliance required

Financial Data:
  └─ Examples: credit card, salary, transactions
  └─ Rules: SOX compliance, immutable logs

Public Data:
  └─ Examples: product catalog, published reports
  └─ Rules: No special protection needed
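
Classification only helps if it is applied consistently. A Python sketch of a name-based column classifier mapping columns to the classes above; the keyword heuristic is a simplifying assumption (real catalogs tag columns explicitly):

# Python: Name-based column classifier (heuristic; real catalogs tag explicitly)
CLASS_RULES = {
    "PII":       ("name", "email", "phone", "ssn", "address", "ip"),
    "FINANCIAL": ("card", "salary", "transaction"),
}

def classify(column):
    col = column.lower()
    for cls, keywords in CLASS_RULES.items():
        if any(k in col for k in keywords):
            return cls
    return "PUBLIC"

print(classify("customer_email"))  # -> PII
print(classify("product_title"))   # -> PUBLIC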

Implementation: Access Control

-- Option 1: Row-level security (RLS, Postgres-style)
ALTER TABLE gold_dim_customer ENABLE ROW LEVEL SECURITY;

CREATE POLICY customer_privacy ON gold_dim_customer
  USING (customer_id IN (
    SELECT customer_id FROM user_access_permissions
    WHERE user_id = current_user_id()  -- application-defined helper
  ));

-- Query: Even with SELECT *, only sees authorized rows
SELECT * FROM gold_dim_customer;
-- Result: Only customers the user is authorized to see

Implementation: Column-Level Security

-- Option 2: Mask sensitive columns (fail closed: default to masked)
CREATE VIEW gold_dim_customer_safe AS
SELECT
  customer_id,
  customer_name,
  CASE
    WHEN CURRENT_ROLE() = 'ADMIN'  -- Snowflake-style; engines vary
    THEN email
    ELSE '***@example.com'
  END AS email,
  segment
FROM gold_dim_customer;

-- Analyst sees masked email
-- Admin sees real email

Implementation: Data Redaction

# Python: Redact SSN in logs
import re

def redact_pii(text):
    ssn_pattern = r'\d{3}-\d{2}-\d{4}'
    return re.sub(ssn_pattern, 'XXX-XX-XXXX', text)

# Before: "Customer 123 with SSN 123-45-6789"
# After: "Customer 123 with SSN XXX-XX-XXXX"

Cost Management: Following the Money

Cost Breakdown: Where Does Money Go?

BigQuery Project: $100K/month

├─ Data scanning: $75K (majority)
│  └─ 12 PB (~12,000 TB) scanned × $6.25/TB = $75K
│  └─ Cause: Inefficient queries, no partition pruning
├─ Storage: $15K
│  └─ 2.4 PB × $6.25/TB/month = $15K
│  └─ Cause: Not deleting stale data
├─ Slots (reserved compute): $8K
│  └─ ~275 slots reserved × $0.04/slot/hour × 730 hrs ≈ $8K
│  └─ Cause: Predictable workload, reserved beats on-demand
└─ Export/egress: $2K
   └─ Data transferred OUT of Google Cloud
   └─ Cause: Exporting to data lakes

Optimization Strategies

Strategy                                             Impact                   Effort
Partition pruning (WHERE on partition column)        -40% scan cost           Low
Column selection (SELECT only needed columns)        -30% scan cost           Low
Clustering on foreign keys                           -20% scan cost           Medium
Materialized views (pre-aggregate common queries)    -50% for those queries   Medium
Data retention policy (delete > 1 year old)          -20% storage             Low
Reserved slots (annual commitment)                   -40% compute cost        Medium

Cost Monitoring

-- BigQuery: Monitor project-level costs
SELECT
  DATE(creation_time) AS query_date,
  user_email,
  COUNT(*) AS num_queries,
  ROUND(SUM(total_bytes_billed) / 1e9, 2) AS total_gb_billed,
  ROUND(SUM(total_bytes_billed) / 1e12 * 6.25, 2) AS est_cost_usd  -- $6.25/TB on-demand
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY 5 DESC;

-- Alert: If user_email's costs spike, investigate
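
The alert itself can be a simple spike detector over per-user daily spend, fed by the query above. A Python sketch; the 3x threshold is an illustrative assumption:

# Python: Flag users whose latest daily spend jumps above their trailing average
from statistics import mean

def cost_spikes(daily_costs, factor=3.0):
    """daily_costs: {user_email: [day1_usd, day2_usd, ...]}"""
    alerts = []
    for user, costs in daily_costs.items():
        if len(costs) < 2:
            continue  # need a baseline
        baseline = mean(costs[:-1])
        if baseline > 0 and costs[-1] > factor * baseline:
            alerts.append((user, costs[-1], round(baseline, 2)))
    return alerts

print(cost_spikes({"alice@corp.com": [12.0, 15.0, 11.0, 120.0]}))
# -> [('alice@corp.com', 120.0, 12.67)]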

Data Retention & Archival: Managing History

Retention Policy:

Gold layer: 5 years (SLA required)
Silver layer: 2 years (for debugging)
Bronze layer: 90 days (retaining raw history is cost-prohibitive)

Why?:
  - Gold: Business/legal requirements (5-year audits)
  - Silver: Good enough for recent debugging (2 years)
  - Bronze: Cost savings (raw data is cheap to re-extract from source)
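
The policy itself is worth keeping in code or config so every cleanup job, like the SQL implementation below, derives its cutoff from one place. A minimal Python sketch:

# Python: Retention policy as data; each cleanup job derives its cutoff here
from datetime import date, timedelta

RETENTION_DAYS = {"gold": 5 * 365, "silver": 2 * 365, "bronze": 90}

def cutoff(layer, today=None):
    """Rows older than this date are eligible for archive/delete."""
    today = today or date.today()
    return today - timedelta(days=RETENTION_DAYS[layer])

print(cutoff("bronze", date(2024, 6, 1)))  # -> 2024-03-03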

Implementation

-- Automated archival: Move old data to cheap storage

-- Month-end job: Archive old partitions

-- Step 1: Export partitions older than 5 years to Cloud Storage (cheaper)
EXPORT DATA OPTIONS(
  uri = 'gs://archive-bucket/fact_sales/*.parquet',
  format = 'PARQUET',
  compression = 'SNAPPY'  -- compress before archival
) AS
SELECT * FROM gold.fact_sales
WHERE sale_date < DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR);

-- Step 2: Only after the export succeeds, delete the archived partitions
DELETE FROM gold.fact_sales
WHERE sale_date < DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR);

Summary: Enterprise Data Management Stack

graph TB
    A["Data Governance<br/>RACI Framework"]
    B["Data Quality<br/>dbt Tests"]
    C["Lineage Tracking<br/>dbt + Data Catalog"]
    D["Metadata Management<br/>Collateral/Datahub"]
    E["Security & Privacy<br/>RLS, Masking"]
    F["Cost Management<br/>Monitoring"]

    A -.-> B
    B -.-> C
    C -.-> D
    D -.-> E
    E -.-> F

Deep-Dives

Deep Dive: Data Governance — RACI, ownership models, policy enforcement.

Deep Dive: Data Quality — Test frameworks, SLAs, monitoring dashboards.

Deep Dive: Data Lineage — Lineage tools, impact analysis, compliance automation.


Key Takeaways

Data governance requires clear ownership (data owner vs. steward vs. consumer).

Quality is measured via tests and SLAs, not gut feeling.

Lineage tracking enables fast debugging and compliance.

Metadata catalog is the single source of truth.

Security & privacy require role-based access, masking, encryption.

Cost is manageable with monitoring, retention policies, optimization.


Practice Questions

  1. Define the RACI for a data table ownership model.
  2. Design a data quality framework with 3 levels (Bronze, Silver, Gold).
  3. How would you implement column-level security for PII?
  4. What metadata should be tracked for every table?
  5. How would you investigate the root cause of a data quality issue using lineage?