09 · Enterprise Data Management: Governance, Lineage, Quality, Security

Level: Advanced · Time to read: 22 min
After reading: You'll understand how to manage data warehouses at organizational scale: governance frameworks, data quality, lineage tracking, and security controls.


The Enterprise Problem: Data As An Asset

Startup Phase (Year 0):
└─ 1 analyst, 1 data warehouse, "it works!"

Growth Phase (Year 2):
├─ 50 analysts
├─ 5 warehouses (BigQuery, Snowflake, Postgres)
├─ 100+ data models
├─ Questions:
│  ├─ Who owns this data?
│  ├─ Is this data quality certified?
│  ├─ Does PII need masking?
│  ├─ Which models depend on this table?
│  └─ Why did my query cost $50K this month?
└─ No answers → chaos!

Solution: Enterprise Data Management framework.


Data Governance: Who Decides What?

Governance Framework: RACI Model

Component      Responsible         Accountable      Consulted          Informed
Data quality   Data quality team   Data owner       Technical lead     All users
Data access    Security team       Data owner       Privacy officer    Requesters
Metadata       Data catalog team   Business owner   Analytics lead     All teams
Cost control   Finance             Platform lead    Data consumers     Executives
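
To make the matrix enforceable rather than a slide, it can be encoded as data that tooling (access-request bots, escalation routing) queries. A minimal Python sketch; the role strings mirror the table above, and the helper name is illustrative:

# Python: RACI lookup (role strings taken from the table above)
RACI = {
    "data_quality": {"R": "Data quality team", "A": "Data owner",
                     "C": "Technical lead", "I": "All users"},
    "data_access":  {"R": "Security team", "A": "Data owner",
                     "C": "Privacy officer", "I": "Requesters"},
    "metadata":     {"R": "Data catalog team", "A": "Business owner",
                     "C": "Analytics lead", "I": "All teams"},
    "cost_control": {"R": "Finance", "A": "Platform lead",
                     "C": "Data consumers", "I": "Executives"},
}

def who_is_accountable(component):
    """Exactly one party is accountable per governance component."""
    return RACI[component]["A"]

print(who_is_accountable("data_access"))  # -> Data owner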

Role Definitions

Data Owner (Business/Analytics Lead)

  • Defines business rules
  • Certifies data quality
  • Approves access requests
  • Example: "VP Sales owns the customers dimension"

Data Steward (Data Team)

  • Implements owner's policies
  • Produces and maintains data
  • Responds to data quality issues
  • Example: "Data engineer implements customer dimension"

Data Consumer (Business Users)

  • Uses data for analysis
  • Reports data quality issues
  • Complies with access policies
  • Example: "Marketing analyst uses customer dimension"

Data Quality Framework

Three Levels of Quality

Level 1: Bronze Layer (Raw)
  └─ Quality: Unknown (as-is from source)

Level 2: Silver Layer (Validated)
  └─ Quality: Certified (passed checks)
    ├─ Uniqueness: customer_id is unique
    ├─ Completeness: 99.9%+ non-null for required fields
    ├─ Accuracy: Spot checks against source
    └─ Timeliness: Loaded within 4 hours

Level 3: Gold Layer (Business-Ready)
  └─ Quality: SLA guaranteed
    ├─ Accuracy: 99.95%+ (tested nightly)
    ├─ Freshness: Loaded by 6 AM daily
    ├─ Completeness: 100% for business-critical fields
    └─ Consistency: Matches source within tolerance
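
The promotion from one layer to the next can be gated in code. A minimal sketch of a Bronze-to-Silver quality gate using the Silver thresholds above; the function and threshold names are illustrative:

# Python: Bronze-to-Silver promotion gate (thresholds from the Silver level above)
SILVER_THRESHOLDS = {
    "uniqueness_pct": 99.9,    # key column must be (near-)unique
    "completeness_pct": 99.9,  # non-null rate for required fields
    "max_freshness_hours": 4,  # loaded within 4 hours
}

def passes_silver(measured):
    """measured: dict of observed values for one table."""
    return (measured["uniqueness_pct"] >= SILVER_THRESHOLDS["uniqueness_pct"]
            and measured["completeness_pct"] >= SILVER_THRESHOLDS["completeness_pct"]
            and measured["freshness_hours"] <= SILVER_THRESHOLDS["max_freshness_hours"])

print(passes_silver({"uniqueness_pct": 100.0,
                     "completeness_pct": 99.95,
                     "freshness_hours": 2}))  # -> True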

Implementing Quality Checks

# dbt: Automated quality tests
version: 2

models:
  - name: gold_fact_sales
    description: "Daily sales facts (SLA: 99.95% accuracy)"

    columns:
      - name: sale_id
        tests:
          - unique
          - not_null

      - name: customer_id
        tests:
          - not_null
          - relationships:  # Foreign key check
              to: ref('gold_dim_customer')
              field: customer_id

      - name: amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000  # Sanity check

      - name: sale_date
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:  # date range check
              min_value: "'2023-01-01'"
              max_value: "'2024-12-31'"

    tests:
      - dbt_utils.recency:
          datepart: day
          field: sale_date
          interval: 1  # Data loaded within 1 day

Quality Metrics: The Dashboard

-- Data Quality Scorecard (daily)
SELECT
  table_name,
  row_count,
  null_percent,
  duplicate_percent,
  freshness_hours,
  quality_score,
  quality_status
FROM data_quality_metrics
ORDER BY quality_score ASC;  -- Worst tables first; alert on anything below 95

-- Output:
-- | silver_customers | 10M  | 5.0% | 2.0% | 8hrs | 88.5% | ⚠️ WARN |
-- | gold_dim_store   | 5K   | 0.2% | 0.0% | 3hrs | 99.7% | ✅ PASS |
-- | gold_fact_sales  | 125M | 0.1% | 0.0% | 2hrs | 99.8% | ✅ PASS |
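
Where do these metrics come from? A short Python sketch of how the null rate, duplicate rate, freshness, and a composite score could be derived for one table; the scoring formula and the 95-point cutoff are illustrative assumptions, not a standard:

# Python: Derive scorecard metrics for one table (toy scoring formula)
from datetime import datetime, timezone

def quality_metrics(rows, key, loaded_at):
    """rows: list of dicts; key: unique-key column; loaded_at: last load (UTC)."""
    total = len(rows) or 1
    nulls = sum(1 for r in rows if any(v is None for v in r.values()))
    dupes = len(rows) - len({r[key] for r in rows})
    freshness_h = (datetime.now(timezone.utc) - loaded_at).total_seconds() / 3600
    null_pct = 100 * nulls / total
    dupe_pct = 100 * dupes / total
    # Penalize nulls, duplicates, and staleness beyond a 4-hour budget
    score = 100 - null_pct - dupe_pct - max(0.0, freshness_h - 4)
    return {"null_percent": null_pct, "duplicate_percent": dupe_pct,
            "freshness_hours": freshness_h, "quality_score": score,
            "quality_status": "PASS" if score >= 95 else "WARN"}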

Data Lineage: Tracking Origins

Why Lineage Matters

Bug found: Customer revenue is 10% too high in reports

Investigation:
  1. Check Gold layer (fact_sales) ✓ data looks correct
  2. Check Silver layer (joined dimensions) ✓ dimensions correct
  3. Check Bronze layer (extracted data) ✗ Found double-counting!
    → Source system has duplicate orders since 2024-01-15

Without lineage:
  └─ Cannot trace issue back to source (hours of debugging)

With lineage:
  └─ Click "show lineage" → immediately see Bronze has duplicates

Building Lineage: Data Catalog

dbt (Data Build Tool) provides lineage automatically:

┌─────────────────┐    ┌─────────────────┐
│  raw.customers  │    │   raw.orders    │
└────────┬────────┘    └────────┬────────┘
         │                      │
┌────────▼──────────┐  ┌────────▼──────────┐
│ stg_customers     │  │ stg_orders        │  ← Staging layer
│ (clean, validate) │  │ (clean, validate) │
└────────┬──────────┘  └────────┬──────────┘
         └──────────┬───────────┘
                    │
           ┌────────▼──────────┐
           │ fact_sales        │  ← Gold fact table
           │ (denormalized)    │
           └────────┬──────────┘
             ┌──────┴──────┐
             │             │
       ┌─────▼────┐  ┌─────▼───────┐
       │ Looker   │  │ Jupyter NB  │
       └──────────┘  └─────────────┘

dbt produces:

  • Lineage graph (visual DAG)
  • Impact analysis ("What breaks if I change this table?")
  • Documentation with column-level lineage
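
Impact analysis is graph traversal over the lineage DAG. A minimal Python sketch using a hypothetical edge list modeled on the diagram above (dbt derives the real graph from ref() calls):

# Python: "What breaks if I change this table?" over a lineage DAG
LINEAGE = {
    "raw.customers": ["stg_customers"],
    "raw.orders":    ["stg_orders"],
    "stg_customers": ["fact_sales"],
    "stg_orders":    ["fact_sales"],
    "fact_sales":    ["Looker", "Jupyter NB"],
}

def downstream(node, graph=LINEAGE):
    """Everything affected if `node` changes (transitive closure)."""
    hit = set()
    stack = [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in hit:
                hit.add(child)
                stack.append(child)
    return hit

print(downstream("stg_customers"))
# -> {'fact_sales', 'Looker', 'Jupyter NB'}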

Metadata Management: The Single Source of Truth

What is Metadata?

Data = "Alice, 123 Main St, Alice@example.com"
Metadata = "Alice is a String (name), St is Address (location), email..."

Metadata Catalog (Collibra, DataHub, Apache Atlas)

Table: gold_dim_customer
├─ Owner: VP Sales
├─ Freshness: Daily 2 AM
├─ Accuracy: 99.95%
├─ PII: Yes (email, address)
├─ Sensitivity: Internal only
├─ Documentation: "Customer master dimension, SCD Type 2"
├─ Columns:
│  ├─ customer_id (INT, NOT NULL) — Business key
│  ├─ customer_name (VARCHAR 500) — Full name
│  ├─ email (VARCHAR) — PII, encrypt at rest
│  └─ segment (VARCHAR) — 'Premium' | 'Standard' | 'At-Risk'
├─ Upstream:
│  ├─ silver.customers
│  ├─ silver.support_tiers
│  └─ silver.marketing_segments
└─ Downstream:
   ├─ fact_sales (joined for revenue reports)
   ├─ fact_support (joined for NPS analysis)
   └─ Looker dashboard "Sales by Segment"
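
A catalog entry like this is just structured data, which means policies can be enforced programmatically. A sketch of the entry as typed records; the class and field names are illustrative, not a catalog API:

# Python: A catalog entry as typed records (illustrative, not a catalog API)
from dataclasses import dataclass, field

@dataclass
class ColumnMeta:
    name: str
    dtype: str
    pii: bool = False

@dataclass
class TableMeta:
    name: str
    owner: str
    sensitivity: str
    columns: list = field(default_factory=list)

    def pii_columns(self):
        """Columns that must be masked/encrypted per policy."""
        return [c.name for c in self.columns if c.pii]

dim_customer = TableMeta(
    name="gold_dim_customer", owner="VP Sales", sensitivity="internal",
    columns=[ColumnMeta("customer_id", "INT"),
             ColumnMeta("email", "VARCHAR", pii=True)])
print(dim_customer.pii_columns())  # -> ['email']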

Benefits

Self-service: Users can find data without asking
Trust: See owner, quality, freshness in one place
Governance: Enforce PII masking, retention policies
Impact analysis: Know what breaks if you change something


Data Security & Privacy: Protecting PII

Classification: What Data Needs Protection?

PII (Personally Identifiable Information):
  └─ Examples: name, email, phone, SSN, address, IP address
  └─ Rules: Encrypt at rest, mask in non-prod environments, audit all access

PHI (Protected Health Info):
  └─ Examples: medical records, diagnosis, treatment
  └─ Rules: HIPAA compliance required

Financial Data:
  └─ Examples: credit card, salary, transactions
  └─ Rules: SOX compliance, immutable logs

Public Data:
  └─ Examples: product catalog, published reports
  └─ Rules: No special protection needed
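
Classification only helps if it is applied consistently. A Python sketch of a name-based column classifier mapping columns to the classes above; the keyword heuristic is a simplifying assumption (real catalogs tag columns explicitly):

# Python: Name-based column classifier (heuristic; real catalogs tag explicitly)
CLASS_RULES = {
    "PII":       ("name", "email", "phone", "ssn", "address", "ip"),
    "FINANCIAL": ("card", "salary", "transaction"),
}

def classify(column):
    col = column.lower()
    for cls, keywords in CLASS_RULES.items():
        if any(k in col for k in keywords):
            return cls
    return "PUBLIC"

print(classify("customer_email"))  # -> PII
print(classify("product_title"))   # -> PUBLIC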

Implementation: Access Control

-- Option 1: Row-level security (RLS, Postgres-style)
ALTER TABLE gold_dim_customer ENABLE ROW LEVEL SECURITY;

CREATE POLICY customer_privacy ON gold_dim_customer
  USING (customer_id IN (
    SELECT customer_id FROM user_access_permissions
    WHERE user_id = current_user_id()  -- application-defined helper
  ));

-- Query: Even with SELECT *, only sees authorized rows
SELECT * FROM gold_dim_customer;
-- Result: Only customers the user is authorized to see

Implementation: Column-Level Security

-- Option 2: Mask sensitive columns (fail closed: default to masked)
CREATE VIEW gold_dim_customer_safe AS
SELECT
  customer_id,
  customer_name,
  CASE
    WHEN CURRENT_ROLE() = 'ADMIN'  -- Snowflake-style; engines vary
    THEN email
    ELSE '***@example.com'
  END AS email,
  segment
FROM gold_dim_customer;

-- Analyst sees masked email
-- Admin sees real email

Implementation: Data Redaction

# Python: Redact SSN in logs
import re

def redact_pii(text):
    ssn_pattern = r'\d{3}-\d{2}-\d{4}'
    return re.sub(ssn_pattern, 'XXX-XX-XXXX', text)

# Before: "Customer 123 with SSN 123-45-6789"
# After: "Customer 123 with SSN XXX-XX-XXXX"

Cost Management: Following the Money

Cost Breakdown: Where Does Money Go?

BigQuery Project: $100K/month

├─ Data scanning: $75K (majority)
│  └─ 12 PB (~12,000 TB) scanned × $6.25/TB = $75K
│  └─ Cause: Inefficient queries, no partition pruning
├─ Storage: $15K
│  └─ 2.4 PB × $6.25/TB/month = $15K
│  └─ Cause: Not deleting stale data
├─ Slots (reserved compute): $8K
│  └─ ~275 slots reserved × $0.04/slot/hour × 730 hrs ≈ $8K
│  └─ Cause: Predictable workload, reserved beats on-demand
└─ Export/egress: $2K
   └─ Data transferred OUT of Google Cloud
   └─ Cause: Exporting to data lakes

Optimization Strategies

Strategy                                             Impact                   Effort
Partition pruning (WHERE on partition column)        -40% scan cost           Low
Column selection (SELECT only needed columns)        -30% scan cost           Low
Clustering on foreign keys                           -20% scan cost           Medium
Materialized views (pre-aggregate common queries)    -50% for those queries   Medium
Data retention policy (delete > 1 year old)          -20% storage             Low
Reserved slots (annual commitment)                   -40% compute cost        Medium

Cost Monitoring

-- BigQuery: Monitor project-level costs
SELECT
  DATE(creation_time) AS query_date,
  user_email,
  COUNT(*) AS num_queries,
  ROUND(SUM(total_bytes_billed) / 1e9, 2) AS total_gb_billed,
  ROUND(SUM(total_bytes_billed) / 1e12 * 6.25, 2) AS est_cost_usd  -- $6.25/TB on-demand
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY 5 DESC;

-- Alert: If user_email's costs spike, investigate
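
The alert itself can be a simple spike detector over per-user daily spend, fed by the query above. A Python sketch; the 3x threshold is an illustrative assumption:

# Python: Flag users whose latest daily spend jumps above their trailing average
from statistics import mean

def cost_spikes(daily_costs, factor=3.0):
    """daily_costs: {user_email: [day1_usd, day2_usd, ...]}"""
    alerts = []
    for user, costs in daily_costs.items():
        if len(costs) < 2:
            continue  # need a baseline
        baseline = mean(costs[:-1])
        if baseline > 0 and costs[-1] > factor * baseline:
            alerts.append((user, costs[-1], round(baseline, 2)))
    return alerts

print(cost_spikes({"alice@corp.com": [12.0, 15.0, 11.0, 120.0]}))
# -> [('alice@corp.com', 120.0, 12.67)]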

Data Retention & Archival: Managing History

Retention Policy:

Gold layer: 5 years (SLA required)
Silver layer: 2 years (for debugging)
Bronze layer: 90 days (retaining raw history is cost-prohibitive)

Why?:
  - Gold: Business/legal requirements (5-year audits)
  - Silver: Good enough for recent debugging (2 years)
  - Bronze: Cost savings (raw data is cheap to re-extract from source)
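
The policy itself is worth keeping in code or config so every cleanup job, like the SQL implementation below, derives its cutoff from one place. A minimal Python sketch:

# Python: Retention policy as data; each cleanup job derives its cutoff here
from datetime import date, timedelta

RETENTION_DAYS = {"gold": 5 * 365, "silver": 2 * 365, "bronze": 90}

def cutoff(layer, today=None):
    """Rows older than this date are eligible for archive/delete."""
    today = today or date.today()
    return today - timedelta(days=RETENTION_DAYS[layer])

print(cutoff("bronze", date(2024, 6, 1)))  # -> 2024-03-03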

Implementation

-- Automated archival: Move old data to cheap storage

-- Month-end job: Archive old partitions

-- Step 1: Export partitions older than 5 years to Cloud Storage (cheaper)
EXPORT DATA OPTIONS(
  uri = 'gs://archive-bucket/fact_sales/*.parquet',
  format = 'PARQUET',
  compression = 'SNAPPY'  -- compress before archival
) AS
SELECT * FROM gold.fact_sales
WHERE sale_date < DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR);

-- Step 2: Only after the export succeeds, delete the archived partitions
DELETE FROM gold.fact_sales
WHERE sale_date < DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR);

Summary: Enterprise Data Management Stack

graph TB
    A["Data Governance<br/>RACI Framework"]
    B["Data Quality<br/>dbt Tests"]
    C["Lineage Tracking<br/>dbt + Data Catalog"]
    D["Metadata Management<br/>Collateral/Datahub"]
    E["Security & Privacy<br/>RLS, Masking"]
    F["Cost Management<br/>Monitoring"]

    A -.-> B
    B -.-> C
    C -.-> D
    D -.-> E
    E -.-> F

Deep-Dives

Deep Dive: Data Governance — RACI, ownership models, policy enforcement.

Deep Dive: Data Quality — Test frameworks, SLAs, monitoring dashboards.

Deep Dive: Data Lineage — Lineage tools, impact analysis, compliance automation.


Key Takeaways

Data governance requires clear ownership (data owner vs. steward vs. consumer).

Quality is measured via tests and SLAs, not gut feeling.

Lineage tracking enables fast debugging and compliance.

Metadata catalog is the single source of truth.

Security & privacy require role-based access, masking, encryption.

Cost is manageable with monitoring, retention policies, optimization.


Practice Questions

  1. Define the RACI for a data table ownership model.
  2. Design a data quality framework with 3 levels (Bronze, Silver, Gold).
  3. How would you implement column-level security for PII?
  4. What metadata should be tracked for every table?
  5. How would you investigate the root cause of a data quality issue using lineage?