09 · Enterprise Data Management: Governance, Lineage, Quality, Security
Level: Advanced · Time to read: 22 min
After reading: You'll understand how to manage data warehouses at organizational scale: governance frameworks, data quality, lineage tracking, and security controls.
The Enterprise Problem: Data As An Asset
Startup Phase (Year 0):
└─ 1 analyst, 1 data warehouse, "it works!"
Growth Phase (Year 2):
├─ 50 analysts
├─ 5 warehouses (BigQuery, Snowflake, Postgres)
├─ 100+ data models
├─ Questions:
│ ├─ Who owns this data?
│ ├─ Is this data quality certified?
│ ├─ Does PII need masking?
│ ├─ Which models depend on this table?
│ └─ Why did my query cost $50K this month?
└─ No answers → chaos!
Solution: Enterprise Data Management framework.
Data Governance: Who Decides What?
Governance Framework: RACI Model
| Component | Responsible | Accountable | Consulted | Informed |
|---|---|---|---|---|
| Data Quality | Data quality team | Data owner | Technical lead | All users |
| Data Access | Security team | Data owner | Privacy officer | Requesters |
| Metadata | Data catalog team | Business owner | Analytics lead | All teams |
| Cost control | Finance | Platform lead | Data consumers | Executives |
Role Definitions
Data Owner (Business/Analytics Lead)
- Defines business rules
- Certifies data quality
- Approves access requests
- Example: "VP Sales owns the customers dimension"
Data Steward (Data Team)
- Implements owner's policies
- Produces and maintains data
- Responds to data quality issues
- Example: "Data engineer implements customer dimension"
Data Consumer (Business Users)
- Uses data for analysis
- Reports data quality issues
- Complies with access policies
- Example: "Marketing analyst uses customer dimension"
Data Quality Framework
Three Levels of Quality
Level 1: Bronze Layer (Raw)
└─ Quality: Unknown (as-is from source)

Level 2: Silver Layer (Validated)
└─ Quality: Certified (passed checks)
   ├─ Uniqueness: customer_id is unique
   ├─ Completeness: 99.9%+ non-null for required fields
   ├─ Accuracy: Spot checks against source
   └─ Timeliness: Loaded within 4 hours

Level 3: Gold Layer (Business-Ready)
└─ Quality: SLA guaranteed
   ├─ Accuracy: 99.95%+ (tested nightly)
   ├─ Freshness: Loaded by 6 AM daily
   ├─ Completeness: 100% for business-critical fields
   └─ Consistency: Matches source within tolerance
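A minimal sketch of how these per-layer thresholds could be encoded and checked in Python, assuming a profiling job has already computed the input metrics (names and threshold structure are illustrative):
# Sketch: per-layer quality thresholds mirroring the levels above
THRESHOLDS = {
    "silver": {"completeness": 99.9, "max_freshness_hours": 4},
    "gold":   {"completeness": 100.0, "accuracy": 99.95, "max_freshness_hours": 24},
}

def certify(layer: str, metrics: dict) -> bool:
    """Return True if a table's metrics meet its layer's thresholds."""
    t = THRESHOLDS[layer]
    if metrics["completeness"] < t["completeness"]:
        return False
    if metrics["freshness_hours"] > t["max_freshness_hours"]:
        return False
    if "accuracy" in t and metrics.get("accuracy", 0) < t["accuracy"]:
        return False
    return True

print(certify("silver", {"completeness": 99.95, "freshness_hours": 2}))  # True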
Implementing Quality Checks
# dbt: Automated quality tests (schema.yml)
version: 2

models:
  - name: gold_fact_sales
    description: "Daily sales facts (SLA: 99.95% accuracy)"
    columns:
      - name: sale_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:  # Foreign key check
              to: ref('gold_dim_customer')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000  # Sanity check
      - name: sale_date
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: "'2023-01-01'"  # dates quoted so they render as SQL literals
              max_value: "'2024-12-31'"
    tests:
      - dbt_utils.recency:
          datepart: day
          field: sale_date
          interval: 1  # Data loaded within 1 day
Quality Metrics: The Dashboard
-- Data Quality Scorecard (daily)
SELECT
  table_name,
  row_count,
  null_percent,
  duplicate_percent,
  freshness_hours,
  quality_score,
  quality_status
FROM data_quality_metrics
ORDER BY quality_score ASC;  -- worst tables first; alert on anything below 95

-- Output:
-- | silver_customers | 10M  | 5.0% | 2.0% | 8hrs | 88.5% | ⚠️ WARN |
-- | gold_dim_store   | 5K   | 0.2% | 0.0% | 3hrs | 99.7% | ✅ PASS |
-- | gold_fact_sales  | 125M | 0.1% | 0.0% | 2hrs | 99.8% | ✅ PASS |
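The quality_score column has to be computed somewhere upstream. One plausible formula, sketched in Python, applies weighted penalties for nulls, duplicates, and staleness; the weights here are illustrative, not the exact formula behind the scorecard above:
# Sketch: one plausible way to derive quality_score (weights are illustrative)
def quality_score(null_pct: float, dup_pct: float, freshness_hours: float,
                  sla_hours: float = 4.0) -> float:
    """Score 0-100: penalize nulls, duplicates, and staleness beyond SLA."""
    staleness_penalty = max(0.0, freshness_hours - sla_hours)  # hours over SLA
    score = 100.0 - (2.0 * null_pct) - (3.0 * dup_pct) - (0.5 * staleness_penalty)
    return max(0.0, round(score, 1))

# silver_customers from the scorecard: 5% nulls, 2% dupes, 8h freshness
print(quality_score(5.0, 2.0, 8.0))  # 82.0 with these weights; tune to taste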
Data Lineage: Tracking Origins
Why Lineage Matters
Bug found: Customer revenue is 10% too high in reports
Investigation:
1. Check Gold layer (fact_sales) ✓ data looks correct
2. Check Silver layer (joined dimensions) ✓ dimensions correct
3. Check Bronze layer (extracted data) ✗ Found double-counting!
→ Source system has duplicate orders since 2024-01-15
Without lineage:
└─ Cannot trace issue back to source (hours of debugging)
With lineage:
└─ Click "show lineage" → immediately see Bronze has duplicates
Building Lineage: Data Catalog
dbt (Data Build Tool) provides lineage automatically:
┌─────────────────┐
│ raw.customers   │
└────────┬────────┘
         │
┌────────▼────────────┐
│ stg_customers       │ ← Staging layer
│ (clean, validate)   │
└────────┬────────────┘
         │
      ┌──┴───┐
      │      │
┌─────▼──┐ ┌─▼────────────┐
│ orders │ │ customer dim │
└─────┬──┘ └─┬────────────┘
      │      │
┌─────┴──────▼────────┐
│ fact_sales          │ ← Gold fact table
│ (denormalized)      │
└─────────┬───────────┘
          │
     ┌────┴────────┐
     │             │
┌────▼─────┐ ┌─────▼───────┐
│ Looker   │ │ Jupyter NB  │
└──────────┘ └─────────────┘
dbt produces:
- Lineage graph (visual DAG)
- Impact analysis ("What breaks if I change this table?")
- Documentation with column-level lineage
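dbt writes this graph to target/manifest.json after a compile or run; its child_map can be walked for a quick impact analysis. A sketch, assuming a compiled project named my_project (the node ID is illustrative):
# Sketch: downstream impact analysis from dbt's target/manifest.json
import json

def downstream(manifest_path: str, node_id: str) -> set:
    """Return all nodes transitively downstream of node_id."""
    with open(manifest_path) as f:
        child_map = json.load(f)["child_map"]
    seen, stack = set(), [node_id]
    while stack:
        for child in child_map.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# "What breaks if stg_customers changes?"
impacted = downstream("target/manifest.json", "model.my_project.stg_customers")
print(sorted(impacted))  # e.g. fact_sales and every exposure built on it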
Metadata Management: The Single Source of Truth
What is Metadata?
Data = "Alice, 123 Main St, Alice@example.com"
↓
Metadata = "Alice is a String (name), St is Address (location), email..."
Metadata Catalog (Collibra, DataHub, Apache Atlas)
Table: gold_dim_customer
├─ Owner: VP Sales
├─ Freshness: Daily 2 AM
├─ Accuracy: 99.95%
├─ PII: Yes (email, address)
├─ Sensitivity: Internal only
├─ Documentation: "Customer master dimension, SCD Type 2"
├─ Columns:
│ ├─ customer_id (INT, NOT NULL) — Business key
│ ├─ customer_name (VARCHAR 500) — Full name
│ ├─ email (VARCHAR) — PII, encrypt at rest
│ └─ segment (VARCHAR) — 'Premium' | 'Standard' | 'At-Risk'
├─ Upstream:
│ ├─ silver.customers
│ ├─ silver.support_tiers
│ └─ silver.marketing_segments
└─ Downstream:
├─ fact_sales (joined for revenue reports)
├─ fact_support (joined for NPS analysis)
└─ Looker dashboard "Sales by Segment"
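Catalogs store entries like this as structured records. A sketch of the same entry as a typed Python object (field names are illustrative, not any particular catalog's schema):
# Sketch: a catalog entry as a typed record (field names are illustrative)
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    table: str
    owner: str
    freshness: str
    contains_pii: bool
    sensitivity: str
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)

entry = CatalogEntry(
    table="gold_dim_customer",
    owner="VP Sales",
    freshness="Daily 2 AM",
    contains_pii=True,  # email, address
    sensitivity="Internal only",
    upstream=["silver.customers", "silver.support_tiers", "silver.marketing_segments"],
    downstream=["fact_sales", "fact_support", "Looker: Sales by Segment"],
)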
Benefits
✅ Self-service: Users can find data without asking
✅ Trust: See owner, quality, freshness in one place
✅ Governance: Enforce PII masking, retention policies
✅ Impact analysis: Know what breaks if you change something
Data Security & Privacy: Protecting PII
Classification: What Data Needs Protection?
PII (Personally Identifiable Information):
├─ Examples: name, email, phone, SSN, address, IP address
└─ Rules: Encrypt at rest, masked for dev, access audit trail

PHI (Protected Health Info):
├─ Examples: medical records, diagnosis, treatment
└─ Rules: HIPAA compliance required

Financial Data:
├─ Examples: credit card, salary, transactions
└─ Rules: SOX compliance, immutable logs

Public Data:
├─ Examples: product catalog, published reports
└─ Rules: No special protection needed
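A sketch of how these classifications can drive handling rules programmatically; the rule values mirror the list above, and the dictionary and function names are illustrative:
# Sketch: drive column handling from its classification tag
HANDLING = {
    "PII":       {"encrypt_at_rest": True,  "mask_in_dev": True,  "audit_access": True},
    "PHI":       {"encrypt_at_rest": True,  "mask_in_dev": True,  "audit_access": True},   # + HIPAA
    "FINANCIAL": {"encrypt_at_rest": True,  "mask_in_dev": False, "audit_access": True},   # + SOX
    "PUBLIC":    {"encrypt_at_rest": False, "mask_in_dev": False, "audit_access": False},
}

def must_mask(column_classification: str, environment: str) -> bool:
    """Should this column be masked in the given environment?"""
    rules = HANDLING[column_classification]
    return environment != "prod" and rules["mask_in_dev"]

print(must_mask("PII", "dev"))   # True
print(must_mask("PII", "prod"))  # False: prod relies on RLS/column security instead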
Implementation: Access Control
-- Option 1: Row-level security (RLS), PostgreSQL syntax
ALTER TABLE gold_dim_customer ENABLE ROW LEVEL SECURITY;

CREATE POLICY customer_privacy ON gold_dim_customer
USING (customer_id IN (
  SELECT customer_id FROM user_access_permissions
  WHERE user_id = current_user_id()  -- app-defined helper mapping the session to a user
));

-- Query: Even with SELECT *, the user only sees authorized rows
SELECT * FROM gold_dim_customer;
-- Result: Only customers the user is authorized to see
Implementation: Column-Level Security
-- Option 2: Mask sensitive columns behind a view
CREATE VIEW gold_dim_customer_safe AS
SELECT
  customer_id,
  customer_name,
  CASE
    WHEN CURRENT_ROLE() = 'ANALYST'  -- Snowflake; other warehouses expose similar session functions
    THEN '***@example.com'
    ELSE email
  END AS email,
  segment
FROM gold_dim_customer;

-- Analyst sees masked email
-- Admin sees real email
Implementation: Data Redaction
# Python: Redact SSN in logs
import re

def redact_pii(text: str) -> str:
    """Replace anything shaped like a US SSN with a fixed mask."""
    ssn_pattern = r'\d{3}-\d{2}-\d{4}'
    return re.sub(ssn_pattern, 'XXX-XX-XXXX', text)

# Before: "Customer 123 with SSN 123-45-6789"
# After:  "Customer 123 with SSN XXX-XX-XXXX"
Cost Management: Following the Money
Cost Breakdown: Where Does Money Go?
BigQuery Project: $100K/month
├─ Data scanning: $75K (majority)
│   └─ 12 PB scanned × $6.25/TB = $75K
│   └─ Cause: Inefficient queries, no partition pruning
│
├─ Storage: $15K
│   └─ 2.4 PB × $6.25/TB/month = $15K
│   └─ Cause: Not deleting stale data
│
├─ Slots (reserved compute): $8K
│   └─ ~275 slots reserved × $0.04/slot/hour × 730 hrs ≈ $8K
│   └─ Cause: Predictable workload, reserved beats on-demand
│
└─ Export/egress: $2K
    └─ Data transferred OUT of Google Cloud
    └─ Cause: Exporting to data lakes
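The arithmetic above is easy to encode. A small Python sketch that re-derives the monthly bill from the same unit figures (these mirror the breakdown above; check current pricing before relying on them):
# Sketch: re-derive the monthly bill from the unit prices used above
SCAN_PRICE_PER_TB = 6.25           # on-demand scanning
STORAGE_PRICE_PER_TB_MONTH = 6.25  # illustrative figure from the breakdown
SLOT_PRICE_PER_HOUR = 0.04
HOURS_PER_MONTH = 730

def monthly_cost(scanned_tb, stored_tb, slots, egress_usd):
    scanning = scanned_tb * SCAN_PRICE_PER_TB
    storage = stored_tb * STORAGE_PRICE_PER_TB_MONTH
    compute = slots * SLOT_PRICE_PER_HOUR * HOURS_PER_MONTH
    return scanning + storage + compute + egress_usd

# 12 PB scanned, 2.4 PB stored, ~275 reserved slots, $2K egress
print(f"${monthly_cost(12_000, 2_400, 275, 2_000):,.0f}")  # ≈ $100,030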
Optimization Strategies
| Strategy | Impact | Effort |
|---|---|---|
| Partition pruning (WHERE on partition column) | -40% | Low |
| Column selection (SELECT only needed columns) | -30% | Low |
| Clustering on foreign keys | -20% | Medium |
| Materialized views (pre-aggregate common queries) | -50% for those queries | Medium |
| Data retention policy (delete > 1 year old) | -20% storage | Low |
| Reserved slots (annual commitment) | -40% compute cost | Medium |
Cost Monitoring
-- BigQuery: Monitor per-user query costs
SELECT
  DATE(creation_time) AS query_date,
  user_email,
  COUNT(*) AS num_queries,
  ROUND(SUM(total_bytes_billed) / 1e9, 2) AS total_gb_billed,
  ROUND(SUM(total_bytes_billed) / 1e12 * 6.25, 2) AS est_cost_usd  -- on-demand $6.25/TB
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY 5 DESC;

-- Alert: If a user's costs spike, investigate
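The "investigate spikes" step can be automated. A Python sketch that flags users whose latest daily spend jumps well above their trailing average (the threshold values are illustrative):
# Sketch: flag users whose daily spend jumps vs. their trailing average
from statistics import mean

def cost_spikes(daily_costs: dict, factor: float = 3.0, min_usd: float = 100.0):
    """daily_costs: {user_email: [day1_usd, ..., dayN_usd]}, oldest to newest.
    Flags users whose latest day exceeds factor x their trailing average."""
    alerts = []
    for user, costs in daily_costs.items():
        if len(costs) < 2:
            continue
        baseline = mean(costs[:-1])
        today = costs[-1]
        if today >= min_usd and today > factor * baseline:
            alerts.append((user, today, baseline))
    return alerts

print(cost_spikes({"analyst@corp.com": [40, 55, 35, 50, 900]}))
# [('analyst@corp.com', 900, 45.0)]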
Data Retention & Archival: Managing History
Retention Policy:
Gold layer:   5 years (SLA required)
Silver layer: 2 years (for debugging)
Bronze layer: 90 days (raw data is costly to retain at scale)

Why?:
- Gold: Business/legal requirements (5-year audits)
- Silver: Good enough for recent debugging (2 years)
- Bronze: Cost savings (raw data is cheap to re-extract from source)
Implementation
-- Automated archival: Move old data to cheap storage
-- Month-end job, step 1: Export expiring partitions to Cloud Storage (cheaper)
EXPORT DATA OPTIONS(
  uri='gs://archive-bucket/fact_sales/*.parquet',
  format='PARQUET'
) AS
SELECT * FROM gold.fact_sales
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 YEAR)
  AND sale_date < DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR);

-- Step 2: Only after the export succeeds, delete partitions older than 5 years
DELETE FROM gold.fact_sales
WHERE sale_date < DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR);

-- (On Spark/Hive you could compress first with
--  ALTER TABLE ... SET TBLPROPERTIES ('parquet.compression'='snappy');
--  note that syntax is not BigQuery SQL.)
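When this job runs from an orchestrator (Airflow, cron), the date windows are usually computed outside SQL. A short sketch; the bucket path is the one assumed above:
# Sketch: compute the archival cutoff and target path for the month-end job
from datetime import date

def archival_window(today: date, retention_years: int = 5):
    """Everything strictly older than the cutoff gets exported, then deleted."""
    cutoff = today.replace(year=today.year - retention_years)  # naive: Feb 29 needs handling
    uri = f"gs://archive-bucket/fact_sales/{today:%Y-%m}/*.parquet"
    return cutoff, uri

cutoff, uri = archival_window(date(2025, 1, 31))
print(cutoff, uri)  # 2020-01-31 gs://archive-bucket/fact_sales/2025-01/*.parquet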
Summary: Enterprise Data Management Stack
graph TB
A["Data Governance<br/>RACI Framework"]
B["Data Quality<br/>dbt Tests"]
C["Lineage Tracking<br/>dbt + Data Catalog"]
D["Metadata Management<br/>Collateral/Datahub"]
E["Security & Privacy<br/>RLS, Masking"]
F["Cost Management<br/>Monitoring"]
A -.-> B
B -.-> C
C -.-> D
D -.-> E
E -.-> F
Deep-Dives
→ Deep Dive: Data Governance — RACI, ownership models, policy enforcement.
→ Deep Dive: Data Quality — Test frameworks, SLAs, monitoring dashboards.
→ Deep Dive: Data Lineage — Lineage tools, impact analysis, compliance automation.
Key Takeaways
✅ Data governance requires clear ownership (data owner vs. steward vs. consumer).
✅ Quality is measured via tests and SLAs, not gut feeling.
✅ Lineage tracking enables fast debugging and compliance.
✅ Metadata catalog is the single source of truth.
✅ Security & privacy require role-based access, masking, encryption.
✅ Cost is manageable with monitoring, retention policies, optimization.
Practice Questions
- Define the RACI for a data table ownership model.
- Design a data quality framework with 3 levels (Bronze, Silver, Gold).
- How would you implement column-level security for PII?
- What metadata should be tracked for every table?
- How would you investigate the root cause of a data quality issue using lineage?