9.01 · Deep Dive: Data Governance Frameworks
Level: Advanced
Time to read: 16 min
Pre-reading: 09 · Enterprise Data Management
After reading: You'll understand governance frameworks, ownership models, policy enforcement, and organizational structures for managing data assets.
Data Governance: Who Decides What?
Data Governance = Framework for defining roles, responsibilities, policies, and accountability for data management.
The Problem (Without Governance)
Startup (no governance):
└─ 1 analyst, 1 warehouse
"It works!"
Growth (chaos):
├─ 50 analysts
├─ 5 data warehouses
├─ 100+ data models
├─ Questions:
│ ├─ Who owns this data?
│ ├─ Is this certified?
│ ├─ Can I delete it?
│ ├─ Who can access it?
│ └─ Why is it wrong?
└─ No answers → disaster
Governance Framework: RACI Model
RACI = Responsible, Accountable, Consulted, Informed
| Component | Responsible | Accountable | Consulted | Informed |
|---|---|---|---|---|
| Data Quality | Data quality team | Data owner | Tech lead | All users |
| Data Access | Security team | Data owner | Privacy officer | Access requesters |
| Metadata | Data catalog team | Business owner | Analytics lead | All teams |
| Cost control | Finance team | Platform lead | Data consumers | Executives |
| Schema changes | Data engineer | Data architect | Stakeholders | Users |
| Retention policy | Compliance team | Data owner | Legal, Privacy | Data producers |
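A RACI matrix is easy to keep machine-readable, so that "who is accountable for X?" becomes a lookup rather than a meeting. The following is a minimal sketch, assuming the table above is stored as a plain Python dictionary; the `RaciEntry` class and both helper functions are illustrative names, not part of any governance tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaciEntry:
    responsible: str
    accountable: str  # by convention, one accountable party per row
    consulted: tuple[str, ...]
    informed: tuple[str, ...]


# Rows copied from the RACI table above.
RACI = {
    "Data Quality": RaciEntry("Data quality team", "Data owner", ("Tech lead",), ("All users",)),
    "Data Access": RaciEntry("Security team", "Data owner", ("Privacy officer",), ("Access requesters",)),
    "Metadata": RaciEntry("Data catalog team", "Business owner", ("Analytics lead",), ("All teams",)),
    "Cost control": RaciEntry("Finance team", "Platform lead", ("Data consumers",), ("Executives",)),
    "Schema changes": RaciEntry("Data engineer", "Data architect", ("Stakeholders",), ("Users",)),
    "Retention policy": RaciEntry("Compliance team", "Data owner", ("Legal", "Privacy"), ("Data producers",)),
}


def accountable_for(component: str) -> str:
    """Return the single accountable party for a governance component."""
    return RACI[component].accountable


def components_accountable_to(role: str) -> list[str]:
    """List every component for which `role` is the accountable party."""
    return [name for name, entry in RACI.items() if entry.accountable == role]


print(accountable_for("Schema changes"))        # Data architect
print(components_accountable_to("Data owner"))  # ['Data Quality', 'Data Access', 'Retention policy']
```

Storing the matrix as data also makes it possible to check automatically that no component is left without an accountable party.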
Role Definitions
Data Owner (Executive/Domain Lead)
Accountability: Certified data, access policies, business rules
Responsibilities:
- ✅ Define data quality standards (accuracy, completeness, timeliness)
- ✅ Approve access requests
- ✅ Certify data for production use
- ✅ Sign off on retention policies
- ✅ Escalate data quality issues
- ✅ Champion data culture in domain
Example: "VP Sales owns the customer dimension (the authoritative source of truth for customer_id)"
Data Steward (Data Team)
Accountability: Day-to-day data production and maintenance
Responsibilities:
- ✅ Implement quality checks
- ✅ Produce and maintain data pipelines
- ✅ Respond to quality issues
- ✅ Document data lineage
- ✅ Update metadata (definitions, ownership)
- ✅ Support data access requests
Example: "Data engineer maintains customer dimension table and runs nightly quality tests"
Data Custodian (Security/Compliance)
Accountability: Access control, privacy, compliance
Responsibilities:
- ✅ Enforce access policies
- ✅ Manage encryption and secrets
- ✅ Audit access logs
- ✅ Handle PII/sensitive data
- ✅ Ensure compliance (GDPR, CCPA, SOC 2)
Example: "Security team enforces role-based access control (RBAC) to PII fields"
Data Consumer (Business Users)
Accountability: Using data correctly, reporting quality issues
Responsibilities:
- ✅ Use data according to policies
- ✅ Report data quality issues
- ✅ Request access through proper channels
- ✅ Follow retention/deletion requirements
Example: "Marketing analyst uses customer dimension, reports accuracy issues"
Governance Policies
Policy 1: Data Classification
# Classify data by sensitivity
Levels:
- PUBLIC: No sensitivity restrictions (e.g., product names)
- INTERNAL: Internal use only (e.g., internal IDs, operational metrics)
- CONFIDENTIAL: Sensitive business data (e.g., pricing)
- RESTRICTED: PII/regulated (e.g., SSN, email, phone)
Example:
dim_customer:
- customer_id: INTERNAL
- email: RESTRICTED (PII)
- phone: RESTRICTED (PII)
- created_at: INTERNAL
Enforcement:
- RESTRICTED fields require encryption
- Only authorized roles can query
- Audit all access (immutable logs)
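One way to make the classification enforceable is to treat the levels as an ordered scale and compare each column's level against a role's clearance. The sketch below assumes that ordering (PUBLIC < INTERNAL < CONFIDENTIAL < RESTRICTED); the `ROLE_CLEARANCE` mapping and the function names are hypothetical and only illustrate the idea.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Column tags for dim_customer, taken from the example above.
DIM_CUSTOMER = {
    "customer_id": Sensitivity.INTERNAL,
    "email": Sensitivity.RESTRICTED,
    "phone": Sensitivity.RESTRICTED,
    "created_at": Sensitivity.INTERNAL,
}

# Hypothetical mapping: the highest level each role is cleared to read.
ROLE_CLEARANCE = {
    "analyst_sales": Sensitivity.INTERNAL,
    "analyst_marketing": Sensitivity.RESTRICTED,
}


def readable_columns(role: str, table: dict[str, Sensitivity]) -> list[str]:
    """Columns whose sensitivity does not exceed the role's clearance."""
    # Unknown roles fall back to PUBLIC, i.e. deny by default.
    clearance = ROLE_CLEARANCE.get(role, Sensitivity.PUBLIC)
    return [col for col, level in table.items() if level <= clearance]


def columns_requiring_encryption(table: dict[str, Sensitivity]) -> list[str]:
    """RESTRICTED columns, which the policy says must be encrypted."""
    return [col for col, level in table.items() if level == Sensitivity.RESTRICTED]


print(readable_columns("analyst_sales", DIM_CUSTOMER))  # ['customer_id', 'created_at']
print(columns_requiring_encryption(DIM_CUSTOMER))       # ['email', 'phone']
```

Defaulting unknown roles to PUBLIC is a design choice in this sketch: a role that has not been explicitly cleared sees nothing sensitive.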
Policy 2: Retention & Deletion
# How long to keep data before deleting
Rules:
- Transaction data (fact_sales): 3 years
- Customer data (dim_customer): 5 years (compliance)
- Logs (system logs): 90 days
- Test data: 30 days
Exceptions:
- Financial data: 7 years (regulatory)
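- Medical data: 10 years (example value; HIPAA itself sets no retention period for medical records, so confirm against applicable state law)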
Implementation:
- Schedule daily deletion jobs
- Log all deletions (for compliance)
- Backup before deletion (30-day window for recovery)
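The core of a daily deletion job is a date comparison against the retention schedule. The following is a minimal sketch under stated assumptions: years are approximated as 365 days, the keys `system_logs` and `test_data` are illustrative labels, and each row carries a creation date. A real job would also write the deletion log and backup described above.

```python
from datetime import date, timedelta

# Retention periods from the rules above (years approximated as 365 days).
RETENTION_DAYS = {
    "fact_sales": 3 * 365,
    "dim_customer": 5 * 365,
    "system_logs": 90,   # illustrative key for "Logs"
    "test_data": 30,     # illustrative key for "Test data"
}


def is_expired(dataset: str, created_on: date, today: date) -> bool:
    """True once a row is strictly older than its dataset's retention period."""
    return today - created_on > timedelta(days=RETENTION_DAYS[dataset])


def select_for_deletion(dataset: str, rows: list[tuple[int, date]], today: date) -> list[int]:
    """Return the ids of rows (id, created_on) that are past retention."""
    return [row_id for row_id, created_on in rows if is_expired(dataset, created_on, today)]


today = date(2024, 6, 1)
log_rows = [(1, date(2024, 1, 1)), (2, date(2024, 5, 1))]
print(select_for_deletion("system_logs", log_rows, today))  # [1]
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.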
Policy 3: Change Management
# How to change data definitions
Process:
1. Owner + Data Engineer: Plan change and assess impact
2. Owner: Approve change
3. Notify consumers (email, Slack)
4. Test with sample users
5. Apply change (with rollback plan)
6. Document change (in data catalog)
7. Archive old definition (for historical reference)
Example: Renaming "status" column to "order_status"
- Impact: 23 dashboards, 5 reports
- Backward compatibility: Keep "status" available as an alias (e.g., through a view) until consumers migrate
- Rollback: Revert within 24 hours if issues
No changes without approval (prevents breaking changes)
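Impact figures such as "23 dashboards, 5 reports" typically come from walking a lineage graph downstream of the changed column. The sketch below assumes lineage is available as a simple adjacency mapping; the asset names in `DOWNSTREAM` are hypothetical, and a data catalog would normally supply this graph.

```python
from collections import deque

# Hypothetical lineage: each key is read directly by the assets in its list.
DOWNSTREAM = {
    "orders.status": ["stg_orders", "dash_ops_overview"],
    "stg_orders": ["fact_sales"],
    "fact_sales": ["dash_revenue", "report_weekly_sales"],
}


def impacted_assets(changed: str, downstream: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset downstream of `changed`."""
    seen = {changed}  # guards against cycles and duplicate visits
    impacted = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


print(impacted_assets("orders.status", DOWNSTREAM))
# ['stg_orders', 'dash_ops_overview', 'fact_sales', 'dash_revenue', 'report_weekly_sales']
```

The resulting list doubles as the notification list: every owner of an impacted asset should hear about the change before it is applied.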
Policy 4: Access Control
# Who can access what
Model: Role-Based Access Control (RBAC)
Roles:
- analyst_sales: SELECT on dim_customer, fact_sales (no PII)
- analyst_marketing: SELECT on dim_customer (full, includes PII)
- finance_user: SELECT on financial_tables only
- admin: SELECT/UPDATE/DELETE all (with audit logging)
Implementation (SQL, PostgreSQL-style; column-level grant syntax varies by warehouse):
-- analyst_sales gets non-PII columns only (no email, phone, or address)
GRANT SELECT (customer_id, created_at) ON gold.dim_customer TO analyst_sales;
-- Read-only access to sales facts
GRANT SELECT ON gold.fact_sales TO analyst_sales;
Audit Trail:
- Log all SELECT queries (who, when, how many rows)
- Flag queries returning large PII datasets
- Alert on suspicious access patterns
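Flagging large PII reads can start as a simple filter over the access log. This is a sketch only: the `AccessEvent` shape, the `PII_COLUMNS` set, and the 10,000-row threshold are assumptions chosen for illustration, and real audit logs will look different depending on the warehouse.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessEvent:
    user: str
    table: str
    columns: tuple[str, ...]
    row_count: int
    at: datetime


PII_COLUMNS = {"email", "phone"}  # RESTRICTED fields from Policy 1
LARGE_PII_ROWS = 10_000           # hypothetical alerting threshold


def flag_large_pii_reads(events: list[AccessEvent], threshold: int = LARGE_PII_ROWS) -> list[AccessEvent]:
    """Return events that read at least `threshold` rows and touch a PII column."""
    return [
        event
        for event in events
        if event.row_count >= threshold and PII_COLUMNS.intersection(event.columns)
    ]


events = [
    AccessEvent("alice", "gold.dim_customer", ("customer_id", "email"), 250_000, datetime(2024, 1, 5, 9, 0)),
    AccessEvent("bob", "gold.dim_customer", ("customer_id", "created_at"), 900_000, datetime(2024, 1, 5, 9, 5)),
    AccessEvent("carol", "gold.dim_customer", ("email",), 12, datetime(2024, 1, 5, 9, 10)),
]
print([event.user for event in flag_large_pii_reads(events)])  # ['alice']
```

Only the first event is flagged: the second is large but touches no PII, and the third touches PII but returns few rows.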
Governance Tools
| Tool | Purpose | Examples |
|---|---|---|
| Data Catalog | Metadata repository, lineage tracking | Alation, Collibra, DataHub |
| Access Management | RBAC, SSO, MFA | Okta, Azure AD (now Microsoft Entra ID), AWS IAM |
| Data Quality | Monitor quality metrics, alerting | Soda, Great Expectations, dbt tests |
| Lineage | Track data dependencies, impact analysis | Apache Atlas, OpenLineage, cloud-native lineage services |
| Compliance | PII detection, encryption, audit | Varonis, Immuta, Protegrity |
Governance Implementation Steps
Step 1: Assess Current State
Questions to Answer:
- Who owns each dataset? (currently: ?)
- What access controls exist? (currently: everyone has SELECT *)
- How are changes tracked? (currently: Slack, hope for best)
- What data is PII/sensitive? (currently: unknown)
- How often is data wrong? (currently: discovered by users)
Step 2: Define Framework
1. Identify data owners (per domain)
- Sales: VP Sales
- Marketing: VP Marketing
- Finance: CFO
2. Create data steward team
- Senior Data Engineer (lead)
- 2-3 data engineers (stewards)
- 1 data analyst (quality)
3. Establish policies
- Data classification matrix
- Retention schedule
- Change management process
- Access control rules
Step 3: Implement Tools
1. Deploy data catalog
- Document all tables
- Assign ownership
- Tag sensitive fields
- Capture lineage
2. Set up access control
- Create roles (analyst, engineer, admin)
- Implement RBAC in warehouse
- Audit all access
3. Add quality monitoring
- Define SLAs per dataset
- Run daily quality checks
- Alert on violations
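A daily quality check against an SLA can be as small as computing one metric per column and comparing it with a threshold. The sketch below uses completeness (share of non-null values) as the metric; the SLA values and sample rows are made up for illustration, and tools such as those listed earlier provide far richer checks.

```python
def completeness(rows: list[dict], column: str) -> float:
    """Fraction of rows whose value for `column` is present (not None)."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)


def sla_violations(rows: list[dict], sla: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (column, observed, required) for each column below its SLA."""
    violations = []
    for column, required in sla.items():
        observed = completeness(rows, column)
        if observed < required:
            violations.append((column, observed, required))
    return violations


# Hypothetical SLA: minimum completeness per column of dim_customer.
SLA = {"customer_id": 1.0, "email": 0.9}

sample = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
    {"customer_id": 4, "email": "d@example.com"},
]
print(sla_violations(sample, SLA))  # [('email', 0.75, 0.9)]
```

An empty result means the dataset met its SLA for the day; a non-empty one is what the alerting step would send to the data steward.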
Step 4: Monitor & Evolve
1. Monthly reviews
- Access audit (who accessed what?)
- Quality metrics (SLA pass rate?)
- Cost analysis
2. Quarterly updates
- Refresh ownership (do we still own it?)
- Update retention policies
- Review access rules
3. Annual assessment
- Framework effectiveness
- Stakeholder feedback
- Compliance audit
Common Mistakes to Avoid
- ❌ No clear ownership → Nobody responsible for quality
- ❌ Everyone is an "admin" → No security, accidental deletions
- ❌ No access audit → Don't know who accessed PII
- ❌ Change without notification → Breaking dashboards
- ❌ No retention policy → Data hoarding, storage costs
- ❌ Governance without tools → Manual processes, doesn't scale
Key Takeaways
- RACI model clarifies roles and responsibilities
- Data owner is accountable for quality and access
- Data steward implements and maintains policies
- Classify data by sensitivity (PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED)
- Audit all access to PII/sensitive data
- Change management prevents breaking changes
- Tools automate governance (catalog, access control, quality)