01 · Foundations: OLTP vs OLAP, Systems Taxonomy, Why Data Warehouses
Level: Beginner · Time to read: 20 min
After reading: You'll understand why data warehouses exist as distinct systems separate from operational databases.
The Fundamental Problem: Operational Databases vs. Analytical Queries
Imagine you run an e-commerce platform:
- Operational need: "Process this order now" (fast writes, immediate consistency)
- Analytical need: "What were our top 100 products last quarter by region?" (large scans; slower reads and eventual consistency are acceptable)
Attempt 1: Use the same database for both.
- ❌ Operational queries slow down when analyzing 2 years of historical data
- ❌ Complex analytical queries lock rows for minutes, blocking order processing
- ❌ Denormalization helps analysis but introduces update anomalies on the operational side
Attempt 2: Separate systems — OLTP for operations, OLAP for analytics.
- ✅ Each system optimized for its access pattern
- ✅ No contention between transactional and analytical workloads
- ⚠️ Adds complexity: ETL/ELT, data consistency, latency
This separation is one of the most important architectural decisions in data-intensive organizations.
OLTP: Online Transaction Processing
Purpose: Handle real-time, transactional workloads.
Characteristics
| Aspect | Detail |
|---|---|
| Access Pattern | Small, frequent writes and updates |
| Data Model | Normalized (3NF or BCNF) to minimize anomalies |
| Key Constraint | ACID compliance (Atomicity, Consistency, Isolation, Durability) |
| Latency Target | Milliseconds to seconds |
| Storage | Row-oriented (InnoDB, PostgreSQL, Oracle) |
| Index Strategy | B-tree indexes on primary keys, foreign keys, frequently filtered columns |
| Scalability Strategy | Vertical scaling, replication for HA, sharding for partitioning |
What OLTP Does Well
Order Processing System:
├── INSERT new_order (1 row)
├── UPDATE inventory (multiple rows, atomic)
├── INSERT payment_transaction (1 row)
└── Commit or rollback entire transaction
- Strong consistency: All writes immediately visible to reads
- Low latency writes: Optimized for INSERT/UPDATE on single or small row sets
- Referential integrity: Foreign keys prevent invalid states
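The commit-or-rollback flow above can be sketched end to end. This is a minimal illustration using Python's built-in sqlite3 with hypothetical inventory and orders tables; a production OLTP system would be PostgreSQL or MySQL, but the ACID idea is the same:

```python
import sqlite3

# In-memory SQLite stands in for an OLTP database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, "
    "stock INTEGER NOT NULL CHECK (stock >= 0))"
)
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES (1, 5)")

def place_order(order_id, product_id, qty):
    """Insert the order and decrement stock atomically; roll back on failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, product_id, qty))
            conn.execute("UPDATE inventory SET stock = stock - ? WHERE product_id = ?",
                         (qty, product_id))
        return True
    except sqlite3.IntegrityError:
        return False

assert place_order(100, 1, 3) is True   # succeeds: stock 5 -> 2
assert place_order(101, 1, 4) is False  # violates CHECK (stock >= 0): whole txn rolled back
stock = conn.execute("SELECT stock FROM inventory WHERE product_id = 1").fetchone()[0]
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(stock, n_orders)  # 2 1
```

If any statement inside the `with conn:` block fails, the whole transaction rolls back, including the already-executed order insert, which is exactly the guarantee the order-processing flow depends on.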
Why OLTP Struggles with Analytics
-- Analytical query on OLTP system: 😱 SLOW
SELECT
  product_category,
  EXTRACT(QUARTER FROM order_date) AS order_quarter,
  SUM(amount) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2024-12-31'
GROUP BY 1, 2
ORDER BY 3 DESC;
- Normalization penalty: Query needs to JOIN 5+ tables
- Row-oriented storage: Fetches entire rows even if you only need 2 columns from millions of rows
- Locking contention: Heavy analytical query blocks transactional writes
OLAP: Online Analytical Processing
Purpose: Enable fast, complex analytical queries on historical data.
Characteristics
| Aspect | Detail |
|---|---|
| Access Pattern | Large, infrequent reads (batch or interactive) |
| Data Model | Denormalized (star/snowflake schema) to minimize join depth |
| Key Constraint | Eventual consistency acceptable; historical accuracy required |
| Latency Target | Seconds to minutes (sometimes hours for batch) |
| Storage | Columnar (Parquet, ORC, BigQuery) for compression and selective column reads |
| Index Strategy | Partitioning by date/region, clustering by frequently-filtered dimensions |
| Scalability Strategy | Horizontal scaling via distributed/parallel query execution |
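To make the "partitioning by date" row concrete, here is a toy sketch of partition pruning: a query with a date filter only scans the partitions whose key falls in range. The table and values are invented for illustration:

```python
from datetime import date

# Toy date-partitioned table: rows stored in per-day buckets keyed by the partition column.
partitions = {
    date(2024, 1, 1): [("books", 20.0), ("games", 35.0)],
    date(2024, 1, 2): [("books", 15.0)],
    date(2024, 2, 1): [("toys", 50.0)],
}

def revenue_between(start, end):
    """Partition pruning: only buckets whose key falls inside the range are scanned."""
    scanned = [d for d in partitions if start <= d <= end]
    total = sum(amount for d in scanned for _, amount in partitions[d])
    return total, len(scanned)

total, parts_scanned = revenue_between(date(2024, 1, 1), date(2024, 1, 31))
assert total == 70.0 and parts_scanned == 2  # the February partition is never touched
```

Real warehouses apply the same idea at much coarser granularity: a `WHERE order_date >= ...` predicate lets the engine skip whole partitions without reading them.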
What OLAP Does Well
Analytics System:
├── Bulk INSERT from ETL (millions of rows)
├── UPDATE SCD Type 2 dimension (close old row, insert new version; preserves history)
├── Complex analytical SELECT (aggregations over weeks of data)
└── No transactional isolation needed
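The "UPDATE SCD Type 2 dimension" step deserves a concrete sketch. Under SCD Type 2, attribute changes never overwrite a row: the current version is closed out and a new version is inserted. This is a minimal in-memory illustration with a hypothetical customer dimension:

```python
from datetime import date

# Hypothetical SCD Type 2 customer dimension: every historical version is kept.
dim_customer = [
    {"customer_id": 7, "city": "Austin", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, as_of):
    """Close the current version (if changed) and append a new current row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no attribute change: nothing to version
            row["valid_to"] = as_of      # close out the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": as_of, "valid_to": None, "is_current": True})

apply_scd2(dim_customer, 7, "Denver", date(2024, 6, 1))
assert len(dim_customer) == 2                  # both versions preserved
assert dim_customer[0]["is_current"] is False  # old row closed, not overwritten
assert dim_customer[1]["city"] == "Denver"
```

Because history is append-mostly, facts recorded while the customer lived in Austin can still join to the Austin version of the row.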
- Columnar compression: 10–100x compression vs. uncompressed row storage
- Fast aggregations: Processes only needed columns, not entire rows
- Parallelization: Distributes query execution across servers
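The columnar advantages above can be shown with a toy sketch: in a column layout, an aggregation over one column never touches the others, and low-cardinality columns (like category) compress well under run-length or dictionary encoding. Illustrative data only:

```python
# Toy contrast between row-oriented and column-oriented layouts.
rows = [  # row store: each record kept together on "disk"
    {"order_id": i, "customer_id": i % 100, "category": "books", "amount": 10.0}
    for i in range(1_000)
]

# Column store: one array per column; a query touching only `amount`
# never reads order_id, customer_id, or category at all.
columns = {
    "order_id":    [r["order_id"] for r in rows],
    "customer_id": [r["customer_id"] for r in rows],
    "category":    [r["category"] for r in rows],
    "amount":      [r["amount"] for r in rows],
}

revenue_row_store = sum(r["amount"] for r in rows)  # must materialize whole rows
revenue_col_store = sum(columns["amount"])          # scans one contiguous array
assert revenue_row_store == revenue_col_store == 10_000.0
```

Real columnar formats (Parquet, ORC) add encodings and per-column statistics on top of this layout, which is where the 10-100x compression and the scan-skipping come from.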
Why OLAP Struggles with Transactions
-- Transactional operation on OLAP system: 😱 SLOW OR UNSUPPORTED
BEGIN TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE id = 123;
UPDATE account SET balance = balance + 100 WHERE id = 456;
COMMIT;
- Eventual consistency: Another query might see balances in intermediate state
- Slow updates: Columnar updates require rewriting entire column segments
- Weak transactional guarantees: many distributed analytical engines trade strict isolation for throughput and availability
Architecture: Separation-of-Concerns Pattern
graph TB
A[Operational Data Sources<br/>Web App, Microservices, APIs]
B[OLTP Database<br/>PostgreSQL, MySQL, MongoDB]
C[ETL/ELT Pipeline<br/>Airflow, dbt, Dataflow]
D[Data Warehouse<br/>BigQuery, Snowflake, Databricks]
E[OLAP Queries]
F[Analytics & BI<br/>Tableau, Looker, Jupyter]
A -->|Real-time writes| B
B -->|Daily/Hourly Extract| C
C -->|Transform & Load| D
D -->|SQL/DataFrame API| E
E -->|Serve insights| F
Data Flow
| Stage | System | Responsibility |
|---|---|---|
| Source | Operational systems | Generate transactional data (orders, clicks, logs) |
| Stage 1: Extract | ETL/ELT | Read from OLTP without impacting production |
| Stage 2: Transform | ETL/ELT or Warehouse | Join, aggregate, apply business logic, handle quality |
| Stage 3: Load | OLAP Warehouse | Store in optimized dimensional/analytical structure |
| Stage 4: Query | Analytical tools | Run complex queries for BI, data science, reporting |
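The four stages can be wired together in a minimal sketch. Here both the OLTP source and the warehouse are in-memory SQLite, and the table names and grain (month by category) are invented for illustration; real pipelines would use the tools named above (Airflow, dbt, etc.):

```python
import sqlite3

# Hypothetical OLTP source with a handful of orders.
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         category TEXT, amount REAL, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 10, 'books', 20.0, '2024-01-05'),
        (2, 11, 'games', 35.0, '2024-01-06'),
        (3, 10, 'books', 15.0, '2024-02-01');
""")

# Hypothetical warehouse target, already at the denormalized analytical grain.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_revenue (month TEXT, category TEXT, "
                  "revenue REAL, unique_customers INTEGER)")

# Extract: read from the source (in production, via a replica or CDC to avoid load).
src_rows = oltp.execute(
    "SELECT order_date, category, amount, customer_id FROM orders").fetchall()

# Transform: aggregate to month x category.
agg = {}
for order_date, category, amount, customer_id in src_rows:
    key = (order_date[:7], category)
    total, customers = agg.setdefault(key, (0.0, set()))
    agg[key] = (total + amount, customers | {customer_id})

# Load: bulk insert the denormalized summary.
warehouse.executemany(
    "INSERT INTO fact_revenue VALUES (?, ?, ?, ?)",
    [(m, c, rev, len(cust)) for (m, c), (rev, cust) in agg.items()],
)
warehouse.commit()
print(warehouse.execute(
    "SELECT * FROM fact_revenue ORDER BY month, category").fetchall())
```

Stage 4 then runs entirely against `fact_revenue`, so heavy analytical queries never touch the operational `orders` table.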
Key Trade-Offs
OLTP vs OLAP in One Table
| Dimension | OLTP | OLAP |
|---|---|---|
| Consistency Model | Strict (ACID) | Eventual (batch-loaded) |
| Write Performance | Fast (optimized) | Slow (batch-oriented) |
| Read Performance | Moderate (normalized) | Fast (denormalized/columnar) |
| Storage Model | Row-oriented | Columnar |
| Data Structure | Normalized | Star/Snowflake schema |
| Historical Data | Recent (hot); older data archived | Years (full history) |
| Cost | Provisioned for peak transaction load | Pay per storage/compute |
| Examples | PostgreSQL, MySQL, MongoDB | BigQuery, Snowflake, Databricks |
Why Separate Data Warehouses?
Why Not Just Add Cache/Replicas?
❌ Doesn't work well because:
- Analytical queries are fundamentally different (GROUP BY, aggregations, full table scans)
- Normalized schema hurts analytical query speed despite caching
- Maintaining consistency between normalized and denormalized copies creates complexity
✅ Separate OLAP systems work because:
- Denormalization at rest: joins are resolved once at load time, not at every query
- Columnar storage: 10–100x smaller, selective column reads, vectorized operations
- Parallel execution: Distributes large scans across servers
- Cost isolation: Analytical load doesn't consume operational capacity or budget
Why Not Just Use Data Lake (Raw Parquet Files)?
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Schema | Schema-on-read (may be messy) | Schema-on-write (strictly validated) |
| Query Performance | Variable (depends on quality) | Consistent (managed schema) |
| Governance | Self-serve but chaotic | Controlled, auditable |
| For Analytics | ⚠️ Workable, but every consumer debugs data quality | ✅ Purpose-built |
| For Operational BI | ❌ Not reliable | ✅ Single source of truth |
Modern best practice: Combine both via Medallion Architecture
- Bronze layer: raw data lake (dump everything)
- Silver layer: cleansed, business-consistent layer
- Gold layer: business-ready tables (warehouse)
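A minimal sketch of the three layers, with invented payloads: Bronze keeps raw records exactly as ingested, Silver applies typing and validation (dropping or quarantining bad rows), and Gold produces a business-ready aggregate:

```python
# Bronze: raw events as ingested, messy values included (hypothetical payloads).
bronze = [
    {"order_id": "1", "amount": "20.5", "country": "us"},
    {"order_id": "2", "amount": "bad",  "country": "US"},  # will be dropped in Silver
    {"order_id": "3", "amount": "9.5",  "country": "US"},
]

# Silver: validated, typed, standardized; invalid records filtered out
# (a real pipeline would route them to a quarantine table instead).
silver = []
for rec in bronze:
    try:
        silver.append({"order_id": int(rec["order_id"]),
                       "amount": float(rec["amount"]),
                       "country": rec["country"].upper()})
    except ValueError:
        pass

# Gold: business-ready aggregate (revenue by country).
gold = {}
for rec in silver:
    gold[rec["country"]] = gold.get(rec["country"], 0.0) + rec["amount"]

print(gold)  # {'US': 30.0}
```

Each layer only ever reads from the one below it, so a bad transformation in Gold can always be rebuilt from Silver, and Silver from the untouched Bronze copy.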
Mental Model: The Three-Layer Data Architecture
graph TD
A["🔴 OPERATIONAL LAYER<br/>(OLTP)<br/>MySQL, PostgreSQL, MongoDB"]
B["🟡 INTEGRATION LAYER<br/>(ETL/ELT)<br/>Airflow, Dataflow, Spark"]
C["🔵 ANALYTICAL LAYER<br/>(OLAP)<br/>BigQuery, Snowflake, Databricks"]
D["📊 CONSUMPTION LAYER<br/>BI, Data Science, Dashboards"]
A -->|"Extract: minimal impact"| B
B -->|"Transform & Load:<br/>denormalize, aggregate"| C
C -->|"Optimized fast queries"| D
Why This Matters
- Isolation: Analytical work doesn't degrade order processing
- Autonomy: Analytics teams own their data model independently
- Asynchrony: Latency between operational reality and analytical view is acceptable
- Scalability: Each layer scaled independently (small OLTP, large OLAP)
Summary: When to Use Each
Use OLTP when:
- You need strict consistency for financial transactions
- Writes/updates are frequent
- Low-latency response is critical
- Data volume is moderate
Use OLAP when:
- You need to analyze historical trends
- Reads vastly outnumber writes
- Complex multi-table aggregations are common
- Data volume is large and growing
- Latency (seconds/minutes) is acceptable
Use Both when:
- Running a real business with operational + analytical needs (almost always at scale)
- You need separation of concerns and cost isolation
Deep-Dives
→ Deep Dive: Data Warehouse Architecture — Move to logical design: dimensional models, schemas, and why they matter.
→ Deep Dive: Medallion Architecture — Learn the modern pattern for organizing warehouse data in layers (Bronze/Silver/Gold).
Key Takeaways
✅ OLTP and OLAP are fundamentally different access patterns; separating them is essential at scale.
✅ Data warehouses are denormalized, columnar, and optimized for analytical queries — not a 1:1 copy of operational data.
✅ The separation introduces latency and complexity (ETL/ELT) but enables autonomy and scalability.
✅ Modern architectures (medallion/lakehouse) blend both worlds: raw data lakes (flexibility) + warehouse layers (governance).
✅ Understanding this trade-off is the foundation of any data warehouse architecture discussion.