01 · Foundations: OLTP vs OLAP, Systems Taxonomy, Why Data Warehouses

Level: Beginner · Time to read: 20 min

After reading: You'll understand why data warehouses exist as distinct systems separate from operational databases.


The Fundamental Problem: Operational Databases vs. Analytical Queries

Imagine you run an e-commerce platform:

  • Operational need: "Process this order now" (fast writes, immediate consistency)
  • Analytical need: "What were our top 100 products last quarter by region?" (large scans, eventual consistency okay)

Attempt 1: Use the same database for both.

  • ❌ Operational queries slow down when analyzing 2 years of historical data
  • ❌ Complex analytical queries lock rows for minutes, blocking order processing
  • ❌ Denormalization helps analysis but breaks operational coherence

Attempt 2: Separate systems — OLTP for operations, OLAP for analytics.

  • ✅ Each system optimized for its access pattern
  • ✅ No contention between transactional and analytical workloads
  • ⚠️ Adds complexity: ETL/ELT, data consistency, latency

This separation is one of the most important architectural decisions in data-intensive organizations.


OLTP: Online Transaction Processing

Purpose: Handle real-time, transactional workloads.

Characteristics

| Aspect | Detail |
| --- | --- |
| Access Pattern | Small, frequent writes and updates |
| Data Model | Normalized (3NF or BCNF) to minimize anomalies |
| Key Constraint | ACID compliance (Atomicity, Consistency, Isolation, Durability) |
| Latency Target | Milliseconds to seconds |
| Storage | Row-oriented (InnoDB, PostgreSQL, Oracle) |
| Index Strategy | B-tree indexes on primary keys, foreign keys, frequently filtered columns |
| Scalability Strategy | Vertical scaling, replication for HA, sharding for partitioning |
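The B-tree index strategy can be seen in action with SQLite as a stand-in OLTP engine (the table and index names here are illustrative, not from the text):

```python
import sqlite3

# Sketch of the B-tree index strategy above, using SQLite.
# EXPLAIN QUERY PLAN shows whether a lookup uses the index
# or falls back to a full table scan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchone()
print(plan[-1])  # mentions idx_orders_customer: a point lookup, not a scan
```

This is exactly the access pattern OLTP engines optimize for: the planner jumps straight to the matching rows through the index instead of scanning the table.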

What OLTP Does Well

Order Processing System:
├── INSERT new_order (1 row)
├── UPDATE inventory (multiple rows, atomic)
├── INSERT payment_transaction (1 row)
└── Commit or rollback entire transaction
  • Strong consistency: All writes immediately visible to reads
  • Low latency writes: Optimized for INSERT/UPDATE on single or small row sets
  • Referential integrity: Foreign keys prevent invalid states
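The order-processing transaction above can be sketched with Python's built-in sqlite3 module standing in for a production OLTP engine; the schema and the "payment declined" failure are invented for illustration:

```python
import sqlite3

# Atomic commit-or-rollback, as in the order-processing tree above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, stock INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
    INSERT INTO inventory VALUES (1, 10);
""")

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO orders (product_id, qty) VALUES (1, 3)")
        conn.execute("UPDATE inventory SET stock = stock - 3 WHERE product_id = 1")
        raise RuntimeError("payment declined")  # simulate a failure mid-transaction
except RuntimeError:
    pass

stock = conn.execute("SELECT stock FROM inventory WHERE product_id = 1").fetchone()[0]
print(stock)  # still 10: the partial writes were rolled back atomically
```

The INSERT and UPDATE either both become visible or neither does; that all-or-nothing guarantee is the "Commit or rollback entire transaction" step in the tree.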

Why OLTP Struggles with Analytics

-- Analytical query on OLTP system: 😱 SLOW
SELECT
  product_category,
  EXTRACT(QUARTER FROM order_date) AS order_quarter,
  SUM(amount) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2024-12-31'
GROUP BY 1, 2
ORDER BY 3 DESC;
  • Normalization penalty: Query needs to JOIN 5+ tables
  • Row-oriented storage: Fetches entire rows even if you only need 2 columns from millions of rows
  • Locking contention: Heavy analytical query blocks transactional writes
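The row-storage penalty can be made concrete with a toy contrast between the two layouts (pure Python; real engines such as Parquet or BigQuery do this at the storage level with compression and vectorized execution):

```python
# Row store: each record is kept together, so reading two columns
# still drags the wide "note" payload through memory.
rows = [
    {"order_id": i, "customer_id": i % 100, "amount": 10.0, "note": "x" * 50}
    for i in range(1000)
]

# Column store: one array per column; a query touching only two
# columns never reads the "note" payload at all.
columns = {
    "amount": [r["amount"] for r in rows],
    "customer_id": [r["customer_id"] for r in rows],
}

revenue = sum(columns["amount"])            # scans one contiguous array
uniques = len(set(columns["customer_id"]))  # scans another; nothing else
print(revenue, uniques)  # 10000.0 100
```

The analytical query above needs only `amount` and `customer_id`; in a columnar layout those are the only bytes read, which is the core of the OLAP storage advantage discussed next.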

OLAP: Online Analytical Processing

Purpose: Enable fast, complex analytical queries on historical data.

Characteristics

| Aspect | Detail |
| --- | --- |
| Access Pattern | Large, infrequent reads (batch or interactive) |
| Data Model | Denormalized (star/snowflake schema) to minimize joins |
| Key Constraint | Eventual consistency acceptable; historical accuracy required |
| Latency Target | Seconds to minutes (sometimes hours for batch) |
| Storage | Columnar (Parquet, ORC, BigQuery) for compression and selective column reads |
| Index Strategy | Partitioning by date/region, clustering by frequently filtered dimensions |
| Scalability Strategy | Horizontal scaling via distributed/parallel query execution |

What OLAP Does Well

Analytics System:
├── Bulk INSERT from ETL (millions of rows)
├── MERGE into SCD Type 2 dimension (new row per change, preserving history)
├── Complex analytical SELECT (aggregating months of history)
└── No transactional isolation needed
  • Columnar compression: 10–100x compression vs. uncompressed row storage
  • Fast aggregations: Processes only needed columns, not entire rows
  • Parallelization: Distributes query execution across servers
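Date partitioning, the OLAP "index strategy" from the table above, can be sketched in a few lines (the month buckets and dummy amounts are invented for illustration):

```python
from collections import defaultdict

# Rows are bucketed by month, so a quarter-long query scans only
# 3 of the 24 partitions instead of the whole fact table.
partitions = defaultdict(list)  # partition key -> list of amounts
for month in range(1, 25):      # 24 months of dummy data
    key = f"2023-{month:02d}" if month <= 12 else f"2024-{month - 12:02d}"
    partitions[key] = [float(month)] * 100  # 100 dummy fact rows per month

wanted = ["2024-01", "2024-02", "2024-03"]  # Q1 2024 only
q1_revenue = sum(sum(partitions[k]) for k in wanted)
scanned = sum(len(partitions[k]) for k in wanted)
print(q1_revenue, scanned)  # 4200.0 300 — only 300 of 2400 rows touched
```

Warehouses apply the same pruning automatically: a `WHERE order_date BETWEEN ...` filter on a date-partitioned table skips every partition outside the range before any data is read.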

Why OLAP Struggles with Transactions

-- Transactional operation on OLAP system: 😱 SLOW OR UNSUPPORTED
BEGIN TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE id = 123;
UPDATE account SET balance = balance + 100 WHERE id = 456;
COMMIT;
  • Eventual consistency: Another query might see the balances in an intermediate state
  • Slow updates: Columnar updates require rewriting entire column segments
  • Weak transactional guarantees: Many distributed analytical engines trade strict consistency for availability and throughput

Architecture: Separation-of-Concerns Pattern

graph TB
    A[Operational Data Sources<br/>Web App, Microservices, APIs]
    B[OLTP Database<br/>PostgreSQL, MySQL, MongoDB]
    C[ETL/ELT Pipeline<br/>Airflow, dbt, Dataflow]
    D[Data Warehouse<br/>BigQuery, Snowflake, Databricks]
    E[OLAP Queries]
    F[Analytics & BI<br/>Tableau, Looker, Jupyter]

    A -->|Real-time writes| B
    B -->|Daily/Hourly Extract| C
    C -->|Transform & Load| D
    D -->|SQL/DataFrame API| E
    E -->|Serve insights| F

Data Flow

| Stage | System | Responsibility |
| --- | --- | --- |
| Source | Operational systems | Generate transactional data (orders, clicks, logs) |
| Stage 1: Extract | ETL/ELT | Read from OLTP without impacting production |
| Stage 2: Transform | ETL/ELT or warehouse | Join, aggregate, apply business logic, handle quality |
| Stage 3: Load | OLAP warehouse | Store in optimized dimensional/analytical structure |
| Stage 4: Query | Analytical tools | Run complex queries for BI, data science, reporting |
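The four stages can be walked end to end in miniature, using two SQLite databases as stand-ins for the OLTP source and the OLAP warehouse (the schema and figures are invented for illustration):

```python
import sqlite3

# OLTP source: normalized, transactional
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, category TEXT, amount REAL);
    INSERT INTO orders (category, amount) VALUES
        ('books', 20.0), ('books', 30.0), ('toys', 15.0);
""")

# Stage 1 - Extract: read from the source without long-running locks
rows = oltp.execute("SELECT category, amount FROM orders").fetchall()

# Stage 2 - Transform: aggregate into an analytical shape
totals = {}
for category, amount in rows:
    totals[category] = totals.get(category, 0.0) + amount

# Stage 3 - Load: write the denormalized result into the warehouse
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE revenue_by_category (category TEXT, revenue REAL)")
dwh.executemany("INSERT INTO revenue_by_category VALUES (?, ?)", totals.items())

# Stage 4 - Query: analysts hit the warehouse, never the OLTP system
result = dwh.execute(
    "SELECT category, revenue FROM revenue_by_category ORDER BY revenue DESC"
).fetchall()
print(result)  # [('books', 50.0), ('toys', 15.0)]
```

In production the transform step runs in an orchestrator (Airflow, dbt) or inside the warehouse itself, but the shape of the flow is the same: extract, reshape, load, query.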

Key Trade-Offs

OLTP vs OLAP in One Table

| Dimension | OLTP | OLAP |
| --- | --- | --- |
| Consistency Model | Strict (ACID) | Relaxed (eventual) |
| Write Performance | Fast (optimized) | Slow (batch-oriented) |
| Read Performance | Moderate (normalized) | Fast (denormalized/columnar) |
| Storage Model | Row-oriented | Columnar |
| Data Structure | Normalized | Star/snowflake schema |
| Historical Data | Recent (hot); older data archived | Years (full history) |
| Cost | Pay per transaction | Pay per storage/compute |
| Examples | PostgreSQL, MySQL, MongoDB | BigQuery, Snowflake, Databricks |

Why Separate Data Warehouses?

Why Not Just Add Cache/Replicas?

❌ Doesn't work well because:

  • Analytical queries are fundamentally different (GROUP BY, aggregations, full table scans)
  • Normalized schema hurts analytical query speed despite caching
  • Maintaining consistency between normalized and denormalized copies creates complexity

✅ Separate OLAP systems work because:

  • Denormalization at rest: No overhead of maintaining normalized form
  • Columnar storage: 10–100x smaller, selective column reads, vectorized operations
  • Parallel execution: Distributes large scans across servers
  • Cost isolation: Analytical load doesn't drive up the operational system's infrastructure costs

Why Not Just Use Data Lake (Raw Parquet Files)?

| Aspect | Data Lake | Data Warehouse |
| --- | --- | --- |
| Schema | Schema-on-read (may be messy) | Schema-on-write (strictly validated) |
| Query Performance | Variable (depends on data quality) | Consistent (managed schema) |
| Governance | Self-serve but chaotic | Controlled, auditable |
| For Analytics | ⚠️ Works if you like debugging data | ✅ Works reliably |
| For Operational BI | ❌ Not reliable | ✅ Single source of truth |

Modern best practice: Combine both via Medallion Architecture

  • Bronze layer: raw data lake (dump everything)
  • Silver layer: cleansed, business-consistent layer
  • Gold layer: business-ready tables (warehouse)
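A toy pass through the three layers makes the division of labor concrete; the field names and cleaning rules below are illustrative, not a prescribed schema:

```python
# Bronze: raw dump, as landed — duplicates, bad rows, mixed casing
bronze = [
    {"order_id": 1, "category": "Books", "amount": "20.0"},
    {"order_id": 1, "category": "Books", "amount": "20.0"},  # duplicate
    {"order_id": 2, "category": "toys",  "amount": None},    # bad amount
    {"order_id": 3, "category": "TOYS",  "amount": "15.0"},
]

# Silver: deduplicate, drop invalid rows, normalize types and casing
seen, silver = set(), []
for r in bronze:
    if r["order_id"] in seen or r["amount"] is None:
        continue
    seen.add(r["order_id"])
    silver.append({"order_id": r["order_id"],
                   "category": r["category"].lower(),
                   "amount": float(r["amount"])})

# Gold: business-ready aggregate — the table BI tools actually query
gold = {}
for r in silver:
    gold[r["category"]] = gold.get(r["category"], 0.0) + r["amount"]
print(gold)  # {'books': 20.0, 'toys': 15.0}
```

Note that bronze keeps everything, including the junk: if a cleaning rule turns out to be wrong, silver and gold can be rebuilt from the raw layer without re-extracting from source systems.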

Mental Model: The Three-Layer Data Architecture

graph TD
    A["🔴 OPERATIONAL LAYER<br/>(OLTP)<br/>MySQL, PostgreSQL, MongoDB"]
    B["🟡 INTEGRATION LAYER<br/>(ETL/ELT)<br/>Airflow, Dataflow, Spark"]
    C["🔵 ANALYTICAL LAYER<br/>(OLAP)<br/>BigQuery, Snowflake, Databricks"]
    D["📊 CONSUMPTION LAYER<br/>BI, Data Science, Dashboards"]

    A -->|"Extract: minimal impact"| B
    B -->|"Transform & Load:<br/>denormalize, aggregate"| C
    C -->|"Optimized fast queries"| D

Why This Matters

  • Isolation: Analytical work doesn't degrade order processing
  • Autonomy: Analytics teams own their data model independently
  • Asynchrony: Latency between operational reality and analytical view is acceptable
  • Scalability: Each layer scaled independently (small OLTP, large OLAP)

Summary: When to Use Each

Use OLTP when:

  • You need strict consistency for financial transactions
  • Writes/updates are frequent
  • Low-latency response is critical
  • Data volume is moderate

Use OLAP when:

  • You need to analyze historical trends
  • Reads vastly outnumber writes
  • Complex multi-table aggregations are common
  • Data volume is large and growing
  • Latency (seconds/minutes) is acceptable

Use Both when:

  • Running a real business with operational + analytical needs (almost always at scale)
  • You need separation of concerns and cost isolation

Deep-Dives

Deep Dive: Data Warehouse Architecture — Move to logical design: dimensional models, schemas, and why they matter.

Deep Dive: Medallion Architecture — Learn the modern pattern for organizing warehouse data in layers (Bronze/Silver/Gold).


Key Takeaways

✅ OLTP and OLAP are fundamentally different access patterns; separating them is essential at scale.

✅ Data warehouses are denormalized, columnar, and optimized for analytical queries — not a 1:1 copy of operational data.

✅ The separation introduces latency and complexity (ETL/ELT) but enables autonomy and scalability.

✅ Modern architectures (medallion/lakehouse) blend both worlds: raw data lakes (flexibility) + warehouse layers (governance).

✅ Understanding this trade-off is the foundation of any data warehouse architecture discussion.