01 · Foundations: OLTP vs OLAP, Systems Taxonomy, Why Data Warehouses
Level: Beginner · Time to read: 20 min
After reading: You'll understand why data warehouses exist as distinct systems separate from operational databases.
The Fundamental Problem: Operational Databases vs. Analytical Queries
Imagine you run an e-commerce platform:
- Operational need: "Process this order now" (fast writes, immediate consistency)
- Analytical need: "What were our top 100 products last quarter by region?" (large scans; slower reads and eventual consistency are acceptable)
Attempt 1: Use the same database for both.
- ❌ Operational queries slow down when analyzing 2 years of historical data
- ❌ Complex analytical queries lock rows for minutes, blocking order processing
- ❌ Denormalization helps analysis but introduces update anomalies on the operational side
Attempt 2: Separate systems — OLTP for operations, OLAP for analytics.
- ✅ Each system optimized for its access pattern
- ✅ No contention between transactional and analytical workloads
- ⚠️ Adds complexity: ETL/ELT, data consistency, latency
This separation is one of the most important architectural decisions in data-intensive organizations.
OLTP: Online Transaction Processing
Purpose: Handle real-time, transactional workloads.
Characteristics
| Aspect | Detail |
|---|---|
| Access Pattern | Small, frequent writes and updates |
| Data Model | Normalized (3NF or BCNF) to minimize anomalies |
| Key Constraint | ACID compliance (Atomicity, Consistency, Isolation, Durability) |
| Latency Target | Milliseconds to seconds |
| Storage | Row-oriented (InnoDB, PostgreSQL, Oracle) |
| Index Strategy | B-tree indexes on primary keys, foreign keys, frequently filtered columns |
| Scalability Strategy | Vertical scaling, replication for HA, sharding for partitioning |
What OLTP Does Well
Order Processing System:
├── INSERT new_order (1 row)
├── UPDATE inventory (multiple rows, atomic)
├── INSERT payment_transaction (1 row)
└── Commit or rollback entire transaction
- Strong consistency: All writes immediately visible to reads
- Low latency writes: Optimized for INSERT/UPDATE on single or small row sets
- Referential integrity: Foreign keys prevent invalid states
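The commit-or-rollback flow above can be sketched end to end. This is a minimal illustration using Python's built-in sqlite3 with hypothetical inventory and orders tables; a production OLTP system would be PostgreSQL or MySQL, but the ACID idea is the same:

```python
import sqlite3

# In-memory SQLite stands in for an OLTP database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, "
    "stock INTEGER NOT NULL CHECK (stock >= 0))"
)
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES (1, 5)")

def place_order(order_id, product_id, qty):
    """Insert the order and decrement stock atomically; roll back on failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, product_id, qty))
            conn.execute("UPDATE inventory SET stock = stock - ? WHERE product_id = ?",
                         (qty, product_id))
        return True
    except sqlite3.IntegrityError:
        return False

assert place_order(100, 1, 3) is True   # succeeds: stock 5 -> 2
assert place_order(101, 1, 4) is False  # violates CHECK (stock >= 0): whole txn rolled back
stock = conn.execute("SELECT stock FROM inventory WHERE product_id = 1").fetchone()[0]
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(stock, n_orders)  # 2 1
```

If any statement inside the `with conn:` block fails, the whole transaction rolls back, including the already-executed order insert, which is exactly the guarantee the order-processing flow depends on.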
Why OLTP Struggles with Analytics
-- Analytical query on OLTP system: 😱 SLOW
SELECT
  product_category,
  EXTRACT(QUARTER FROM order_date) AS order_quarter,
  SUM(amount) AS revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2024-12-31'
GROUP BY 1, 2
ORDER BY 3 DESC;
- Normalization penalty: Query needs to JOIN 5+ tables
- Row-oriented storage: Fetches entire rows even if you only need 2 columns from millions of rows
- Locking contention: Heavy analytical query blocks transactional writes
OLAP: Online Analytical Processing
Purpose: Enable fast, complex analytical queries on historical data.
Characteristics
| Aspect | Detail |
|---|---|
| Access Pattern | Large, infrequent reads (batch or interactive) |
| Data Model | Denormalized (star/snowflake schema) to minimize join depth |
| Key Constraint | Eventual consistency acceptable; historical accuracy required |
| Latency Target | Seconds to minutes (sometimes hours for batch) |
| Storage | Columnar (Parquet, ORC, BigQuery) for compression and selective column reads |
| Index Strategy | Partitioning by date/region, clustering by frequently-filtered dimensions |
| Scalability Strategy | Horizontal scaling via distributed/parallel query execution |
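To make the "partitioning by date" row concrete, here is a toy sketch of partition pruning: a query with a date filter only scans the partitions whose key falls in range. The table and values are invented for illustration:

```python
from datetime import date

# Toy date-partitioned table: rows stored in per-day buckets keyed by the partition column.
partitions = {
    date(2024, 1, 1): [("books", 20.0), ("games", 35.0)],
    date(2024, 1, 2): [("books", 15.0)],
    date(2024, 2, 1): [("toys", 50.0)],
}

def revenue_between(start, end):
    """Partition pruning: only buckets whose key falls inside the range are scanned."""
    scanned = [d for d in partitions if start <= d <= end]
    total = sum(amount for d in scanned for _, amount in partitions[d])
    return total, len(scanned)

total, parts_scanned = revenue_between(date(2024, 1, 1), date(2024, 1, 31))
assert total == 70.0 and parts_scanned == 2  # the February partition is never touched
```

Real warehouses apply the same idea at much coarser granularity: a `WHERE order_date >= ...` predicate lets the engine skip whole partitions without reading them.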
What OLAP Does Well
Analytics System:
├── Bulk INSERT from ETL (millions of rows)
├── UPDATE SCD Type 2 dimension (close old row, insert new version; preserves history)
├── Complex analytical SELECT (aggregations over weeks of data)
└── No transactional isolation needed
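The "UPDATE SCD Type 2 dimension" step deserves a concrete sketch. Under SCD Type 2, attribute changes never overwrite a row: the current version is closed out and a new version is inserted. This is a minimal in-memory illustration with a hypothetical customer dimension:

```python
from datetime import date

# Hypothetical SCD Type 2 customer dimension: every historical version is kept.
dim_customer = [
    {"customer_id": 7, "city": "Austin", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, as_of):
    """Close the current version (if changed) and append a new current row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no attribute change: nothing to version
            row["valid_to"] = as_of      # close out the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": as_of, "valid_to": None, "is_current": True})

apply_scd2(dim_customer, 7, "Denver", date(2024, 6, 1))
assert len(dim_customer) == 2                  # both versions preserved
assert dim_customer[0]["is_current"] is False  # old row closed, not overwritten
assert dim_customer[1]["city"] == "Denver"
```

Because history is append-mostly, facts recorded while the customer lived in Austin can still join to the Austin version of the row.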
- Columnar compression: 10–100x compression vs. uncompressed row storage
- Fast aggregations: Processes only needed columns, not entire rows
- Parallelization: Distributes query execution across servers
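The columnar advantages above can be shown with a toy sketch: in a column layout, an aggregation over one column never touches the others, and low-cardinality columns (like category) compress well under run-length or dictionary encoding. Illustrative data only:

```python
# Toy contrast between row-oriented and column-oriented layouts.
rows = [  # row store: each record kept together on "disk"
    {"order_id": i, "customer_id": i % 100, "category": "books", "amount": 10.0}
    for i in range(1_000)
]

# Column store: one array per column; a query touching only `amount`
# never reads order_id, customer_id, or category at all.
columns = {
    "order_id":    [r["order_id"] for r in rows],
    "customer_id": [r["customer_id"] for r in rows],
    "category":    [r["category"] for r in rows],
    "amount":      [r["amount"] for r in rows],
}

revenue_row_store = sum(r["amount"] for r in rows)  # must materialize whole rows
revenue_col_store = sum(columns["amount"])          # scans one contiguous array
assert revenue_row_store == revenue_col_store == 10_000.0
```

Real columnar formats (Parquet, ORC) add encodings and per-column statistics on top of this layout, which is where the 10-100x compression and the scan-skipping come from.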
Why OLAP Struggles with Transactions
-- Transactional operation on OLAP system: 😱 SLOW OR UNSUPPORTED
BEGIN TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE id = 123;
UPDATE account SET balance = balance + 100 WHERE id = 456;
COMMIT;
- Eventual consistency: Another query might see balances in intermediate state
- Slow updates: Columnar updates require rewriting entire column segments
- Weak transactional guarantees: many distributed analytical engines trade strict isolation for throughput and availability
Architecture: Separation-of-Concerns Pattern
graph TB
A[Operational Data Sources<br/>Web App, Microservices, APIs]
B[OLTP Database<br/>PostgreSQL, MySQL, MongoDB]
C[ETL/ELT Pipeline<br/>Airflow, dbt, Dataflow]
D[Data Warehouse<br/>BigQuery, Snowflake, Databricks]
E[OLAP Queries]
F[Analytics & BI<br/>Tableau, Looker, Jupyter]
A -->|Real-time writes| B
B -->|Daily/Hourly Extract| C
C -->|Transform & Load| D
D -->|SQL/DataFrame API| E
E -->|Serve insights| F
Data Flow
| Stage | System | Responsibility |
|---|---|---|
| Source | Operational systems | Generate transactional data (orders, clicks, logs) |
| Stage 1: Extract | ETL/ELT | Read from OLTP without impacting production |
| Stage 2: Transform | ETL/ELT or Warehouse | Join, aggregate, apply business logic, handle quality |
| Stage 3: Load | OLAP Warehouse | Store in optimized dimensional/analytical structure |
| Stage 4: Query | Analytical tools | Run complex queries for BI, data science, reporting |
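The four stages can be wired together in a minimal sketch. Here both the OLTP source and the warehouse are in-memory SQLite, and the table names and grain (month by category) are invented for illustration; real pipelines would use the tools named above (Airflow, dbt, etc.):

```python
import sqlite3

# Hypothetical OLTP source with a handful of orders.
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         category TEXT, amount REAL, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 10, 'books', 20.0, '2024-01-05'),
        (2, 11, 'games', 35.0, '2024-01-06'),
        (3, 10, 'books', 15.0, '2024-02-01');
""")

# Hypothetical warehouse target, already at the denormalized analytical grain.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_revenue (month TEXT, category TEXT, "
                  "revenue REAL, unique_customers INTEGER)")

# Extract: read from the source (in production, via a replica or CDC to avoid load).
src_rows = oltp.execute(
    "SELECT order_date, category, amount, customer_id FROM orders").fetchall()

# Transform: aggregate to month x category.
agg = {}
for order_date, category, amount, customer_id in src_rows:
    key = (order_date[:7], category)
    total, customers = agg.setdefault(key, (0.0, set()))
    agg[key] = (total + amount, customers | {customer_id})

# Load: bulk insert the denormalized summary.
warehouse.executemany(
    "INSERT INTO fact_revenue VALUES (?, ?, ?, ?)",
    [(m, c, rev, len(cust)) for (m, c), (rev, cust) in agg.items()],
)
warehouse.commit()
print(warehouse.execute(
    "SELECT * FROM fact_revenue ORDER BY month, category").fetchall())
```

Stage 4 then runs entirely against `fact_revenue`, so heavy analytical queries never touch the operational `orders` table.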
Key Trade-Offs
OLTP vs OLAP in One Table
| Dimension | OLTP | OLAP |
|---|---|---|
| Consistency Model | Strict (ACID) | Eventual (batch-loaded) |
| Write Performance | Fast (optimized) | Slow (batch-oriented) |
| Read Performance | Moderate (normalized) | Fast (denormalized/columnar) |
| Storage Model | Row-oriented | Columnar |
| Data Structure | Normalized | Star/Snowflake schema |
| Historical Data | Recent (hot); older data archived | Years (full history) |
| Cost | Provisioned for peak transaction load | Pay per storage/compute |
| Examples | PostgreSQL, MySQL, MongoDB | BigQuery, Snowflake, Databricks |
Why Separate Data Warehouses?
Why Not Just Add Cache/Replicas?
❌ Doesn't work well because:
- Analytical queries are fundamentally different (GROUP BY, aggregations, full table scans)
- Normalized schema hurts analytical query speed despite caching
- Maintaining consistency between normalized and denormalized copies creates complexity
✅ Separate OLAP systems work because:
- Denormalization at rest: joins are resolved once at load time, not at every query
- Columnar storage: 10–100x smaller, selective column reads, vectorized operations
- Parallel execution: Distributes large scans across servers
- Cost isolation: Analytical load doesn't consume operational capacity or budget
Why Not Just Use Data Lake (Raw Parquet Files)?
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Schema | Schema-on-read (may be messy) | Schema-on-write (strictly validated) |
| Query Performance | Variable (depends on quality) | Consistent (managed schema) |
| Governance | Self-serve but chaotic | Controlled, auditable |
| For Analytics | ⚠️ Workable, but every consumer debugs data quality | ✅ Purpose-built |
| For Operational BI | ❌ Not reliable | ✅ Single source of truth |
Modern best practice: Combine both via Medallion Architecture
- Bronze layer: raw data lake (dump everything)
- Silver layer: cleansed, business-consistent layer
- Gold layer: business-ready tables (warehouse)
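A minimal sketch of the three layers, with invented payloads: Bronze keeps raw records exactly as ingested, Silver applies typing and validation (dropping or quarantining bad rows), and Gold produces a business-ready aggregate:

```python
# Bronze: raw events as ingested, messy values included (hypothetical payloads).
bronze = [
    {"order_id": "1", "amount": "20.5", "country": "us"},
    {"order_id": "2", "amount": "bad",  "country": "US"},  # will be dropped in Silver
    {"order_id": "3", "amount": "9.5",  "country": "US"},
]

# Silver: validated, typed, standardized; invalid records filtered out
# (a real pipeline would route them to a quarantine table instead).
silver = []
for rec in bronze:
    try:
        silver.append({"order_id": int(rec["order_id"]),
                       "amount": float(rec["amount"]),
                       "country": rec["country"].upper()})
    except ValueError:
        pass

# Gold: business-ready aggregate (revenue by country).
gold = {}
for rec in silver:
    gold[rec["country"]] = gold.get(rec["country"], 0.0) + rec["amount"]

print(gold)  # {'US': 30.0}
```

Each layer only ever reads from the one below it, so a bad transformation in Gold can always be rebuilt from Silver, and Silver from the untouched Bronze copy.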
Mental Model: The Three-Layer Data Architecture
graph TD
A["🔴 OPERATIONAL LAYER<br/>(OLTP)<br/>MySQL, PostgreSQL, MongoDB"]
B["🟡 INTEGRATION LAYER<br/>(ETL/ELT)<br/>Airflow, Dataflow, Spark"]
C["🔵 ANALYTICAL LAYER<br/>(OLAP)<br/>BigQuery, Snowflake, Databricks"]
D["📊 CONSUMPTION LAYER<br/>BI, Data Science, Dashboards"]
A -->|"Extract: minimal impact"| B
B -->|"Transform & Load:<br/>denormalize, aggregate"| C
C -->|"Optimized fast queries"| D
Why This Matters
- Isolation: Analytical work doesn't degrade order processing
- Autonomy: Analytics teams own their data model independently
- Asynchrony: Latency between operational reality and analytical view is acceptable
- Scalability: Each layer scaled independently (small OLTP, large OLAP)
Summary: When to Use Each
Use OLTP when:
- You need strict consistency for financial transactions
- Writes/updates are frequent
- Low-latency response is critical
- Data volume is moderate
Use OLAP when:
- You need to analyze historical trends
- Reads vastly outnumber writes
- Complex multi-table aggregations are common
- Data volume is large and growing
- Latency (seconds/minutes) is acceptable
Use Both when:
- Running a real business with operational + analytical needs (almost always at scale)
- You need separation of concerns and cost isolation
Deep-Dives
→ Deep Dive: Data Warehouse Architecture — Move to logical design: dimensional models, schemas, and why they matter.
→ Deep Dive: Medallion Architecture — Learn the modern pattern for organizing warehouse data in layers (Bronze/Silver/Gold).
Key Takeaways
✅ OLTP and OLAP are fundamentally different access patterns; separating them is essential at scale.
✅ Data warehouses are denormalized, columnar, and optimized for analytical queries — not a 1:1 copy of operational data.
✅ The separation introduces latency and complexity (ETL/ELT) but enables autonomy and scalability.
✅ Modern architectures (medallion/lakehouse) blend both worlds: raw data lakes (flexibility) + warehouse layers (governance).
✅ Understanding this trade-off is the foundation of any data warehouse architecture discussion.