# CAP Theorem — Deep Dive
**Level:** Intermediate
**Pre-reading:** 01 · Architectural Foundations
## The Theorem
In a distributed data store, you can only guarantee two out of three properties simultaneously:
| Property | Definition |
|---|---|
| Consistency (C) | Every read receives the most recent write or an error |
| Availability (A) | Every request receives a non-error response (though it might not be the latest data) |
| Partition Tolerance (P) | The system continues to operate despite network partitions between nodes |
```mermaid
graph TD
    subgraph CAP Triangle
        C[Consistency]
        A[Availability]
        P[Partition Tolerance]
    end
    C --- A
    A --- P
    P --- C
```
## Why You Must Choose P
In any real distributed system, network partitions will happen. Machines fail, switches die, cables get unplugged, data centers lose connectivity. You cannot wish partitions away.
Partition tolerance is not optional. The real choice is: what do you sacrifice when a partition occurs — consistency or availability?
```mermaid
sequenceDiagram
    participant Client
    participant NodeA
    participant NodeB
    Note over NodeA,NodeB: Network partition occurs
    Client->>NodeA: Write X=5
    NodeA->>NodeA: Store X=5 locally
    NodeA--xNodeB: Cannot replicate (partition)
    Client->>NodeB: Read X
    Note over NodeB: What should NodeB return?
```
During a partition:
- CP system: NodeB returns an error (sacrifices availability to maintain consistency)
- AP system: NodeB returns stale data (sacrifices consistency to remain available)
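The decision point in the diagram can be sketched as a toy model (the `Node` class and `suspect_stale` flag are hypothetical, not a real database API): NodeA accepts the write during the partition, and NodeB answers the subsequent read according to a CP or AP policy.

```python
class Node:
    """Toy replica: a key-value store plus a staleness flag."""
    def __init__(self, name):
        self.name = name
        self.store = {"X": 1}        # both replicas start in sync with X=1
        self.suspect_stale = False   # set when replication is known to have failed

def write(node, peer, key, value, partitioned):
    node.store[key] = value          # store locally on the receiving node
    if partitioned:
        # In a real system the peer would learn this via heartbeats or leases;
        # we set the flag directly for brevity.
        peer.suspect_stale = True
    else:
        peer.store[key] = value      # normal operation: replicate to the peer

def read(node, key, mode):
    if mode == "CP" and node.suspect_stale:
        # Sacrifice availability: refuse rather than risk serving stale data.
        raise RuntimeError("unavailable: cannot guarantee latest value")
    return node.store[key]           # AP: answer with whatever we have

node_a, node_b = Node("A"), Node("B")
write(node_a, node_b, "X", 5, partitioned=True)
assert read(node_b, "X", mode="AP") == 1   # available, but stale
# read(node_b, "X", mode="CP") would raise: consistent, but unavailable
```

The same partition produces either an error or a stale value depending only on which property the replica is told to sacrifice.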
## CP Systems — Consistency over Availability
CP systems prioritize returning correct data. If they cannot guarantee the latest value, they refuse to respond.
| System | Behavior During Partition |
|---|---|
| PostgreSQL (single primary) | Writes to primary only; reads from primary or fail |
| HBase | Strongly consistent reads; unavailable while a region's server is down |
| ZooKeeper | Writes require a majority quorum; refuses service without one |
| etcd | Raft consensus; unavailable without majority |
| MongoDB (with majority write concern) | Waits for majority acknowledgment |
### When to Choose CP
| Use Case | Why Consistency Matters |
|---|---|
| Financial transactions | Can't have inconsistent account balances |
| Inventory management | Overselling causes real-world problems |
| Leader election | Only one leader must exist at a time |
| Configuration management | Stale config can cause outages |
> **CP doesn't mean always consistent.** CP means the system chooses consistency when a partition occurs. During normal operation, most CP systems are highly available; the trade-off only manifests during failures.
## AP Systems — Availability over Consistency
AP systems always respond, even if the response might be stale. They accept writes during partitions and reconcile later.
| System | Behavior During Partition |
|---|---|
| Cassandra | Writes accepted on available nodes; reconciled later |
| CouchDB | Multi-master; conflicts resolved on read |
| DynamoDB (default mode) | Eventually consistent reads; highly available |
| Riak | Vector clocks for conflict resolution |
| DNS | Cached responses; eventual propagation |
### When to Choose AP
| Use Case | Why Availability Matters |
|---|---|
| Social media feeds | Slightly stale timeline is acceptable |
| Product catalogs | Showing last-known price is better than error |
| Session stores | Better to let users in than lock them out |
| Analytics/Metrics | Approximate counts are acceptable |
## Conflict Resolution Strategies
When partitions heal, AP systems must reconcile divergent data:
| Strategy | Description |
|---|---|
| Last Write Wins (LWW) | Timestamp-based; latest write overwrites |
| Vector Clocks | Track causality; detect concurrent writes |
| CRDTs | Conflict-free data types; mathematically guaranteed merge |
| Application-level | Push conflicts to application for resolution |
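CRDTs are the only strategy in the table with a mathematically guaranteed merge. A minimal example, sketched from the standard definition, is the grow-only counter (G-Counter): each replica increments only its own slot, and merge takes the element-wise maximum, so merges commute and any order of reconciliation converges to the same value.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge by element-wise max."""
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1          # only ever touch our own slot

    def value(self):
        return sum(self.counts)            # total across all replicas

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is what guarantees convergence.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas diverge during a partition, then exchange state.
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # replica A counted 2 events
b.increment()                  # replica B counted 1 event
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3   # both converge; no increment is lost
```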
## PACELC — A More Complete Model
CAP only describes behavior during partitions. In practice, you also care about behavior during normal operation.
**PACELC:** If there's a **P**artition, choose **A**vailability or **C**onsistency; **E**lse, when running normally, choose **L**atency or **C**onsistency.
```mermaid
graph TD
    A[Network State] --> B{Partition?}
    B -->|Yes| C{Choose A or C}
    B -->|No| D{Choose L or C}
    C -->|A| E[AP: Stay available]
    C -->|C| F[CP: Reject requests]
    D -->|L| G[EL: Fast but eventual]
    D -->|C| H[EC: Slow but strong]
```
| System | During Partition (A or C) | Else (L or C) | PACELC Classification |
|---|---|---|---|
| DynamoDB | PA | EL | PA/EL — fast and available |
| Cassandra | PA | EL | PA/EL — eventual consistency |
| MongoDB | PC | EC | PC/EC — consistent but slower |
| PostgreSQL | PC | EC | PC/EC — ACID compliance |
| PNUTS (Yahoo) | PC | EL | PC/EL — consistent during partition, fast normally |
## Consistency Levels — Tunable Trade-offs
Many distributed databases let you choose consistency level per operation:
### Cassandra Consistency Levels
| Level | Description | Trade-off |
|---|---|---|
| ONE | Wait for one replica acknowledgment | Fastest; may read stale |
| QUORUM | Wait for majority (N/2 + 1) | Balanced; strongly consistent when R + W > N |
| ALL | Wait for all replicas | Slowest; highest consistency |
| LOCAL_QUORUM | Quorum within local datacenter | Low latency; cross-DC eventual |
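The QUORUM row's condition comes down to simple arithmetic: if the read quorum R and write quorum W must overlap (R + W > N), every read contacts at least one replica that saw the latest write. A small check, with level values illustrative for N = 3:

```python
def quorum_size(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1

def overlapping(n, r, w):
    """True when the read and write quorums must share at least one replica."""
    return r + w > n

N = 3
level = {"ONE": 1, "QUORUM": quorum_size(N), "ALL": N}

assert not overlapping(N, level["ONE"], level["ONE"])     # fast, may read stale
assert overlapping(N, level["QUORUM"], level["QUORUM"])   # balanced, consistent
assert overlapping(N, level["ONE"], level["ALL"])         # write ALL, read ONE
```

The last case shows the trade-off is movable: paying the cost on every write (ALL) buys cheap consistent reads (ONE), and vice versa.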
### Read-Your-Writes Consistency
A common requirement: after writing, the same client must see their own write.
```mermaid
sequenceDiagram
    participant Client
    participant WriteNode
    participant ReadNode
    Client->>WriteNode: Write X=5
    WriteNode-->>ReadNode: Async replication
    Client->>ReadNode: Read X
    Note over ReadNode: May return X=old value
```
Solutions:
- Route reads to the same node that handled the write (sticky sessions)
- Include write timestamp in request; read waits until caught up
- Use quorum reads + writes
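The second solution (carry a write version in the request) can be sketched as follows; the `Replica` class and its `applied_version` field are hypothetical, standing in for whatever replication offset a real store exposes:

```python
class Replica:
    def __init__(self):
        self.applied_version = 0   # highest write version this replica has applied
        self.data = {}

    def read(self, key, min_version):
        # Refuse (a real system might wait or redirect instead) if this
        # replica has not yet caught up to the client's last write.
        if self.applied_version < min_version:
            raise RuntimeError("replica lagging: retry or redirect")
        return self.data.get(key)

primary, replica = Replica(), Replica()

# Client writes X=5; the primary assigns version 1. Replication is async,
# so the replica has not applied it yet.
primary.data["X"], primary.applied_version = 5, 1
client_last_write = 1   # client remembers the version of its own write

# Reading from the lagging replica fails the read-your-writes check...
try:
    replica.read("X", min_version=client_last_write)
except RuntimeError:
    pass  # client retries elsewhere, or waits

# ...and succeeds once replication catches up.
replica.data["X"], replica.applied_version = 5, 1
assert replica.read("X", min_version=client_last_write) == 5
```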
## Practical CAP Trade-offs
### Pattern: Strong Consistency for Writes, Eventual for Reads
Many systems use different consistency for different operations:
| Operation | Consistency | Rationale |
|---|---|---|
| Place order | Strong (CP) | Can't lose orders or double-charge |
| View order history | Eventual (AP) | Slight delay acceptable |
| Update inventory | Strong (CP) | Prevent overselling |
| Browse products | Eventual (AP) | Stale catalog is acceptable |
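The table above amounts to a per-operation consistency policy. In a store with tunable levels this can be a simple lookup that defaults to the safe choice; the level vocabulary is borrowed from Cassandra, and the operation names are illustrative:

```python
# Per-operation consistency policy (operation names are illustrative).
POLICY = {
    "place_order": "QUORUM",       # strong: overlapping read/write quorums
    "view_order_history": "ONE",   # eventual: fast, slight staleness acceptable
    "update_inventory": "QUORUM",  # strong: prevent overselling
    "browse_products": "ONE",      # eventual: stale catalog is acceptable
}

def consistency_for(operation):
    # Unknown operations fall back to the strong level, failing safe.
    return POLICY.get(operation, "QUORUM")

assert consistency_for("browse_products") == "ONE"
assert consistency_for("refund_order") == "QUORUM"   # unlisted op: safe default
```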
### Pattern: CP for System of Record, AP for Read Replicas
```mermaid
graph LR
    W[Writes] --> P[(Primary DB - CP)]
    P -->|Async Replication| R1[(Read Replica 1)]
    P -->|Async Replication| R2[(Read Replica 2)]
    R[Reads] --> R1
    R --> R2
```
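The routing rule in the diagram reduces to: writes go to the CP primary, reads are load-balanced across the eventually consistent replicas. A sketch with placeholder endpoint names:

```python
import random

def route(operation, primary, replicas):
    if operation == "write":
        return primary               # system of record: strong consistency
    return random.choice(replicas)   # reads tolerate replication lag

# Hypothetical endpoint names, for illustration only.
assert route("write", "primary-db", ["replica-1", "replica-2"]) == "primary-db"
assert route("read", "primary-db", ["replica-1", "replica-2"]) in {"replica-1", "replica-2"}
```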
## CAP Misconceptions
| Misconception | Reality |
|---|---|
| "CAP means pick 2 of 3" | Partition tolerance isn't optional; pick C or A during partitions |
| "CA systems exist" | CA requires no partitions; impossible in distributed systems |
| "CP means no availability" | CP systems are available when there's no partition |
| "AP means no consistency" | AP systems are eventually consistent; they converge |
| "CAP applies always" | CAP is about behavior during partitions only |
## Designing for Partitions
| Strategy | Description |
|---|---|
| Detect partitions | Heartbeats, gossip protocols, failure detectors |
| Degrade gracefully | Switch to read-only mode; queue writes for later |
| Use idempotent operations | Safely retry when partition heals |
| Design for reconciliation | Application logic to merge conflicts |
| Set timeouts appropriately | Distinguish slow responses from partitions |
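The idempotent-operations row is worth making concrete: if every write carries a client-chosen operation ID, replaying it after a partition heals (or after a timeout that might not have been a failure) is harmless. A sketch with a hypothetical in-memory account:

```python
class Account:
    def __init__(self, balance=0):
        self.balance = balance
        self.applied = set()   # operation IDs already applied

    def credit(self, op_id, amount):
        if op_id in self.applied:
            return self.balance          # duplicate retry: no-op
        self.applied.add(op_id)
        self.balance += amount
        return self.balance

acct = Account()
acct.credit("op-123", 50)
acct.credit("op-123", 50)   # retried after a suspected timeout
assert acct.balance == 50    # applied once, not twice
```

The same idea underlies exactly-once semantics built on at-least-once delivery: retries are free once deduplication is the receiver's responsibility.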
### Can you have a CA distributed system?
Not in any practical sense. CA requires that network partitions never occur, which is impossible in real distributed systems. A single-node database is trivially CA (no network to partition), but any multi-node system must be prepared for partitions. In practice, "CA" systems are really CP or AP systems that haven't experienced a partition yet.
### How does DynamoDB handle the CAP trade-off?
DynamoDB is AP by default (eventually consistent reads) but offers strong consistency as an option. You can request strongly consistent reads (CP behavior) at the cost of higher latency and lower availability. For writes, DynamoDB replicates to multiple nodes and returns when a quorum acknowledges. This makes it tunable: PA/EL by default, but you can opt into PC/EC per operation.
### What consistency model would you choose for a shopping cart?
Eventual consistency (AP) with conflict resolution. A shopping cart is a good fit for CRDTs (specifically, an add-wins set). If a user adds items on two devices during a partition, both items should appear when the partition heals. LWW would lose one addition. The worst case (duplicate items) is better than lost items, and the user can easily remove duplicates.
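The add-wins set mentioned above (an OR-set) can be sketched from its standard definition: each add is tagged with a unique ID, a remove covers only the tags it has seen, and merge unions both sides, so an add concurrent with a remove survives, and adds on two partitioned devices both appear after the merge.

```python
import uuid

class ORSet:
    """Add-wins observed-remove set (OR-set) CRDT sketch."""
    def __init__(self):
        self.adds = {}       # item -> set of unique add tags
        self.removes = {}    # item -> set of tags that have been removed

    def add(self, item):
        self.adds.setdefault(item, set()).add(uuid.uuid4().hex)

    def remove(self, item):
        # Only removes the add-tags visible on this replica; a concurrent
        # add elsewhere carries a fresh tag and therefore survives.
        self.removes.setdefault(item, set()).update(self.adds.get(item, set()))

    def contents(self):
        return {item for item, tags in self.adds.items()
                if tags - self.removes.get(item, set())}

    def merge(self, other):
        for item, tags in other.adds.items():
            self.adds.setdefault(item, set()).update(tags)
        for item, tags in other.removes.items():
            self.removes.setdefault(item, set()).update(tags)

# Two devices diverge during a partition, then sync.
phone, laptop = ORSet(), ORSet()
phone.add("shoes")
laptop.add("hat")
phone.merge(laptop); laptop.merge(phone)
assert phone.contents() == laptop.contents() == {"shoes", "hat"}
```

Contrast with LWW: a timestamp-based register holding the whole cart would keep only one device's version and silently drop the other's addition.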