Scalability Patterns & Distributed Systems
Patterns and strategies for building scalable distributed systems.
Scaling Strategies
Vertical Scaling (Scale Up)
Executive Summary: Add more resources (CPU, RAM, SSD) to an existing machine. Simple, but bounded by hardware limits and creates a single point of failure.
Pros:
- Simple, no code changes
- No distributed system complexity
- ACID compliance easier
Cons:
- Hardware limits
- Single point of failure
- Expensive at high end
- Downtime for upgrades
Horizontal Scaling (Scale Out)
Executive Summary: Add more machines to the pool. Better for handling load, but requires distributed system design. Preferred for web-scale.
Pros:
- Theoretically unlimited scaling
- Better fault tolerance
- Cost-effective (commodity hardware)
- No downtime for scaling
Cons:
- Distributed system complexity
- Data consistency challenges
- Network overhead
- Code changes may be required
Distributed System Challenges
The Eight Fallacies of Distributed Computing
Executive Summary: Assumptions developers wrongly make about distributed systems. Understanding these prevents common mistakes.
- The network is reliable → Handle failures, timeouts, retries
- Latency is zero → Design for network delays
- Bandwidth is infinite → Minimize data transfer
- The network is secure → Encrypt, authenticate
- Topology doesn’t change → Handle dynamic infrastructure
- There is one administrator → Plan for multi-team ownership
- Transport cost is zero → Optimize serialization
- The network is homogeneous → Handle different protocols/systems
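The first fallacy ("the network is reliable") maps directly to a concrete coding habit: wrap remote calls in retries with exponential backoff. A minimal sketch (all names here are illustrative, not from any particular library):

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.1):
    """Retry a flaky operation with exponential backoff and jitter.

    `operation` is any zero-argument callable that raises on failure.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff with jitter avoids synchronized retry storms.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

The jitter term matters: without it, many clients that failed at the same moment retry at the same moment, re-overloading the recovering service.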
Split-Brain Problem
Executive Summary: Network partition causes nodes to believe they’re the only survivors. Multiple nodes may act as leader simultaneously.
Solutions:
- Quorum-based decisions
- Fencing tokens
- STONITH (Shoot The Other Node In The Head)
- Shared storage for state
Byzantine Fault Tolerance
Executive Summary: Handling nodes that may behave arbitrarily (malicious or buggy). More complex than crash-fault tolerance.
- Requires 3f+1 nodes to tolerate f Byzantine faults
- Practical Byzantine Fault Tolerance (PBFT) algorithm
- Used in blockchain systems
Consensus Algorithms
Paxos
Executive Summary: Classic consensus algorithm. Guarantees agreement on a single value among distributed nodes. Complex to implement correctly.
Roles:
- Proposers: Suggest values
- Acceptors: Vote on proposals
- Learners: Learn chosen value
Phases:
- Prepare: Proposer asks acceptors to promise not to accept proposals with lower numbers
- Accept: If majority promise, proposer sends accept request
- Learn: Acceptors notify learners of chosen value
Use Cases: Chubby, Spanner, Zookeeper (ZAB is Paxos variant)
Raft
Executive Summary: Consensus algorithm designed for understandability. Equivalent to (Multi-)Paxos in fault tolerance, but easier to implement correctly. Leader-based approach.
Key Concepts:
- Leader Election: One leader at a time, elected by majority
- Log Replication: Leader replicates log entries to followers
- Safety: Only a candidate whose log is at least as up-to-date as a majority's can win an election
Terms: Logical time periods, new term for each election
States: Follower → Candidate → Leader
Use Cases: etcd, Consul, CockroachDB, TiKV
| Paxos | Raft |
|---|---|
| Multi-decree complex | Simpler leader-based |
| Academic | Practical |
| Harder to implement | Well-documented |
ZAB (Zookeeper Atomic Broadcast)
Executive Summary: Paxos variant optimized for primary-backup systems. Used by Apache Zookeeper.
Phases:
- Discovery
- Synchronization
- Broadcast
Viewstamped Replication
Executive Summary: Early consensus protocol, precursor to Raft. Uses view changes for leader election.
Leader Election
Executive Summary: Process of selecting a single coordinator node. Critical for primary-backup systems, distributed locks, coordination.
Algorithms:
- Bully Algorithm: Highest ID wins
- Ring Algorithm: Token passing
- Raft/Paxos: Consensus-based election
Challenges:
- Network partitions
- Split-brain
- Byzantine failures
Tools: Zookeeper, etcd, Consul
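The Bully algorithm's outcome can be sketched in a few lines. This shows only the result ("highest live ID wins"), not the actual message exchange, in which each node challenges all higher-ID nodes and claims leadership only if none respond:

```python
def bully_elect(node_ids, alive):
    """Bully algorithm outcome: the highest-ID live node becomes leader.

    `alive` is the set of node IDs currently reachable; returns None if
    no node is alive.
    """
    live = [n for n in node_ids if n in alive]
    return max(live) if live else None
```

Note this sketch assumes a reliable view of `alive`; in reality that view is exactly what network partitions make unreliable, which is why the split-brain protections above are needed.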
Clock Synchronization
Wall Clock vs Monotonic Clock
Executive Summary: Wall clocks can jump (NTP sync), monotonic clocks only move forward. Use monotonic for measuring durations.
| Wall Clock | Monotonic Clock |
|---|---|
| Can go backwards | Always increases |
| NTP synchronized | Local to machine |
| For timestamps | For durations |
Vector Clocks
Executive Summary: Track causality in distributed systems. Each node maintains vector of counters. Detect concurrent events.
Node A: [A:1, B:0, C:0]
Node B: [A:1, B:1, C:0] (after receiving from A)
Comparison:
- V1 < V2: V1 happened-before V2
- V1 ∥ V2 (neither dominates the other): concurrent events
Use Cases: Conflict detection in eventual consistency (Dynamo, Riak)
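The comparison and merge rules above fit in a short sketch, representing a vector clock as a dict of node → counter:

```python
def vc_merge(a, b):
    """Merge two vector clocks: element-wise max over all known nodes."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a, b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened-before b
    if b_le_a:
        return "after"        # b happened-before a
    return "concurrent"       # neither dominates: potential conflict
```

The "concurrent" result is what Dynamo-style stores surface to the application (or to a resolution rule) as a conflict.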
Lamport Timestamps
Executive Summary: Simpler than vector clocks. Single counter per node. Provides partial ordering.
Rules:
- Increment counter before each event
- On send: include counter in message
- On receive: counter = max(local, received) + 1
Limitation: Cannot detect concurrent events
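The three rules above translate directly into a tiny class:

```python
class LamportClock:
    """Single logical counter per node; yields a partial order of events."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: increment the counter before the event."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the incremented counter."""
        return self.tick()

    def receive(self, msg_time):
        """On receive: counter = max(local, received) + 1."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```

Because each node keeps only one counter, two events with different timestamps can still be causally unrelated, which is the limitation noted above.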
Hybrid Logical Clocks (HLC)
Executive Summary: Combine physical time with logical counters. Causality + real-time proximity.
- Used by CockroachDB, Spanner
- Physical component for human-readable timestamps
- Logical component for ordering
Failure Detection
Heartbeat
Executive Summary: Nodes periodically send “I’m alive” messages. Simple but can have false positives during network issues.
Parameters:
- Heartbeat interval
- Timeout threshold
- Number of missed heartbeats
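The three parameters combine into a simple timeout rule; a sketch with an explicit `now` argument so it is deterministic:

```python
class HeartbeatDetector:
    """Timeout-based failure detector: a node is suspected once it has
    missed `max_missed` consecutive heartbeat intervals."""

    def __init__(self, interval, max_missed=3):
        self.interval = interval
        self.max_missed = max_missed
        self.last_seen = {}   # node -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        last = self.last_seen.get(node)
        if last is None:
            return False
        return (now - last) <= self.interval * self.max_missed
```

The false-positive risk mentioned above lives entirely in the choice of `interval * max_missed`: too short and network hiccups look like crashes; too long and real crashes go undetected.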
Phi Accrual Failure Detector
Executive Summary: Probabilistic failure detection. Outputs suspicion level (φ) rather than binary up/down.
- Adapts to network conditions
- Used by Cassandra, Akka
- Higher φ = more suspicious of failure
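A simplified φ computation, assuming heartbeat inter-arrival times are exponentially distributed with the observed mean (Cassandra's implementation uses a similar shortcut; the original paper fits a normal distribution):

```python
import math

def phi(time_since_last, mean_interval):
    """Simplified phi accrual suspicion level.

    P(a heartbeat arrives later than t) = exp(-t / mean_interval)
    phi = -log10(P), so phi grows linearly with silence.
    """
    p_later = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_later)
```

Callers compare φ against a tunable threshold (Cassandra's default is 8) rather than a hard timeout, which is what lets the detector adapt to network conditions.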
SWIM Protocol
Executive Summary: Scalable Weakly-consistent Infection-style Membership. Efficient failure detection via random probing.
How it works:
- Pick random node, send ping
- If no response, ask k other nodes to probe
- If still no response, mark suspected
- Disseminate membership via piggybacked gossip
Use Cases: HashiCorp Memberlist, Serf
Gossip Protocol
Executive Summary: Epidemic-style information dissemination. Nodes randomly exchange information with peers. Eventually all nodes converge.
Properties:
- Scalable (O(log n) rounds to converge)
- Fault tolerant
- Decentralized
- Eventually consistent
Use Cases:
- Failure detection
- Membership management
- Data dissemination
- Aggregate computation
Implementations: Cassandra, Consul, Riak
Data Replication
Synchronous vs Asynchronous
| Synchronous | Asynchronous |
|---|---|
| Wait for all replicas | Don't wait for acks |
| Stronger consistency | Higher availability |
| Higher latency | Lower latency |
| Simpler reasoning | Potential data loss |
Semi-Synchronous
Executive Summary: Wait for at least one replica acknowledgment. Balance between consistency and latency.
Chain Replication
Executive Summary: Replicas form a chain. Writes go to head, reads from tail. Strong consistency with good throughput.
Write → Head → Node2 → Node3 → Tail (Read)
Properties:
- Strong consistency
- High read throughput
- Write latency grows with chain length
- Tail handles all reads
Quorum
Executive Summary: Require majority agreement for operations. W + R > N ensures overlap between read and write sets.
Common configurations:
- W=N, R=1: Strong consistency, slow writes
- W=1, R=N: Fast writes, slow reads
- W=R=⌈(N+1)/2⌉: Balanced
Sloppy Quorum: Accept writes even if preferred nodes unavailable (with hinted handoff)
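The W + R > N overlap condition and the balanced configuration are easy to encode as a sanity check:

```python
import math

def quorum_overlap(n, w, r):
    """True iff W + R > N, i.e. every read set must intersect every
    write set, so a read sees at least one up-to-date replica."""
    return w + r > n and 0 < w <= n and 0 < r <= n

def balanced_quorum(n):
    """The balanced configuration: W = R = ceil((N+1)/2)."""
    q = math.ceil((n + 1) / 2)
    return q, q
```

For N=3, W=R=2 overlaps (2+2 > 3), while W=1, R=2 does not: a read could hit the two replicas that missed the last write.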
Anti-Entropy Protocols
Read Repair
Executive Summary: Fix inconsistencies during normal reads. Compare versions from multiple replicas, update stale ones.
Anti-Entropy (Merkle Trees)
Executive Summary: Background process to detect and fix inconsistencies. Use Merkle trees to efficiently compare large datasets.
Process:
- Build Merkle tree of data
- Compare root hashes
- If different, traverse to find divergent branches
- Only transfer/fix differing data
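The process above can be sketched as follows. The cheap root comparison is the real Merkle-tree win; for brevity the "traverse divergent branches" step is shown as a flat scan, where a real implementation recurses down the tree to touch only differing subtrees:

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root hash."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate last hash on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def diverging_leaves(a, b):
    """Compare two same-length replicas: one root comparison first,
    a descent to find differing entries only when the roots differ."""
    if merkle_root(a) == merkle_root(b):
        return []                     # identical: nothing to transfer
    return [i for i, (x, y) in enumerate(zip(a, b)) if h(x) != h(y)]
```

When the replicas agree (the common case), a single hash comparison replaces transferring the whole dataset.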
Hinted Handoff
Executive Summary: When target node is down, store write hint on available node. Deliver when target recovers.
- Improves write availability
- Temporary inconsistency acceptable
- Used by Dynamo, Cassandra
Conflict Resolution
Last-Write-Wins (LWW)
Executive Summary: Highest timestamp wins. Simple but can lose updates. Requires synchronized clocks.
Problems:
- Data loss
- Clock skew issues
- Not suitable for all use cases
Vector Clocks
Executive Summary: Track version history per replica. Detect conflicts by comparing vectors. Application resolves.
CRDTs (Conflict-free Replicated Data Types)
Executive Summary: Data structures that mathematically guarantee convergence. No coordination needed.
Types:
- G-Counter: Grow-only counter
- PN-Counter: Increment/decrement counter
- G-Set: Grow-only set
- OR-Set: Observed-Remove set
- LWW-Register: Last-writer-wins register
Use Cases:
- Collaborative editing
- Shopping carts
- Distributed counters
Implementations: Redis CRDT, Riak
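The G-Counter, the simplest CRDT in the list above, shows why convergence is guaranteed: merge is element-wise max, which is commutative, associative, and idempotent, so replicas can merge in any order, any number of times:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node; merge never loses
    increments because each node only ever raises its own slot."""

    def __init__(self):
        self.counts = {}   # node id -> that node's increment count

    def increment(self, node, amount=1):
        self.counts[node] = self.counts.get(node, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)
```

A PN-Counter is just two G-Counters (one for increments, one for decrements) with `value = increments - decrements`.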
Idempotency
Executive Summary: Operation can be applied multiple times with same effect. Critical for safe retries in distributed systems.
Techniques:
- Idempotency keys: Client-generated unique ID per operation
- Deduplication: Track processed request IDs
- Conditional updates: Compare-and-swap
Naturally idempotent:
- GET, PUT, DELETE (if same parameters)
- Set value to X
Not idempotent:
- POST (create new resource)
- Increment counter
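An idempotency-key sketch for a non-idempotent operation like charging a payment (the service and field names here are made up for illustration):

```python
class PaymentService:
    """Dedup via client-supplied idempotency keys: a replay of the same
    key returns the stored result instead of re-running the side effect."""

    def __init__(self):
        self.results = {}   # idempotency_key -> stored result
        self.charged = 0    # stand-in for the real side effect

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.results:
            return self.results[idempotency_key]   # safe retry, no double charge
        self.charged += amount                     # the non-idempotent part
        result = {"status": "charged", "amount": amount}
        self.results[idempotency_key] = result
        return result
```

In production the `results` map lives in durable storage and entries expire after a retention window; the client generates the key once per logical operation and reuses it across retries.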
Circuit Breaker
Executive Summary: Prevent cascading failures by failing fast. Stop calling failing service, allow recovery time.
States:
- Closed: Normal operation, track failures
- Open: Fail immediately, don’t call service
- Half-Open: Test if service recovered
Closed → (failures exceed threshold) → Open
Open → (timeout expires) → Half-Open
Half-Open → (success) → Closed
Half-Open → (failure) → Open
Implementation: Hystrix, Resilience4j, Polly
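The state machine above can be sketched directly; the `clock` parameter is injected so the timeout is testable:

```python
import time

class CircuitBreaker:
    """Closed -> Open after `threshold` consecutive failures; Open ->
    Half-Open once `reset_timeout` elapses; Half-Open closes on success,
    reopens on failure."""

    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # let one probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result
```

This is a minimal single-threaded sketch; production implementations add locking, rolling failure-rate windows, and per-endpoint breakers.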
Bulkhead Pattern
Executive Summary: Isolate components to contain failures. Like ship bulkheads preventing flooding.
Types:
- Thread pool isolation: Separate pools per service
- Connection pool isolation: Separate pools per dependency
- Semaphore isolation: Limit concurrent requests
Backpressure
Executive Summary: Mechanism for slow consumers to signal producers to slow down. Prevents buffer overflow and memory issues.
Strategies:
- Drop: Discard excess messages
- Buffer: Queue messages (limited)
- Block: Block producer
- Rate limit: Throttle producer
Implementations: Reactive Streams, TCP flow control
Saga Pattern
Executive Summary: Manage distributed transactions via sequence of local transactions. Each step has compensating action for rollback.
Choreography
Executive Summary: Services react to events, no central coordinator. Simpler but harder to track.
OrderService → Payment Event → PaymentService → Shipping Event → ...
Orchestration
Executive Summary: Central orchestrator directs the flow. Easier to understand but single point of failure.
Orchestrator → OrderService
Orchestrator → PaymentService
Orchestrator → ShippingService
Compensation: Each step defines undo action
- CreateOrder → CancelOrder
- ChargePayment → RefundPayment
- ReserveInventory → ReleaseInventory
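The orchestration-with-compensation flow can be sketched as a list of (action, compensation) pairs; on failure, completed steps are undone in reverse order:

```python
def run_saga(steps):
    """Execute (action, compensation) pairs in order; on any failure,
    run the compensations of already-completed steps in reverse."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception:
        for compensation in reversed(done):
            compensation()   # best-effort undo of earlier local commits
        return False
    return True
```

Note the compensations are themselves just local transactions, so they can also fail; real saga frameworks persist progress and retry compensations until they succeed.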
Event Sourcing
Executive Summary: Store all changes as immutable events instead of current state. Derive state by replaying events.
Benefits:
- Complete audit trail
- Time travel (reconstruct any past state)
- Event replay for debugging
- Natural fit for CQRS
Challenges:
- Event schema evolution
- Large event stores
- Complexity
Event Store: Append-only log
- EventStoreDB
- Kafka as event store
CQRS (Command Query Responsibility Segregation)
Executive Summary: Separate read and write models. Optimize each independently. Often paired with Event Sourcing.
Write Path: Command → Aggregate → Event → Event Store
Read Path: Event Store → Projector → Read DB → Query
Benefits:
- Optimize reads and writes independently
- Different scaling strategies
- Simpler models
Use Cases:
- High read/write ratio difference
- Complex domain models
- Event-driven architectures
Outbox Pattern
Executive Summary: Ensure atomicity between database update and event publishing. Write event to outbox table in same transaction.
1. Begin Transaction
2. Update business data
3. Insert event into Outbox table
4. Commit Transaction
5. Background process publishes from Outbox
6. Delete from Outbox after publish confirmed
Benefits:
- Guaranteed delivery
- At-least-once semantics
- No two-phase commit needed
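A sketch of the pattern using an in-memory SQLite database as a stand-in for the service's real database. The key point is step 2 and 3 committing in one transaction; marking rows as published (rather than deleting them, as in the steps above) is a common variant:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, item):
    with db:  # one transaction: both inserts commit or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        db.execute("INSERT INTO outbox (event) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

def relay(publish):
    """Background relay: publish unsent events, then mark them sent.
    A crash between publish and mark causes a re-publish, hence
    at-least-once semantics."""
    rows = db.execute("SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(json.loads(event))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
```

Because the relay may republish, downstream consumers need the idempotency techniques covered earlier.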
Change Data Capture (CDC)
Executive Summary: Capture database changes as events. Stream to message queue or other systems.
Methods:
- Log-based: Read database transaction log (preferred)
- Trigger-based: Database triggers capture changes
- Timestamp-based: Poll for changed rows
Tools: Debezium, Maxwell, AWS DMS
Use Cases:
- Event streaming from legacy systems
- Data replication
- Cache invalidation
- Audit logging
Service Mesh
Executive Summary: Infrastructure layer handling service-to-service communication. Provides observability, security, reliability without code changes.
Features:
- Load balancing
- Service discovery
- Circuit breaking
- Mutual TLS
- Observability
- Traffic management
Components:
- Data Plane: Sidecar proxies (Envoy)
- Control Plane: Configuration, policy
Implementations: Istio, Linkerd, Consul Connect
Sidecar Pattern
Executive Summary: Deploy helper container alongside main container. Handles cross-cutting concerns without modifying application.
Use Cases:
- Logging agents
- Service mesh proxy
- Configuration refresh
- Security
Blue-Green Deployment
Executive Summary: Run two identical production environments. Switch traffic after testing new version.
Blue (current) ←── Traffic
Green (new) ←── Test
After validation: Switch traffic to Green
Benefits: Instant rollback, zero downtime
Canary Deployment
Executive Summary: Gradually roll out to small percentage of users first. Monitor, then expand or rollback.
v1: 95% traffic
v2: 5% traffic (canary)
Monitor metrics...
v1: 50% → v2: 50%
v1: 0% → v2: 100%
Feature Flags
Executive Summary: Control feature availability without deployment. Enable/disable features dynamically.
Use Cases:
- Gradual rollout
- A/B testing
- Kill switches
- Beta features
Tools: LaunchDarkly, Split, Flagsmith
Back-of-Envelope Calculations
Common Numbers
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Main memory reference: 100 ns
SSD random read: 150 μs
HDD seek: 10 ms
Packet roundtrip (same datacenter): 500 μs
Packet roundtrip (CA to Netherlands): 150 ms
Storage
1 byte = 8 bits
1 KB = 1,000 bytes
1 MB = 1,000 KB
1 GB = 1,000 MB
1 TB = 1,000 GB
ASCII char: 1 byte
Unicode char (UTF-8): 1-4 bytes
UUID: 16 bytes
Timestamp: 8 bytes
Scale
1 million = 10^6
1 billion = 10^9
Seconds in day: 86,400 ≈ 10^5
Seconds in year: 31,536,000 ≈ 3 × 10^7
QPS Estimates
Daily Active Users (DAU): X
Avg requests per user per day: Y
QPS = (X × Y) / 86400
Peak QPS ≈ 2-3 × average QPS
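A worked example of the formula above, with made-up numbers (10M DAU, 20 requests per user per day, peak factor 2.5 as the midpoint of the 2-3x rule of thumb):

```python
dau = 10_000_000          # daily active users (X)
requests_per_user = 20    # avg requests per user per day (Y)
seconds_per_day = 86_400

avg_qps = dau * requests_per_user / seconds_per_day   # ≈ 2,315
peak_qps = 2.5 * avg_qps                              # ≈ 5,787
```

In an interview, round aggressively (2e8 requests/day over ~1e5 seconds ≈ 2,000 QPS); the order of magnitude is what drives the design.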
System Design Interview Framework
1. Requirements Clarification (3-5 min)
- Functional requirements
- Non-functional requirements
- Scale estimates
- Constraints and assumptions
2. Back-of-Envelope Estimation (3-5 min)
- QPS
- Storage
- Bandwidth
- Memory (cache)
3. High-Level Design (10-15 min)
- Core components
- Data flow
- APIs
- Database schema
4. Deep Dive (15-20 min)
- Scaling bottlenecks
- Database choices
- Caching strategies
- Failure handling
5. Wrap Up (3-5 min)
- Trade-offs discussed
- Future improvements
- Edge cases