Scalability Patterns & Distributed Systems
Patterns and strategies for building scalable distributed systems.
Scaling Strategies
Vertical Scaling (Scale Up)
Executive Summary: Add more resources (CPU, RAM, SSD) to an existing machine. Simple, but bounded by hardware limits and creates a single point of failure.
Pros:
- Simple, no code changes
- No distributed system complexity
- ACID compliance easier
Cons:
- Hardware limits
- Single point of failure
- Expensive at high end
- Downtime for upgrades
Horizontal Scaling (Scale Out)
Executive Summary: Add more machines to the pool. Better for handling load, but requires distributed system design. Preferred for web-scale.
Pros:
- Theoretically unlimited scaling
- Better fault tolerance
- Cost-effective (commodity hardware)
- No downtime for scaling
Cons:
- Distributed system complexity
- Data consistency challenges
- Network overhead
- Code changes may be required
Distributed System Challenges
The Eight Fallacies of Distributed Computing
Executive Summary: Assumptions developers wrongly make about distributed systems. Understanding these prevents common mistakes.
- The network is reliable → Handle failures, timeouts, retries
- Latency is zero → Design for network delays
- Bandwidth is infinite → Minimize data transfer
- The network is secure → Encrypt, authenticate
- Topology doesn’t change → Handle dynamic infrastructure
- There is one administrator → Plan for multi-team ownership
- Transport cost is zero → Optimize serialization
- The network is homogeneous → Handle different protocols/systems
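The first fallacy ("the network is reliable") maps directly to a concrete coding habit: wrap remote calls in retries with exponential backoff. A minimal sketch (all names here are illustrative, not from any particular library):

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.1):
    """Retry a flaky operation with exponential backoff and jitter.

    `operation` is any zero-argument callable that raises on failure.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff with jitter avoids synchronized retry storms.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

The jitter term matters: without it, many clients that failed at the same moment retry at the same moment, re-overloading the recovering service.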
Split-Brain Problem
Executive Summary: Network partition causes nodes to believe they’re the only survivors. Multiple nodes may act as leader simultaneously.
Solutions:
- Quorum-based decisions
- Fencing tokens
- STONITH (Shoot The Other Node In The Head)
- Shared storage for state
Byzantine Fault Tolerance
Executive Summary: Handling nodes that may behave arbitrarily (malicious or buggy). More complex than crash-fault tolerance.
- Requires 3f+1 nodes to tolerate f Byzantine faults
- Practical Byzantine Fault Tolerance (PBFT) algorithm
- Used in blockchain systems
Consensus Algorithms
Paxos
Executive Summary: Classic consensus algorithm. Guarantees agreement on a single value among distributed nodes. Complex to implement correctly.
Roles:
- Proposers: Suggest values
- Acceptors: Vote on proposals
- Learners: Learn chosen value
Phases:
- Prepare: Proposer asks acceptors to promise not to accept proposals with lower numbers
- Accept: If majority promise, proposer sends accept request
- Learn: Acceptors notify learners of chosen value
Use Cases: Chubby, Spanner, Zookeeper (ZAB is Paxos variant)
Raft
Executive Summary: Consensus algorithm designed for understandability. Equivalent to (Multi-)Paxos in fault tolerance, but easier to implement correctly. Leader-based approach.
Key Concepts:
- Leader Election: One leader at a time, elected by majority
- Log Replication: Leader replicates log entries to followers
- Safety: Only a candidate whose log is at least as up-to-date as a majority's can win an election
Terms: Logical time periods, new term for each election
States: Follower → Candidate → Leader
Use Cases: etcd, Consul, CockroachDB, TiKV
| Paxos | Raft |
|---|---|
| Multi-decree complex | Simpler leader-based |
| Academic | Practical |
| Harder to implement | Well-documented |
ZAB (Zookeeper Atomic Broadcast)
Executive Summary: Paxos variant optimized for primary-backup systems. Used by Apache Zookeeper.
Phases:
- Discovery
- Synchronization
- Broadcast
Viewstamped Replication
Executive Summary: Early consensus protocol, precursor to Raft. Uses view changes for leader election.
Leader Election
Executive Summary: Process of selecting a single coordinator node. Critical for primary-backup systems, distributed locks, coordination.
Algorithms:
- Bully Algorithm: Highest ID wins
- Ring Algorithm: Token passing
- Raft/Paxos: Consensus-based election
Challenges:
- Network partitions
- Split-brain
- Byzantine failures
Tools: Zookeeper, etcd, Consul
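The Bully algorithm's outcome can be sketched in a few lines. This shows only the result ("highest live ID wins"), not the actual message exchange, in which each node challenges all higher-ID nodes and claims leadership only if none respond:

```python
def bully_elect(node_ids, alive):
    """Bully algorithm outcome: the highest-ID live node becomes leader.

    `alive` is the set of node IDs currently reachable; returns None if
    no node is alive.
    """
    live = [n for n in node_ids if n in alive]
    return max(live) if live else None
```

Note this sketch assumes a reliable view of `alive`; in reality that view is exactly what network partitions make unreliable, which is why the split-brain protections above are needed.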
Clock Synchronization
Wall Clock vs Monotonic Clock
Executive Summary: Wall clocks can jump (NTP sync), monotonic clocks only move forward. Use monotonic for measuring durations.
| Wall Clock | Monotonic Clock |
|---|---|
| Can go backwards | Always increases |
| NTP synchronized | Local to machine |
| For timestamps | For durations |
Vector Clocks
Executive Summary: Track causality in distributed systems. Each node maintains vector of counters. Detect concurrent events.
Node A: [A:1, B:0, C:0]
Node B: [A:1, B:1, C:0] (after receiving from A)
Comparison:
- V1 < V2: V1 happened-before V2
- V1 ∥ V2 (neither dominates the other): concurrent events
Use Cases: Conflict detection in eventual consistency (Dynamo, Riak)
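The comparison and merge rules above fit in a short sketch, representing a vector clock as a dict of node → counter:

```python
def vc_merge(a, b):
    """Merge two vector clocks: element-wise max over all known nodes."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a, b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened-before b
    if b_le_a:
        return "after"        # b happened-before a
    return "concurrent"       # neither dominates: potential conflict
```

The "concurrent" result is what Dynamo-style stores surface to the application (or to a resolution rule) as a conflict.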
Lamport Timestamps
Executive Summary: Simpler than vector clocks. Single counter per node. Provides partial ordering.
Rules:
- Increment counter before each event
- On send: include counter in message
- On receive: counter = max(local, received) + 1
Limitation: Cannot detect concurrent events
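The three rules above translate directly into a tiny class:

```python
class LamportClock:
    """Single logical counter per node; yields a partial order of events."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: increment the counter before the event."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the incremented counter."""
        return self.tick()

    def receive(self, msg_time):
        """On receive: counter = max(local, received) + 1."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```

Because each node keeps only one counter, two events with different timestamps can still be causally unrelated, which is the limitation noted above.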
Hybrid Logical Clocks (HLC)
Executive Summary: Combine physical time with logical counters. Causality + real-time proximity.
- Used by CockroachDB, Spanner
- Physical component for human-readable timestamps
- Logical component for ordering
Failure Detection
Heartbeat
Executive Summary: Nodes periodically send “I’m alive” messages. Simple but can have false positives during network issues.
Parameters:
- Heartbeat interval
- Timeout threshold
- Number of missed heartbeats
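The three parameters combine into a simple timeout rule; a sketch with an explicit `now` argument so it is deterministic:

```python
class HeartbeatDetector:
    """Timeout-based failure detector: a node is suspected once it has
    missed `max_missed` consecutive heartbeat intervals."""

    def __init__(self, interval, max_missed=3):
        self.interval = interval
        self.max_missed = max_missed
        self.last_seen = {}   # node -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        last = self.last_seen.get(node)
        if last is None:
            return False
        return (now - last) <= self.interval * self.max_missed
```

The false-positive risk mentioned above lives entirely in the choice of `interval * max_missed`: too short and network hiccups look like crashes; too long and real crashes go undetected.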
Phi Accrual Failure Detector
Executive Summary: Probabilistic failure detection. Outputs suspicion level (φ) rather than binary up/down.
- Adapts to network conditions
- Used by Cassandra, Akka
- Higher φ = more suspicious of failure
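A simplified φ computation, assuming heartbeat inter-arrival times are exponentially distributed with the observed mean (Cassandra's implementation uses a similar shortcut; the original paper fits a normal distribution):

```python
import math

def phi(time_since_last, mean_interval):
    """Simplified phi accrual suspicion level.

    P(a heartbeat arrives later than t) = exp(-t / mean_interval)
    phi = -log10(P), so phi grows linearly with silence.
    """
    p_later = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_later)
```

Callers compare φ against a tunable threshold (Cassandra's default is 8) rather than a hard timeout, which is what lets the detector adapt to network conditions.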
SWIM Protocol
Executive Summary: Scalable Weakly-consistent Infection-style Membership. Efficient failure detection via random probing.
How it works:
- Pick random node, send ping
- If no response, ask k other nodes to probe
- If still no response, mark suspected
- Disseminate membership via piggybacked gossip
Use Cases: HashiCorp Memberlist, Serf
Gossip Protocol
Executive Summary: Epidemic-style information dissemination. Nodes randomly exchange information with peers. Eventually all nodes converge.
Properties:
- Scalable (O(log n) rounds to converge)
- Fault tolerant
- Decentralized
- Eventually consistent
Use Cases:
- Failure detection
- Membership management
- Data dissemination
- Aggregate computation
Implementations: Cassandra, Consul, Riak
Data Replication
Synchronous vs Asynchronous
| Synchronous | Asynchronous |
|---|---|
| Wait for all replicas | Don't wait for acks |
| Stronger consistency | Higher availability |
| Higher latency | Lower latency |
| Simpler reasoning | Potential data loss |
Semi-Synchronous
Executive Summary: Wait for at least one replica acknowledgment. Balance between consistency and latency.
Chain Replication
Executive Summary: Replicas form a chain. Writes go to head, reads from tail. Strong consistency with good throughput.
Write → Head → Node2 → Node3 → Tail (Read)
Properties:
- Strong consistency
- High read throughput
- Write latency grows with chain length
- Tail handles all reads
Quorum
Executive Summary: Require majority agreement for operations. W + R > N ensures overlap between read and write sets.
Common configurations:
- W=N, R=1: Strong consistency, slow writes
- W=1, R=N: Fast writes, slow reads
- W=R=⌈(N+1)/2⌉: Balanced
Sloppy Quorum: Accept writes even if preferred nodes unavailable (with hinted handoff)
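The W + R > N overlap condition and the balanced configuration are easy to encode as a sanity check:

```python
import math

def quorum_overlap(n, w, r):
    """True iff W + R > N, i.e. every read set must intersect every
    write set, so a read sees at least one up-to-date replica."""
    return w + r > n and 0 < w <= n and 0 < r <= n

def balanced_quorum(n):
    """The balanced configuration: W = R = ceil((N+1)/2)."""
    q = math.ceil((n + 1) / 2)
    return q, q
```

For N=3, W=R=2 overlaps (2+2 > 3), while W=1, R=2 does not: a read could hit the two replicas that missed the last write.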
Anti-Entropy Protocols
Read Repair
Executive Summary: Fix inconsistencies during normal reads. Compare versions from multiple replicas, update stale ones.
Anti-Entropy (Merkle Trees)
Executive Summary: Background process to detect and fix inconsistencies. Use Merkle trees to efficiently compare large datasets.
Process:
- Build Merkle tree of data
- Compare root hashes
- If different, traverse to find divergent branches
- Only transfer/fix differing data
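The process above can be sketched as follows. The cheap root comparison is the real Merkle-tree win; for brevity the "traverse divergent branches" step is shown as a flat scan, where a real implementation recurses down the tree to touch only differing subtrees:

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(leaves):
    """Fold leaf hashes pairwise up to a single root hash."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate last hash on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def diverging_leaves(a, b):
    """Compare two same-length replicas: one root comparison first,
    a descent to find differing entries only when the roots differ."""
    if merkle_root(a) == merkle_root(b):
        return []                     # identical: nothing to transfer
    return [i for i, (x, y) in enumerate(zip(a, b)) if h(x) != h(y)]
```

When the replicas agree (the common case), a single hash comparison replaces transferring the whole dataset.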
Hinted Handoff
Executive Summary: When target node is down, store write hint on available node. Deliver when target recovers.
- Improves write availability
- Temporary inconsistency acceptable
- Used by Dynamo, Cassandra
Conflict Resolution
Last-Write-Wins (LWW)
Executive Summary: Highest timestamp wins. Simple but can lose updates. Requires synchronized clocks.
Problems:
- Data loss
- Clock skew issues
- Not suitable for all use cases
Vector Clocks
Executive Summary: Track version history per replica. Detect conflicts by comparing vectors. Application resolves.
CRDTs (Conflict-free Replicated Data Types)
Executive Summary: Data structures that mathematically guarantee convergence. No coordination needed.
Types:
- G-Counter: Grow-only counter
- PN-Counter: Increment/decrement counter
- G-Set: Grow-only set
- OR-Set: Observed-Remove set
- LWW-Register: Last-writer-wins register
Use Cases:
- Collaborative editing
- Shopping carts
- Distributed counters
Implementations: Redis CRDT, Riak
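The G-Counter, the simplest CRDT in the list above, shows why convergence is guaranteed: merge is element-wise max, which is commutative, associative, and idempotent, so replicas can merge in any order, any number of times:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node; merge never loses
    increments because each node only ever raises its own slot."""

    def __init__(self):
        self.counts = {}   # node id -> that node's increment count

    def increment(self, node, amount=1):
        self.counts[node] = self.counts.get(node, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)
```

A PN-Counter is just two G-Counters (one for increments, one for decrements) with `value = increments - decrements`.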
Idempotency
Executive Summary: Operation can be applied multiple times with same effect. Critical for safe retries in distributed systems.
Techniques:
- Idempotency keys: Client-generated unique ID per operation
- Deduplication: Track processed request IDs
- Conditional updates: Compare-and-swap
Naturally idempotent:
- GET, PUT, DELETE (if same parameters)
- Set value to X
Not idempotent:
- POST (create new resource)
- Increment counter
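An idempotency-key sketch for a non-idempotent operation like charging a payment (the service and field names here are made up for illustration):

```python
class PaymentService:
    """Dedup via client-supplied idempotency keys: a replay of the same
    key returns the stored result instead of re-running the side effect."""

    def __init__(self):
        self.results = {}   # idempotency_key -> stored result
        self.charged = 0    # stand-in for the real side effect

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.results:
            return self.results[idempotency_key]   # safe retry, no double charge
        self.charged += amount                     # the non-idempotent part
        result = {"status": "charged", "amount": amount}
        self.results[idempotency_key] = result
        return result
```

In production the `results` map lives in durable storage and entries expire after a retention window; the client generates the key once per logical operation and reuses it across retries.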
Circuit Breaker
Executive Summary: Prevent cascading failures by failing fast. Stop calling failing service, allow recovery time.
States:
- Closed: Normal operation, track failures
- Open: Fail immediately, don’t call service
- Half-Open: Test if service recovered
Closed → (failures exceed threshold) → Open
Open → (timeout expires) → Half-Open
Half-Open → (success) → Closed
Half-Open → (failure) → Open
Implementation: Hystrix, Resilience4j, Polly
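The state machine above can be sketched directly; the `clock` parameter is injected so the timeout is testable:

```python
import time

class CircuitBreaker:
    """Closed -> Open after `threshold` consecutive failures; Open ->
    Half-Open once `reset_timeout` elapses; Half-Open closes on success,
    reopens on failure."""

    def __init__(self, threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # let one probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result
```

This is a minimal single-threaded sketch; production implementations add locking, rolling failure-rate windows, and per-endpoint breakers.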
Bulkhead Pattern
Executive Summary: Isolate components to contain failures. Like ship bulkheads preventing flooding.
Types:
- Thread pool isolation: Separate pools per service
- Connection pool isolation: Separate pools per dependency
- Semaphore isolation: Limit concurrent requests
Backpressure
Executive Summary: Mechanism for slow consumers to signal producers to slow down. Prevents buffer overflow and memory issues.
Strategies:
- Drop: Discard excess messages
- Buffer: Queue messages (limited)
- Block: Block producer
- Rate limit: Throttle producer
Implementations: Reactive Streams, TCP flow control
Saga Pattern
Executive Summary: Manage distributed transactions via sequence of local transactions. Each step has compensating action for rollback.
Choreography
Executive Summary: Services react to events, no central coordinator. Simpler but harder to track.
OrderService → Payment Event → PaymentService → Shipping Event → ...
Orchestration
Executive Summary: Central orchestrator directs the flow. Easier to understand but single point of failure.
Orchestrator → OrderService
Orchestrator → PaymentService
Orchestrator → ShippingService
Compensation: Each step defines undo action
- CreateOrder → CancelOrder
- ChargePayment → RefundPayment
- ReserveInventory → ReleaseInventory
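The orchestration-with-compensation flow can be sketched as a list of (action, compensation) pairs; on failure, completed steps are undone in reverse order:

```python
def run_saga(steps):
    """Execute (action, compensation) pairs in order; on any failure,
    run the compensations of already-completed steps in reverse."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception:
        for compensation in reversed(done):
            compensation()   # best-effort undo of earlier local commits
        return False
    return True
```

Note the compensations are themselves just local transactions, so they can also fail; real saga frameworks persist progress and retry compensations until they succeed.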
Event Sourcing
Executive Summary: Store all changes as immutable events instead of current state. Derive state by replaying events.
Benefits:
- Complete audit trail
- Time travel (reconstruct any past state)
- Event replay for debugging
- Natural fit for CQRS
Challenges:
- Event schema evolution
- Large event stores
- Complexity
Event Store: Append-only log
- EventStoreDB
- Kafka as event store
CQRS (Command Query Responsibility Segregation)
Executive Summary: Separate read and write models. Optimize each independently. Often paired with Event Sourcing.
Write Path: Command → Aggregate → Event → Event Store
Read Path: Event Store → Projector → Read DB → Query
Benefits:
- Optimize reads and writes independently
- Different scaling strategies
- Simpler models
Use Cases:
- High read/write ratio difference
- Complex domain models
- Event-driven architectures
Outbox Pattern
Executive Summary: Ensure atomicity between database update and event publishing. Write event to outbox table in same transaction.
1. Begin Transaction
2. Update business data
3. Insert event into Outbox table
4. Commit Transaction
5. Background process publishes from Outbox
6. Delete from Outbox after publish confirmed
Benefits:
- Guaranteed delivery
- At-least-once semantics
- No two-phase commit needed
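A sketch of the pattern using an in-memory SQLite database as a stand-in for the service's real database. The key point is step 2 and 3 committing in one transaction; marking rows as published (rather than deleting them, as in the steps above) is a common variant:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, item):
    with db:  # one transaction: both inserts commit or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        db.execute("INSERT INTO outbox (event) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

def relay(publish):
    """Background relay: publish unsent events, then mark them sent.
    A crash between publish and mark causes a re-publish, hence
    at-least-once semantics."""
    rows = db.execute("SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(json.loads(event))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
```

Because the relay may republish, downstream consumers need the idempotency techniques covered earlier.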
Change Data Capture (CDC)
Executive Summary: Capture database changes as events. Stream to message queue or other systems.
Methods:
- Log-based: Read database transaction log (preferred)
- Trigger-based: Database triggers capture changes
- Timestamp-based: Poll for changed rows
Tools: Debezium, Maxwell, AWS DMS
Use Cases:
- Event streaming from legacy systems
- Data replication
- Cache invalidation
- Audit logging
Service Mesh
Executive Summary: Infrastructure layer handling service-to-service communication. Provides observability, security, reliability without code changes.
Features:
- Load balancing
- Service discovery
- Circuit breaking
- Mutual TLS
- Observability
- Traffic management
Components:
- Data Plane: Sidecar proxies (Envoy)
- Control Plane: Configuration, policy
Implementations: Istio, Linkerd, Consul Connect
Sidecar Pattern
Executive Summary: Deploy helper container alongside main container. Handles cross-cutting concerns without modifying application.
Use Cases:
- Logging agents
- Service mesh proxy
- Configuration refresh
- Security
Blue-Green Deployment
Executive Summary: Run two identical production environments. Switch traffic after testing new version.
Blue (current) ←── Traffic
Green (new) ←── Test
After validation: Switch traffic to Green
Benefits: Instant rollback, zero downtime
Canary Deployment
Executive Summary: Gradually roll out to small percentage of users first. Monitor, then expand or rollback.
v1: 95% traffic
v2: 5% traffic (canary)
Monitor metrics...
v1: 50% → v2: 50%
v1: 0% → v2: 100%
Feature Flags
Executive Summary: Control feature availability without deployment. Enable/disable features dynamically.
Use Cases:
- Gradual rollout
- A/B testing
- Kill switches
- Beta features
Tools: LaunchDarkly, Split, Flagsmith
Back-of-Envelope Calculations
Common Numbers
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Main memory reference: 100 ns
SSD random read: 150 μs
HDD seek: 10 ms
Packet roundtrip (same datacenter): 500 μs
Packet roundtrip (CA to Netherlands): 150 ms
Storage
1 byte = 8 bits
1 KB = 1,000 bytes
1 MB = 1,000 KB
1 GB = 1,000 MB
1 TB = 1,000 GB
ASCII char: 1 byte
Unicode char (UTF-8): 1-4 bytes
UUID: 16 bytes
Timestamp: 8 bytes
Scale
1 million = 10^6
1 billion = 10^9
Seconds in day: 86,400 ≈ 10^5
Seconds in year: 31,536,000 ≈ 3 × 10^7
QPS Estimates
Daily Active Users (DAU): X
Avg requests per user per day: Y
QPS = (X × Y) / 86400
Peak QPS ≈ 2-3 × average QPS
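A worked example of the formula above, with made-up numbers (10M DAU, 20 requests per user per day, peak factor 2.5 as the midpoint of the 2-3x rule of thumb):

```python
dau = 10_000_000          # daily active users (X)
requests_per_user = 20    # avg requests per user per day (Y)
seconds_per_day = 86_400

avg_qps = dau * requests_per_user / seconds_per_day   # ≈ 2,315
peak_qps = 2.5 * avg_qps                              # ≈ 5,787
```

In an interview, round aggressively (2e8 requests/day over ~1e5 seconds ≈ 2,000 QPS); the order of magnitude is what drives the design.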
System Design Interview Framework
1. Requirements Clarification (3-5 min)
- Functional requirements
- Non-functional requirements
- Scale estimates
- Constraints and assumptions
2. Back-of-Envelope Estimation (3-5 min)
- QPS
- Storage
- Bandwidth
- Memory (cache)
3. High-Level Design (10-15 min)
- Core components
- Data flow
- APIs
- Database schema
4. Deep Dive (15-20 min)
- Scaling bottlenecks
- Database choices
- Caching strategies
- Failure handling
5. Wrap Up (3-5 min)
- Trade-offs discussed
- Future improvements
- Edge cases