Resilience & Reliability
In distributed systems, failure is not an exception — it's the norm. Design for it explicitly.
Resilience Patterns Overview
```mermaid
graph TD
    A[Incoming Request] --> B[Rate Limiter]
    B --> C[Circuit Breaker]
    C -->|Open - Fail Fast| D[Fallback]
    C -->|Closed| E[Bulkhead - Thread Pool]
    E --> F[Timeout]
    F --> G[Downstream Service]
    G -->|Failure| H[Retry with Backoff]
    H -->|Exhausted| D
```
Circuit Breaker
States:
| State | Behavior | Transition |
|---|---|---|
| Closed | Requests pass through; failures counted | → Open when threshold exceeded |
| Open | All requests fail immediately (fast fail); no calls made | → Half-Open after recovery timeout |
| Half-Open | Limited probe requests sent to test recovery | → Closed if success; → Open if fails |
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate exceeds threshold
    Open --> HalfOpen: recovery timeout elapsed
    HalfOpen --> Closed: probe succeeds
    HalfOpen --> Open: probe fails
```
- Threshold: Open after N failures or X% error rate in a sliding time window
- Libraries: Resilience4j (Spring Boot), Polly (.NET); Hystrix is deprecated
- Key insight: Circuit breaker prevents cascading failure; retry recovers from transient failure
→ Deep Dive: Circuit Breaker — States, thresholds, Resilience4j configuration, fallbacks
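The state machine above can be sketched in a few lines. This is a minimal illustration, not the Resilience4j or Polly API; the class name, thresholds, and single-probe Half-Open behavior are simplifying assumptions.

```python
import time

class CircuitBreaker:
    """Minimal Closed/Open/Half-Open state machine (illustrative sketch)."""
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # recovery timeout elapsed: let a probe through
            else:
                return fallback()          # fail fast: no downstream call at all
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"        # probe failed or threshold exceeded
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"              # probe (or normal call) succeeded
        return result
```

Production libraries use an error *rate* over a sliding window rather than a plain failure count, but the state transitions are the same.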
Retry Pattern
| Concept | Description |
|---|---|
| Exponential backoff | Wait time doubles on each retry: 1s, 2s, 4s, 8s... |
| Jitter | Add random variance to backoff — prevents thundering herd when all clients retry simultaneously |
| Max retries | Always cap retries; uncapped = infinite load on a struggling service |
| Idempotency required | Only retry operations that are safe to repeat without side effects |
| Retryable errors | 503, 429, connection timeout — NOT 4xx client errors |
Thundering herd: All clients retry simultaneously after a failure, causing a traffic spike that re-causes the failure. Jitter solves this.
→ Deep Dive: Retry Pattern — Exponential backoff, jitter, idempotency requirements
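The table's ingredients, doubling backoff, a hard cap, and full jitter, combine into one small helper. A minimal sketch; the function name and defaults are illustrative, and which exceptions count as retryable depends on your client library.

```python
import random
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, max_delay=30.0,
                       retryable=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Exponential backoff (1s, 2s, 4s...) with full jitter and a retry cap."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise                                 # retries exhausted: surface the error
            backoff = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, backoff))         # full jitter avoids thundering herd
```

Note the `retryable` tuple: only transient transport errors are retried, never 4xx-style client errors, and `fn` must be idempotent for this to be safe.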
Bulkhead Pattern
Isolate resources so a failure in one area cannot exhaust the entire system.
| Type | Implementation |
|---|---|
| Thread pool isolation | Separate thread pool per downstream dependency; one slow service can't starve all others |
| Connection pool isolation | Separate DB/HTTP connection pool per consumer group |
| Process isolation | Separate deployments for critical vs non-critical paths (pricing can't kill checkout) |
Named after the watertight compartments in a ship hull.
→ Deep Dive: Bulkhead Pattern — Thread pool isolation, connection pool isolation, sizing
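Thread-pool isolation can be sketched with standard executors: one dedicated pool per downstream dependency, so a hung dependency only blocks its own threads. Pool names and sizes here are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# One small, dedicated pool per downstream dependency (sizes are illustrative).
# If pricing hangs, only pricing's 4 threads block; checkout keeps its own pool.
pools = {
    "pricing":  ThreadPoolExecutor(max_workers=4,  thread_name_prefix="pricing"),
    "checkout": ThreadPoolExecutor(max_workers=16, thread_name_prefix="checkout"),
}

def call_dependency(name, fn, *args):
    """Submit work to the dependency's isolated pool and wait for the result."""
    return pools[name].submit(fn, *args).result()
```

In practice you would also bound the submission queue and pair each pool with a timeout, so blocked threads are reclaimed rather than merely contained.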
Timeout Pattern
| Type | Description |
|---|---|
| Connect timeout | Max time to establish a TCP connection |
| Read timeout | Max time to receive a complete response |
| Write timeout | Max time to send a request body |
| Deadline propagation | Pass remaining timeout budget through service call chains (grpc-timeout, X-Request-Deadline) |
Always set timeouts. An unconfigured timeout = each thread waiting indefinitely = thread pool exhaustion = service outage.
→ Deep Dive: Timeout Pattern — Connect/read/write timeouts, deadline propagation
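Deadline propagation can be sketched as a shrinking budget carried along the call chain, in the spirit of gRPC's `grpc-timeout`. The class and method names below are illustrative, not a real library's API.

```python
import time

class Deadline:
    """Carry a shrinking timeout budget across a service call chain (sketch)."""
    def __init__(self, total_seconds):
        self.expires_at = time.monotonic() + total_seconds

    def remaining(self):
        return self.expires_at - time.monotonic()

    def timeout_for_next_hop(self, max_step=5.0):
        budget = self.remaining()
        if budget <= 0:
            raise TimeoutError("deadline already exceeded; skip the downstream call")
        return min(budget, max_step)   # use this as the next call's read timeout
```

Each hop derives its connect/read timeout from the remaining budget instead of a fixed constant, so downstream calls can never outlive the client's overall deadline.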
Rate Limiting & Throttling
| Algorithm | How It Works | Properties |
|---|---|---|
| Token Bucket | Tokens refill at fixed rate; consume one per request; burst allowed up to bucket size | Smooth average; allows bursts |
| Sliding Window | Count requests in rolling time window | Precise; more memory |
| Fixed Window | Count resets at fixed interval | Simple; allows burst at boundary |
| Leaky Bucket | Requests queued and processed at fixed rate | Smoothest output; no burst |
- Response: 429 Too Many Requests
- Response headers: `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining`
→ Deep Dive: Rate Limiting — Token bucket, sliding window, distributed rate limiting
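The token bucket row can be made concrete in a few lines: tokens refill continuously at a fixed rate, bursts are allowed up to the bucket size, and a request without a token gets rejected (the caller would answer 429 with `Retry-After`). A single-process sketch; distributed limiters keep this state in a shared store instead.

```python
import time

class TokenBucket:
    """Token bucket sketch: refill at `rate` tokens/sec, burst up to `capacity`."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)   # start full: an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Lazy refill: add tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                    # reject: respond 429 Too Many Requests
```

The injectable `clock` is just for testability; the algorithm itself needs only a counter and a timestamp per client.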
Fallback Pattern
| Type | Description | Example |
|---|---|---|
| Default value | Return sensible constant | Empty product list, default price |
| Cached response | Return last known good response | Product details cache |
| Degraded mode | Return partial data; skip non-critical sections | Show order without real-time stock |
| Stub response | Hardcoded response for maintenance or testing | Feature flag disabled |
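The fallback types above naturally chain: try the live call, fall back to the last known good cached response, and only then to a default value. A minimal sketch; the function shape and the default payload are illustrative assumptions.

```python
def get_product(product_id, fetch_live, cache):
    """Fallback chain sketch: live call -> cached response -> default value."""
    try:
        product = fetch_live(product_id)
        cache[product_id] = product        # refresh last-known-good on success
        return product
    except Exception:
        if product_id in cache:
            return cache[product_id]       # cached-response fallback
        # Default-value fallback: degraded but well-formed (no real-time stock)
        return {"id": product_id, "name": "unavailable", "stock": None}
```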
Idempotency
| Concept | Description |
|---|---|
| Idempotent operation | Call it N times = same result as calling it once |
| Idempotency key | Client generates unique UUID per request; server stores it and deduplicates |
| HTTP verbs | GET, PUT, DELETE are idempotent by specification; POST is not |
| Why it matters | Enables safe retries in at-least-once messaging; prevents duplicate charges |
→ Deep Dive: Fallback and Idempotency — Fallback strategies, idempotency keys, graceful degradation
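Server-side handling of an idempotency key is a store-then-replay pattern: execute the operation once per key and replay the stored result on duplicates. A sketch with an in-memory dict; a real service would use an atomic insert into a shared store with a TTL.

```python
def handle_charge(store, idempotency_key, charge_fn):
    """Idempotency-key dedup sketch: execute once per key, replay the result after."""
    if idempotency_key in store:
        return store[idempotency_key]   # duplicate retry: replay, don't re-charge
    result = charge_fn()
    store[idempotency_key] = result     # production: atomic insert with a TTL
    return result
```

This is what makes retries safe for non-idempotent operations like POST: the client retries with the same key, and the server guarantees at-most-one execution.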
Health Checks
| Type | Question | Kubernetes Probe |
|---|---|---|
| Liveness | Is the app alive and not deadlocked? | livenessProbe — restart pod if fails |
| Readiness | Is the app ready to serve traffic? | readinessProbe — remove from load balancer if fails |
| Startup | Has the app finished its slow initialization? | startupProbe — delays liveness/readiness checks |
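The key point in the table is that liveness and readiness answer different questions and must be wired to different state. A minimal sketch of that separation; the class and flag names are illustrative.

```python
class Health:
    """Probe-state sketch: liveness and readiness are answered separately."""
    def __init__(self):
        self.started = False        # flipped once slow initialization completes
        self.shutting_down = False  # flipped when draining before shutdown

    def liveness(self):
        # Alive as long as the process can respond at all;
        # a restart should only be triggered by deadlock or crash.
        return 200

    def readiness(self):
        # Not ready while starting up or draining:
        # the load balancer stops routing traffic, but the pod is not restarted.
        return 200 if self.started and not self.shutting_down else 503
```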
Backpressure
| Concept | Description |
|---|---|
| Problem | Fast producer overwhelms a slow consumer |
| Solution | Producer slows down or stops when consumer signals it's overwhelmed |
| Reactive Streams | Publisher, Subscriber, Subscription protocol — backpressure built in |
| Kafka signal | Monitor consumer lag; alert when partition lag grows unbounded |
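The simplest backpressure mechanism is a bounded buffer: when the consumer falls behind, the buffer fills and the producer is forced to slow down or shed load instead of buffering unboundedly. A sketch with the standard-library queue; the fail-fast choice (vs. blocking) is an illustrative assumption.

```python
import queue

# Bounded queue as a backpressure signal between producer and consumer.
buf = queue.Queue(maxsize=2)

def produce(item):
    try:
        buf.put(item, block=False)   # fail fast instead of buffering unboundedly
        return True
    except queue.Full:
        return False                 # consumer is behind: slow down, shed, or retry later
```

Reactive Streams formalizes the same idea as demand signaling (the Subscriber requests N items), and Kafka surfaces it as growing consumer lag.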
Chaos Engineering
| Concept | Description |
|---|---|
| Steady state | Define and measure normal behavior before injecting failures |
| Blast radius | Start small; inject failure in one pod, one AZ, one service |
| Fault types | Kill pods, add network latency, drop packets, exhaust resources |
| Tools | Netflix Chaos Monkey, Chaos Toolkit, AWS Fault Injection Simulator, Litmus (K8s) |
| Goal | Verify resilience claims before incidents do it for you |
→ Deep Dive: Chaos Engineering — GameDays, fault injection, Chaos Mesh, blast radius control
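At its core, fault injection is just wrapping one dependency so a configurable fraction of calls fail, keeping the blast radius to that single call site. An illustrative sketch, not any chaos tool's API; real tools inject faults at the infrastructure level (pods, network) rather than in code.

```python
import random

def chaos_wrap(fn, failure_rate=0.1, rng=random.random):
    """Fault-injection sketch: fail a configurable fraction of calls to one dependency."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("chaos: injected fault")
        return fn(*args, **kwargs)
    return wrapped
```

Running a wrapped dependency in a steady-state environment verifies that the circuit breakers, retries, and fallbacks above actually engage, before an incident tests them for you.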