Resilience & Reliability

In distributed systems, failure is not an exception — it's the norm. Design for it explicitly.


Resilience Patterns Overview

graph TD
    A[Incoming Request] --> B[Rate Limiter]
    B --> C[Circuit Breaker]
    C -->|Open - Fail Fast| D[Fallback]
    C -->|Closed| E[Bulkhead - Thread Pool]
    E --> F[Timeout]
    F --> G[Downstream Service]
    G -->|Failure| H[Retry with Backoff]
    H -->|Exhausted| D

Circuit Breaker

States:

| State | Behavior | Transition |
| --- | --- | --- |
| Closed | Requests pass through; failures are counted | → Open when the failure threshold is exceeded |
| Open | All requests fail immediately (fail fast); no downstream calls are made | → Half-Open after the recovery timeout |
| Half-Open | A limited number of probe requests test recovery | → Closed if probes succeed; → Open if a probe fails |

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate exceeds threshold
    Open --> HalfOpen: recovery timeout elapsed
    HalfOpen --> Closed: probe succeeds
    HalfOpen --> Open: probe fails

  • Threshold: Open after N failures or X% error rate in a sliding time window
  • Libraries: Resilience4j (Java, with Spring Boot integration), Polly (.NET); Hystrix is in maintenance mode and no longer actively developed
  • Key insight: Circuit breaker prevents cascading failure; retry recovers from transient failure
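
The state transitions above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration rather than how Resilience4j or Polly implement it: it counts consecutive failures instead of a sliding-window error rate, and the class name, parameters, and injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures (not thread-safe)."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.state = "closed"
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Recovery timeout elapsed: let this call through as a probe.
            self.state = "half_open"
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a half-open probe) closes the circuit.
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-opens immediately; otherwise open at the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would typically catch `CircuitOpenError` and route to a fallback instead of surfacing the error.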

Deep Dive: Circuit Breaker — States, thresholds, Resilience4j configuration, fallbacks


Retry Pattern

| Concept | Description |
| --- | --- |
| Exponential backoff | Wait time doubles on each retry: 1s, 2s, 4s, 8s, ... |
| Jitter | Add random variance to the backoff delay; prevents a thundering herd when all clients retry simultaneously |
| Max retries | Always cap retries; uncapped retries put unbounded load on a struggling service |
| Idempotency required | Only retry operations that can be repeated without additional side effects |
| Retryable errors | 503, 429, connection timeouts; NOT other 4xx client errors (e.g. 400, 401, 403, 404) |

Thundering herd: All clients retry simultaneously after a failure, causing a traffic spike that triggers the failure again. Jitter mitigates this by spreading retries out over time.
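
As a rough sketch of how capped exponential backoff and jitter fit together, the helper below uses the "full jitter" variant (a uniform random delay between zero and the exponential ceiling). `TransientError`, `retry`, and `backoff_delay` are hypothetical names for this example, and the caller is assumed to have already classified which failures are retryable and idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. 503, 429, timeouts)."""


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry(fn, max_retries=3, base=1.0, cap=30.0, sleep=time.sleep, rng=random):
    """Call fn, retrying only TransientError, at most max_retries extra times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

Non-retryable errors (any exception other than `TransientError`) propagate immediately, which matches the rule that client errors should not be retried.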

Deep Dive: Retry Pattern — Exponential backoff, jitter, idempotency requirements


Bulkhead Pattern

Isolate resources so a failure in one area cannot exhaust the entire system.

| Type | Implementation |
| --- | --- |
| Thread pool isolation | Separate thread pool per downstream dependency; one slow service can't starve the others |
| Connection pool isolation | Separate DB/HTTP connection pool per consumer group |
| Process isolation | Separate deployments for critical vs. non-critical paths (e.g. a pricing failure can't take down checkout) |

Named after the watertight compartments in a ship hull.
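
One lightweight way to illustrate the idea is a semaphore per dependency that caps concurrent calls and rejects the overflow. Note that this is semaphore isolation, a simpler relative of the thread pool isolation listed above; the `Bulkhead` class and its names are assumptions for this sketch.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Hypothetical compartment limiting concurrent calls to one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up callers beyond its own compartment.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per downstream service, saturating the pricing compartment leaves the checkout compartment's slots untouched.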

Deep Dive: Bulkhead Pattern — Thread pool isolation, connection pool isolation, sizing


Timeout Pattern

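| Type | Description |
| --- | --- |
| Connect timeout | Max time to establish a TCP connection |
| Read timeout | Max time to wait for response data; in many clients this bounds the gap between reads, not the whole response |
| Write timeout | Max time to send the request body |
| Deadline propagation | Pass the remaining timeout budget through service call chains (e.g. gRPC's grpc-timeout header, or a custom header such as X-Request-Deadline) |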

Always set timeouts. No timeout → threads wait indefinitely on a hung dependency → thread pool exhaustion → service outage.
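
Deadline propagation can be sketched as a small budget object that each hop consults before making its next call. The `Deadline` class below is hypothetical, and encoding the remaining budget as whole milliseconds is an assumption for illustration; gRPC's `grpc-timeout` header has its own wire format.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's overall budget has been used up."""


class Deadline:
    """Hypothetical per-request time budget shared across downstream calls."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock  # injectable for testing
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap):
        """Timeout for the next call: never more than the budget that is left."""
        remaining = self.remaining()
        if remaining <= 0:
            raise DeadlineExceeded("no budget left; skip the downstream call")
        return min(per_call_cap, remaining)

    def header_value_ms(self):
        """Remaining budget in milliseconds, as an assumed header encoding."""
        return str(int(self.remaining() * 1000))
```

Each service would pass `timeout_for_call(...)` to its HTTP or RPC client and forward `header_value_ms()` to the next hop, so work stops once the original caller has given up.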

Deep Dive: Timeout Pattern — Connect/read/write timeouts, deadline propagation


Rate Limiting & Throttling

| Algorithm | How It Works | Properties |
| --- | --- | --- |
| Token Bucket | Tokens refill at a fixed rate; each request consumes one; bursts allowed up to the bucket size | Smooth average; allows bursts |
| Sliding Window | Counts requests in a rolling time window | Precise; more memory |
| Fixed Window | Counter resets at a fixed interval | Simple; allows bursts at window boundaries |
| Leaky Bucket | Requests are queued and processed at a fixed rate | Smoothest output; no bursts |

  • Response: 429 Too Many Requests
  • Response headers: Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining
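
The token bucket row can be illustrated with a short in-process sketch. It is a single-node, non-thread-safe example; the class name, the `(allowed, retry_after)` return shape, and the injectable `clock` are assumptions, and a distributed limiter would need shared state.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: `rate` tokens per second, burst up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self):
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self):
        """Return (allowed, retry_after_seconds)."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True, 0.0
        # Time until one full token is available; usable for a Retry-After header.
        return False, (1.0 - self._tokens) / self.rate
```

When `try_acquire()` returns `False`, a server would respond with 429 and could round the second value up for the Retry-After header.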

Deep Dive: Rate Limiting — Token bucket, sliding window, distributed rate limiting


Fallback Pattern

| Type | Description | Example |
| --- | --- | --- |
| Default value | Return a sensible constant | Empty product list, default price |
| Cached response | Return the last known good response | Product details cache |
| Degraded mode | Return partial data; skip non-critical sections | Show order without real-time stock |
| Stub response | Hardcoded response for maintenance or testing | Feature flag disabled |
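
The first two rows can be combined into a simple fallback chain: try the live call, then the last known good value, then a default. `with_fallback` is a hypothetical helper, and the plain dict stands in for whatever cache a real service would use.

```python
def with_fallback(primary, cache, key, default):
    """Return (value, source), where source is 'live', 'cached', or 'default'."""
    try:
        value = primary(key)
    except Exception:
        # Primary failed: prefer the last known good response, else a constant.
        if key in cache:
            return cache[key], "cached"
        return default, "default"
    cache[key] = value  # remember the last known good response
    return value, "live"
```

Returning the source alongside the value lets the caller flag degraded responses, for example by hiding real-time stock when the data came from cache.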

Idempotency

| Concept | Description |
| --- | --- |
| Idempotent operation | Calling it N times has the same effect as calling it once |
| Idempotency key | Client generates a unique key (e.g. a UUID) per logical operation and reuses it on retries; server stores it and deduplicates |
| HTTP verbs | GET, PUT, DELETE are idempotent by specification; POST and PATCH are not guaranteed to be |
| Why it matters | Enables safe retries under at-least-once delivery; prevents duplicate charges |
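
Server-side deduplication with an idempotency key might look roughly like this. `PaymentService` and its in-memory dict are assumptions for illustration; a real service would need a durable store with an atomic insert-if-absent and key expiry.

```python
import uuid


def new_idempotency_key():
    """Client side: one key per logical operation, reused on every retry."""
    return str(uuid.uuid4())


class PaymentService:
    """Hypothetical server that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, amount_cents):
        # Replay the stored response instead of charging a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        receipt = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = receipt
        return receipt
```

A retried request with the same key gets the original receipt back, so a timeout followed by a retry cannot produce a duplicate charge.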

Deep Dive: Fallback and Idempotency — Fallback strategies, idempotency keys, graceful degradation


Health Checks

| Type | Question | Kubernetes Probe |
| --- | --- | --- |
| Liveness | Is the app alive and not deadlocked? | livenessProbe: the container is restarted if it fails |
| Readiness | Is the app ready to serve traffic? | readinessProbe: the pod is removed from Service endpoints (load balancing) if it fails |
| Startup | Has the app finished its slow initialization? | startupProbe: liveness and readiness checks are held off until it succeeds |
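
On the application side, the three probes usually map to three cheap endpoints. The sketch below shows only the decision logic; the paths `/startupz`, `/livez`, and `/readyz` and the `HealthState` fields are assumptions, and the HTTP server wiring is left out. Kubernetes treats an HTTP probe response with a status from 200 to 399 as success.

```python
class HealthState:
    """Hypothetical flags the application flips as it starts and runs."""

    def __init__(self):
        self.started = False          # slow initialization finished
        self.dependencies_ok = False  # e.g. database and cache reachable


def probe_status(path, state):
    """Return the HTTP status code a probe endpoint would answer with."""
    if path == "/startupz":
        return 200 if state.started else 503
    if path == "/livez":
        # Answering at all shows the process is not deadlocked; avoid checking
        # dependencies here, or a downstream outage triggers restarts.
        return 200
    if path == "/readyz":
        return 200 if state.started and state.dependencies_ok else 503
    return 404
```

Keeping dependency checks out of the liveness endpoint is a deliberate choice: a failing database should take the pod out of rotation, not restart it.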

Backpressure

| Concept | Description |
| --- | --- |
| Problem | A fast producer overwhelms a slow consumer |
| Solution | The producer slows down or stops when the consumer signals it is overwhelmed |
| Reactive Streams | Publisher, Subscriber, Subscription protocol; backpressure is built in (the subscriber requests n items at a time) |
| Kafka signal | Monitor consumer lag; alert when partition lag keeps growing |
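
The simplest in-process form of backpressure is a bounded buffer: when it is full, the producer is told to slow down instead of the buffer growing without limit. `offer` is a hypothetical helper built on the standard library's `queue.Queue`.

```python
import queue


def offer(buffer, item, timeout=None):
    """Try to enqueue; return False when the consumer has fallen behind.

    With timeout=None the call never blocks; otherwise it waits up to
    `timeout` seconds for the consumer to free a slot.
    """
    try:
        if timeout is None:
            buffer.put_nowait(item)
        else:
            buffer.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False
```

A producer that gets `False` back can pause, shed load, or propagate the signal upstream; the bound on the queue (`queue.Queue(maxsize=...)`) is what turns a slow consumer into a visible signal.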

Chaos Engineering

| Concept | Description |
| --- | --- |
| Steady state | Define and measure normal behavior before injecting failures |
| Blast radius | Start small; inject failure in one pod, one AZ, or one service |
| Fault types | Kill pods, add network latency, drop packets, exhaust resources |
| Tools | Netflix Chaos Monkey, Chaos Toolkit, AWS Fault Injection Simulator, Litmus (K8s) |
| Goal | Verify resilience claims before incidents do it for you |
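
At the smallest possible blast radius, fault injection can be tried inside a single process by wrapping one call path. The wrapper below is a toy sketch, not how Chaos Monkey or Litmus work; `inject_faults` and `InjectedFault` are hypothetical names.

```python
import random


class InjectedFault(Exception):
    """Marks failures that came from the experiment, not the real dependency."""


def inject_faults(fn, failure_rate, rng=None):
    """Wrap fn so that roughly `failure_rate` of calls raise InjectedFault."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("chaos experiment: injected failure")
        return fn(*args, **kwargs)

    return wrapper
```

Wrapping a downstream client this way in a test environment makes it possible to check that retries, circuit breakers, and fallbacks behave as claimed before a real outage tests them.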

Deep Dive: Chaos Engineering — GameDays, fault injection, Chaos Mesh, blast radius control