Chaos Engineering — Deep Dive

Level: Advanced
Pre-reading: 06 · Resilience & Reliability


What is Chaos Engineering?

Chaos Engineering is the practice of deliberately injecting controlled failures into a system and observing how it responds, so that weaknesses surface in an experiment rather than in an outage.

"Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production." — Principles of Chaos Engineering


Why Chaos Engineering?

| Without Chaos Engineering | With Chaos Engineering |
| --- | --- |
| "We think it's resilient" | "We've tested that it fails gracefully" |
| Find issues in production incidents | Find issues before they become incidents |
| Hope circuit breakers work | Know circuit breakers work |
| Untested runbooks | Validated recovery procedures |

The Chaos Engineering Process

```mermaid
graph LR
    A[Define Steady State] --> B[Form Hypothesis]
    B --> C[Plan Experiment]
    C --> D[Run Experiment]
    D --> E[Analyze Results]
    E --> F[Improve System]
    F --> A
```

1. Define Steady State

What does "healthy" look like? Define metrics that indicate normal operation.

| Metric | Steady state |
| --- | --- |
| Error rate | < 0.1% |
| p99 latency | < 500ms |
| Throughput | ~1000 req/s |
| Order completion | > 99% |

2. Form Hypothesis

"When we kill one pod of the payment service, the system will continue processing orders with less than 1% additional latency and no increase in error rate."
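A hypothesis like this is easiest to evaluate when its tolerances are written down as checkable numbers. A minimal sketch — the baseline values here are illustrative assumptions, not measurements from the source:

```java
// Encodes the example hypothesis as explicit, checkable tolerances.
// Baseline values are illustrative assumptions.
public class Hypothesis {
    static final double BASELINE_P99_MS = 400.0;      // assumed pre-experiment p99
    static final double MAX_LATENCY_INCREASE = 0.01;  // "less than 1% additional latency"
    static final double BASELINE_ERROR_RATE = 0.001;  // assumed pre-experiment error rate

    /** True if the metrics observed during the experiment stay within tolerance. */
    public static boolean holds(double observedP99Ms, double observedErrorRate) {
        return observedP99Ms <= BASELINE_P99_MS * (1 + MAX_LATENCY_INCREASE)
            && observedErrorRate <= BASELINE_ERROR_RATE; // "no increase in error rate"
    }
}
```

Writing the hypothesis this way forces the team to agree on the baseline and the tolerance before the experiment runs, which is what makes the result unambiguous afterwards.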

3. Plan & Run Experiment

Decide the scope, duration, and abort conditions in advance, then inject the failure in a controlled way.

4. Analyze Results

Did the system behave as hypothesized? If not, why?

5. Improve

Fix weaknesses found. Re-run experiment to verify fix.
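The steps above can be sketched as a single experiment loop. The injector, rollback, and steady-state check below are placeholders (assumptions) to be wired to a real chaos tool and metrics source:

```java
import java.util.function.BooleanSupplier;

// Minimal sketch of the experiment loop: verify steady state, inject the
// fault, verify steady state again, and always roll the fault back.
public class ChaosExperiment {

    private final BooleanSupplier steadyState; // e.g. queries error rate / latency
    private final Runnable inject;             // e.g. kills one pod
    private final Runnable rollback;           // stops the injection

    public ChaosExperiment(BooleanSupplier steadyState, Runnable inject, Runnable rollback) {
        this.steadyState = steadyState;
        this.inject = inject;
        this.rollback = rollback;
    }

    /** Returns true if the hypothesis held: steady state before and during the fault. */
    public boolean run() {
        if (!steadyState.getAsBoolean()) {
            return false; // never experiment on an already-unhealthy system
        }
        inject.run();
        try {
            return steadyState.getAsBoolean();
        } finally {
            rollback.run(); // always remove the fault, even if the check throws
        }
    }
}
```

Note the guard clause: if the system is not in steady state to begin with, the experiment is skipped — injecting faults into an already-degraded system produces no usable signal.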


Types of Failures to Inject

| Category | Failures |
| --- | --- |
| Compute | Kill pod, kill node, CPU exhaustion |
| Network | Latency, packet loss, partition |
| Dependency | Database down, API unavailable |
| Resource | Memory exhaustion, disk full |
| Time | Clock skew, slow DNS |
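For the "Time" category, Chaos Mesh provides a TimeChaos resource that shifts a container's clock. A sketch based on the Chaos Mesh documentation — verify field names against the Chaos Mesh version you run:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-order-service
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  timeOffset: "-10m"   # shift the pod's clock 10 minutes into the past
  duration: "5m"
```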

Chaos Engineering Tools

Kubernetes-Native

| Tool | Description |
| --- | --- |
| Chaos Mesh | CNCF incubating; comprehensive K8s chaos |
| Litmus | CNCF sandbox; workflow-based |
| Chaos Monkey | Netflix original; random instance termination |
| Pumba | Docker container chaos |

Cloud-Native

| Tool | Description |
| --- | --- |
| AWS Fault Injection Simulator | AWS-native; managed service |
| Azure Chaos Studio | Azure-native |
| Gremlin | Commercial; enterprise features |

Code-Level

| Tool | Description |
| --- | --- |
| Toxiproxy | Network condition simulation |
| Byteman | JVM fault injection |
| Failsafe | Java resilience library (retries, circuit breakers, fallbacks) |

Chaos Mesh Example

Pod Kill

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-order-service
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service
  scheduler:           # Chaos Mesh 1.x style; in 2.x, wrap the chaos in a Schedule resource instead
    cron: "@every 2h"
```

Network Delay

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-payment
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "5m"
```

Stress Test

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "10m"
```

Blast Radius Control

Start Small

```mermaid
graph TD
    A[Start: 1 pod in dev] --> B[1 pod in staging]
    B --> C[1 pod in prod]
    C --> D[1 node in prod]
    D --> E[1 AZ in prod]
```

Scope Control

| Scope | Blast radius |
| --- | --- |
| Single pod | Minimal; tests pod-level resilience |
| Single node | Tests node failure handling |
| Single AZ | Tests multi-AZ resilience |
| Single region | Tests multi-region failover |

Guardrails

```yaml
# Illustrative abort conditions (pseudo-configuration): `abort` is not a
# real Chaos Mesh field. Enforce thresholds like these with an external
# watchdog or a workflow status check, and always cap the duration.
spec:
  # Maximum impact duration
  duration: "10m"

  # Abort if either threshold is crossed
  abort:
    - condition: "error_rate > 5%"
    - condition: "p99_latency > 2s"
```
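A standalone chaos resource will not abort itself based on metrics, so a common pattern is a small external watchdog that polls the monitoring system and removes the chaos object when a threshold is crossed. A sketch, with the metric sources and the abort action as placeholders to be wired to your own monitoring stack and chaos tool:

```java
import java.util.function.DoubleSupplier;

// Hypothetical watchdog enforcing abort thresholds (5% error rate, 2s p99).
// Wire errorRate/p99LatencyMs to your monitoring system and abortAction to
// your chaos tool (e.g. deleting the running chaos resource).
public class ExperimentWatchdog {

    private static final double MAX_ERROR_RATE = 0.05;  // 5%
    private static final double MAX_P99_MS = 2000.0;    // 2s

    private final DoubleSupplier errorRate;
    private final DoubleSupplier p99LatencyMs;
    private final Runnable abortAction;

    public ExperimentWatchdog(DoubleSupplier errorRate,
                              DoubleSupplier p99LatencyMs,
                              Runnable abortAction) {
        this.errorRate = errorRate;
        this.p99LatencyMs = p99LatencyMs;
        this.abortAction = abortAction;
    }

    /** Polls once; aborts the experiment and returns true if any threshold is exceeded. */
    public boolean check() {
        if (errorRate.getAsDouble() > MAX_ERROR_RATE
                || p99LatencyMs.getAsDouble() > MAX_P99_MS) {
            abortAction.run();
            return true;
        }
        return false;
    }
}
```

In practice this runs on a short polling interval for the whole experiment window, so a breach is caught within seconds rather than at the end of the experiment.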

GameDay

A GameDay is a planned chaos engineering exercise with the team.

GameDay Structure

| Phase | Duration | Activity |
| --- | --- | --- |
| Prep | 1-2 weeks | Define scenarios, prepare runbooks |
| Briefing | 30 min | Review scenarios, assign roles |
| Execution | 2-4 hours | Run experiments, observe, respond |
| Debrief | 1 hour | Review findings, identify improvements |
| Follow-up | 1-2 weeks | Implement fixes |

GameDay Roles

| Role | Responsibility |
| --- | --- |
| Facilitator | Runs the exercise; controls pace |
| Injector | Executes failure injections |
| Responder | On-call engineers who respond as they would in a real incident |
| Observer | Documents findings |
| Safety | Stops the exercise if needed |

Steady State Verification

Automated Checks

```java
import java.time.Duration;

public class SteadyStateVerifier {

    private final MetricsClient metrics; // hypothetical facade over your monitoring system

    public SteadyStateVerifier(MetricsClient metrics) {
        this.metrics = metrics;
    }

    public boolean verifyOrderService() {
        Duration window = Duration.ofMinutes(5);
        double errorRate = metrics.getErrorRate("order-service", window);
        double p99Latency = metrics.getP99Latency("order-service", window);
        double throughput = metrics.getThroughput("order-service", window);

        return errorRate < 0.001   // < 0.1%, matching the steady-state table
            && p99Latency < 500    // ms
            && throughput > 900;   // req/s; ~90% of the 1000 req/s steady state
    }
}
```

Continuous Verification

```yaml
# Run chaos experiments continuously
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: continuous-chaos
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  type: "PodChaos"
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: order-service
```

Chaos Engineering Maturity

| Level | Description |
| --- | --- |
| 0: None | No chaos engineering |
| 1: Ad-hoc | Manual experiments in dev/staging |
| 2: Planned | Scheduled GameDays |
| 3: Automated | Continuous chaos in staging |
| 4: Production | Controlled chaos in production |
| 5: Culture | Chaos as part of the development lifecycle |

Best Practices

| Practice | Rationale |
| --- | --- |
| Start in non-prod | Learn safely |
| Define steady state | Know what normal looks like |
| Small blast radius | Minimize impact |
| Have a rollback plan | Stop experiments quickly |
| Monitor closely | Observe system behavior |
| Communicate | Inform stakeholders |
| Document findings | Build a knowledge base |

Common Mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| No steady-state definition | Don't know what to measure | Define SLIs before experimenting |
| Blast radius too large | Real outage | Start small; expand gradually |
| No abort conditions | Experiment causes damage | Set automatic abort thresholds |
| Running in prod first | Incident | Start in dev/staging |
| No follow-up | Findings not addressed | Track and fix issues found |

How do you convince management to allow chaos engineering in production?

1. Start small — prove value in non-prod first.
2. Show ROI — document incidents prevented by experiment findings.
3. Control the blast radius — demonstrate guardrails and abort conditions.
4. Reference industry leaders — Netflix, Amazon, and Google all do it.
5. Frame it as testing — "We're testing resilience, not breaking things."

What's the difference between chaos engineering and testing?

Testing verifies known behavior with expected inputs. Chaos engineering explores unknown behavior with unexpected conditions. Tests prove the system works as designed; chaos proves it fails gracefully when things go wrong. Both are needed; chaos engineering supplements testing.

How often should you run chaos experiments?

Dev/Staging: Continuously — run with every deployment. Production: Start with scheduled GameDays (monthly), progress to continuous experiments as maturity grows. Critical: run after any significant change. The goal is to catch regressions early, not just find initial weaknesses.