Chaos Engineering — Deep Dive
Level: Advanced
Pre-reading: 06 · Resilience & Reliability
What is Chaos Engineering?
Chaos Engineering is the practice of deliberately injecting failures into a system, under controlled conditions, to verify that it degrades and recovers as expected. The canonical definition:
> "Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production." — Principles of Chaos Engineering
Why Chaos Engineering?
| Without Chaos Engineering | With Chaos Engineering |
|---|---|
| "We think it's resilient" | "We've tested that it fails gracefully" |
| Find issues in production incidents | Find issues before they become incidents |
| Hope circuit breakers work | Know circuit breakers work |
| Untested runbooks | Validated recovery procedures |
The Chaos Engineering Process
```mermaid
graph LR
    A[Define Steady State] --> B[Form Hypothesis]
    B --> C[Plan Experiment]
    C --> D[Run Experiment]
    D --> E[Analyze Results]
    E --> F[Improve System]
    F --> A
```
1. Define Steady State
What does "healthy" look like? Define metrics that indicate normal operation.
| Metric | Steady State |
|---|---|
| Error rate | < 0.1% |
| p99 latency | < 500ms |
| Throughput | 1000 req/s |
| Order completion | > 99% |
2. Form Hypothesis
"When we kill one pod of the payment service, the system will continue processing orders with < 1% additional latency and 0% error rate increase."
3. Plan & Run Experiment
Decide the scope, duration, and abort conditions up front, then inject the failure in a controlled way.
4. Analyze Results
Did the system behave as hypothesized? If not, why?
5. Improve
Fix weaknesses found. Re-run experiment to verify fix.
Types of Failures to Inject
| Category | Failures |
|---|---|
| Compute | Kill pod, kill node, CPU exhaustion |
| Network | Latency, packet loss, partition |
| Dependency | Database down, API unavailable |
| Resource | Memory exhaustion, disk full |
| Time | Clock skew, slow DNS |
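As a concrete sketch, a network partition (one of the failure categories above) can be expressed declaratively in Chaos Mesh; the service names and namespace here are illustrative assumptions:

```yaml
# Illustrative: partition order-service from payment-service for two minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-order-payment
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: order-service
  direction: both        # drop traffic in both directions
  target:
    mode: all
    selector:
      labelSelectors:
        app: payment-service
  duration: "2m"
```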
Chaos Engineering Tools
Kubernetes & Containers
| Tool | Description |
|---|---|
| Chaos Mesh | CNCF incubating; comprehensive K8s chaos |
| Litmus | CNCF incubating; workflow-based |
| Chaos Monkey | Netflix original; random instance termination |
| Pumba | Docker container chaos |
Cloud-Native
| Tool | Description |
|---|---|
| AWS Fault Injection Simulator | AWS-native; managed service |
| Azure Chaos Studio | Azure-native |
| Gremlin | Commercial; enterprise features |
Code-Level
| Tool | Description |
|---|---|
| Toxiproxy | Network condition simulation |
| Byteman | JVM fault injection |
| Failsafe | Programmatic failure injection |
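To make "programmatic failure injection" concrete, here is a minimal sketch of a code-level fault injector, not tied to any of the tools above (all names are illustrative; libraries like Failsafe or Byteman offer far richer policies):

```java
import java.util.Random;
import java.util.function.Supplier;

/** Wraps a call and fails a configured fraction of the time. */
public class FaultInjector {
    private final double failureRate;   // 0.0 (never) to 1.0 (always)
    private final Random random;

    public FaultInjector(double failureRate, long seed) {
        this.failureRate = failureRate;
        this.random = new Random(seed);   // seeded for reproducible experiments
    }

    /** Runs the action, or throws an injected fault with probability failureRate. */
    public <T> T call(Supplier<T> action) {
        if (random.nextDouble() < failureRate) {
            throw new RuntimeException("injected fault");
        }
        return action.get();
    }
}
```

Wrapping an outbound dependency call this way in a test environment is one cheap way to verify that retries and fallbacks actually engage.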
Chaos Mesh Example
Pod Kill
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-order-service
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service
  scheduler:            # Chaos Mesh 1.x style; 2.x moves this to a Schedule resource
    cron: "@every 2h"
```
Network Delay
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-payment
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "5m"
```
Stress Test
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "10m"
```
Blast Radius Control
Start Small
```mermaid
graph TD
    A[Start: 1 pod in dev] --> B[1 pod in staging]
    B --> C[1 pod in prod]
    C --> D[1 node in prod]
    D --> E[1 AZ in prod]
```
Scope Control
| Scope | Blast Radius |
|---|---|
| Single pod | Minimal; tests pod-level resilience |
| Single node | Tests node failure handling |
| Single AZ | Tests multi-AZ resilience |
| Single region | Tests multi-region failover |
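In Chaos Mesh, blast radius can be narrowed directly through the `mode` field that every chaos resource carries; a hedged fragment (thresholds illustrative):

```yaml
# Limit an experiment to a fraction of the pods matched by the selector
spec:
  mode: fixed-percent   # other modes: one, all, fixed, random-max-percent
  value: "10"           # affect at most 10% of matching pods
```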
Guardrails
```yaml
# Guardrails for a chaos experiment. Chaos Mesh enforces duration natively;
# the abort block below is illustrative pseudo-config — metric-based aborts
# are typically implemented by an external watchdog that pauses or deletes
# the experiment.
spec:
  duration: "10m"   # maximum impact duration
  abort:            # illustrative, not a real Chaos Mesh field
    - condition: "error_rate > 5%"
    - condition: "p99_latency > 2s"
```
GameDay
A GameDay is a planned chaos engineering exercise with the team.
GameDay Structure
| Phase | Duration | Activity |
|---|---|---|
| Prep | 1-2 weeks | Define scenarios, prepare runbooks |
| Briefing | 30 min | Review scenarios, assign roles |
| Execution | 2-4 hours | Run experiments, observe, respond |
| Debrief | 1 hour | Review findings, identify improvements |
| Follow-up | 1-2 weeks | Implement fixes |
GameDay Roles
| Role | Responsibility |
|---|---|
| Facilitator | Runs the exercise; controls pace |
| Injector | Executes failure injections |
| Responder | On-call engineers responding |
| Observer | Documents findings |
| Safety | Stops exercise if needed |
Steady State Verification
Automated Checks
```java
import java.time.Duration;

public class SteadyStateVerifier {

    // Stand-in for the metrics backend (e.g. a Prometheus client)
    interface Metrics {
        double getErrorRate(String service, Duration window);    // fraction, 0.0-1.0
        double getP99Latency(String service, Duration window);   // milliseconds
        double getThroughput(String service, Duration window);   // requests/second
    }

    private final Metrics metrics;

    public SteadyStateVerifier(Metrics metrics) { this.metrics = metrics; }

    public boolean verifyOrderService() {
        Duration window = Duration.ofMinutes(5);
        return metrics.getErrorRate("order-service", window) < 0.001   // < 0.1%
            && metrics.getP99Latency("order-service", window) < 500    // < 500 ms
            && metrics.getThroughput("order-service", window) > 900;   // ~90% of 1000 req/s
    }
}
```
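A steady-state check like the one above can also double as an automated abort condition. The sketch below is illustrative (the polling loop and names are assumptions, not a real client API); a real guard would hook `abortAction` into the chaos tool's pause or delete API and sleep between polls:

```java
import java.util.function.BooleanSupplier;

/** Polls a steady-state check during an experiment; aborts when it fails. */
public class ExperimentGuard {
    private final BooleanSupplier steadyState;

    public ExperimentGuard(BooleanSupplier steadyState) {
        this.steadyState = steadyState;
    }

    /** Returns true if all checks passed, false if the experiment was aborted. */
    public boolean run(int checks, Runnable abortAction) {
        for (int i = 0; i < checks; i++) {
            if (!steadyState.getAsBoolean()) {
                abortAction.run();   // e.g. delete the chaos resource
                return false;
            }
            // in a real guard: sleep between polls
        }
        return true;
    }
}
```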
Continuous Verification
```yaml
# Run chaos experiments continuously
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: continuous-chaos
spec:
  schedule: "0 */4 * * *"   # every 4 hours
  type: "PodChaos"
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: order-service
```
Chaos Engineering Maturity
| Level | Description |
|---|---|
| 0: None | No chaos engineering |
| 1: Ad-hoc | Manual experiments in dev/staging |
| 2: Planned | Scheduled GameDays |
| 3: Automated | Continuous chaos in staging |
| 4: Production | Controlled chaos in production |
| 5: Culture | Chaos as part of development lifecycle |
Best Practices
| Practice | Rationale |
|---|---|
| Start in non-prod | Learn safely |
| Define steady state | Know what normal looks like |
| Small blast radius | Minimize impact |
| Have rollback plan | Stop experiments quickly |
| Monitor closely | Observe system behavior |
| Communicate | Inform stakeholders |
| Document findings | Build knowledge base |
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| No steady state definition | Don't know what to measure | Define SLIs before experimenting |
| Too large blast radius | Real outage | Start small; expand gradually |
| No abort conditions | Experiment causes damage | Set automatic abort thresholds |
| Running in prod first | Incident | Start in dev/staging |
| No follow-up | Findings not addressed | Track and fix issues found |
How do you convince management to allow chaos engineering in production?
(1) Start small — prove value in non-prod first. (2) Show ROI — document prevented incidents from findings. (3) Control blast radius — demonstrate guardrails and abort conditions. (4) Reference industry leaders — Netflix, Amazon, Google all do it. (5) Frame it as testing — "We're testing resilience, not breaking things."
What's the difference between chaos engineering and testing?
Testing verifies known behavior with expected inputs. Chaos engineering explores unknown behavior with unexpected conditions. Tests prove the system works as designed; chaos proves it fails gracefully when things go wrong. Both are needed; chaos engineering supplements testing.
How often should you run chaos experiments?
Dev/Staging: Continuously — run with every deployment. Production: Start with scheduled GameDays (monthly), progress to continuous experiments as maturity grows. Critical: run after any significant change. The goal is to catch regressions early, not just find initial weaknesses.