ADR-016: Chaos Engineering Framework

Status

Accepted

Context

To build confidence in system reliability and validate resilience patterns, BitVelocity needs systematic chaos engineering practices. These practices provide hands-on learning of:

  • Failure mode analysis
  • Incident response
  • Observability under stress
  • Circuit breaker and retry patterns
  • Distributed system failure scenarios

Decision

Chaos Engineering Platform

Chaos Mesh, selected for:

  • Native Kubernetes integration
  • Rich experiment types
  • Easy YAML-based configuration
  • Good documentation
  • Active community
  • Free and open-source

Experiment Categories

1. Pod Chaos

  • Pod kill
  • Pod failure
  • Container kill
  • Specific container restart
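
As a rough illustration of the "specific container restart" case, the sketch below kills one named container in a single matching pod and lets Kubernetes restart it. It assumes Chaos Mesh 2.x, where the field is the list `containerNames` (1.x used a single `containerName`). The experiment name and container name are hypothetical.

```yaml
# Sketch only: kill one named container in one matching pod.
# Assumes Chaos Mesh 2.x; names below are illustrative, not taken from the ADR.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-container-kill   # hypothetical experiment name
  namespace: bitvelocity
spec:
  action: container-kill
  mode: one                            # smallest blast radius: a single pod
  containerNames:
    - order-service                    # hypothetical container name inside the pod
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: order-service
```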

2. Network Chaos

  • Network delay/latency
  • Packet loss
  • Network partition
  • Bandwidth limitation
  • DNS failure
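
Two of the network experiment types above could look roughly like the following. This is a sketch under assumptions: the experiment names, percentages, and durations are placeholders, and `order-service` and `inventory-service` are reused from the examples in this ADR.

```yaml
# Sketch 1: drop a share of packets on one order-service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-service-packet-loss      # hypothetical experiment name
  namespace: bitvelocity
spec:
  action: loss
  mode: one
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: order-service
  loss:
    loss: '25'                         # percent of packets dropped (placeholder value)
    correlation: '50'                  # correlation with the previous packet's outcome
  duration: '2m'
---
# Sketch 2: partition one order-service pod from one inventory-service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-inventory-partition      # hypothetical experiment name
  namespace: bitvelocity
spec:
  action: partition
  mode: one
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: order-service
  direction: both                      # block traffic in both directions
  target:
    mode: one
    selector:
      namespaces: [bitvelocity]
      labelSelectors:
        app: inventory-service
  duration: '2m'
```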

3. Stress Chaos

  • CPU stress
  • Memory stress
  • I/O stress
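
A CPU and memory stress experiment might be sketched as below. The worker counts, load percentage, and memory size are placeholder values, and the experiment name is hypothetical.

```yaml
# Sketch only: apply CPU and memory pressure to one order-service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: order-service-cpu-memory-stress   # hypothetical experiment name
  namespace: bitvelocity
spec:
  mode: one
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: order-service
  stressors:
    cpu:
      workers: 1        # number of stress workers
      load: 50          # target load per worker, in percent (placeholder)
    memory:
      workers: 1
      size: '256MB'     # memory each worker tries to occupy (placeholder)
  duration: '3m'
```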

4. Time Chaos

  • Clock skew
  • Time offset
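
A clock-skew experiment could be sketched as follows. The offset and duration are placeholders and the experiment name is hypothetical. Only the selected pod's view of time is shifted, not the node clock.

```yaml
# Sketch only: shift the wall clock seen by one order-service pod back by ten minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: order-service-clock-skew   # hypothetical experiment name
  namespace: bitvelocity
spec:
  mode: one
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: order-service
  timeOffset: '-10m'               # placeholder offset
  clockIds:
    - CLOCK_REALTIME               # affect wall-clock time only
  duration: '5m'
```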

5. Application Chaos

  • JVM chaos (future)
  • HTTP chaos
  • Kernel chaos
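
For HTTP chaos, a minimal sketch might delay matching requests on a single pod. It assumes a Chaos Mesh 2.x installation with `HTTPChaos` available. The port, path, and delay are hypothetical and would need to match the real service.

```yaml
# Sketch only: add latency to GET requests hitting one inventory-service pod.
# Port, path, and delay are assumptions for illustration.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: inventory-http-delay       # hypothetical experiment name
  namespace: bitvelocity
spec:
  mode: one
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: inventory-service
  target: Request                  # inject on the request path rather than the response
  port: 8080                       # hypothetical container port
  method: GET
  path: /api/*                     # hypothetical path pattern
  delay: 2s
  duration: 5m
```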

Safety Guidelines

Blast Radius Control:

  1. Start with mode: one (single pod)
  2. Progress to mode: fixed with a specific count
  3. Never use mode: all in shared environments
  4. Use label selectors carefully
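
The step from `mode: one` to `mode: fixed` can be sketched as below. With `fixed`, Chaos Mesh reads the pod count from the `value` field, which is a string. The experiment name and the count of 2 are placeholders.

```yaml
# Sketch only: widen the blast radius from one pod to a fixed count of two.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-pod-kill-fixed   # hypothetical experiment name
  namespace: bitvelocity
spec:
  action: pod-kill
  mode: fixed
  value: '2'                           # number of pods to affect (placeholder)
  selector:
    namespaces: [bitvelocity]          # keep the namespace explicit to limit scope
    labelSelectors:
      app: order-service               # a narrow label keeps unrelated pods out
```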

Environment Strategy:

  • Dev: Open experimentation
  • Staging: Scheduled chaos (nightly)
  • Production: Manual game days only (future)
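
Nightly staging chaos could be expressed with a `Schedule` resource, as in the sketch below. This assumes Chaos Mesh 2.x, where scheduling lives in its own resource. The cron time, history limit, and namespace are placeholders, since the ADR does not specify the staging setup.

```yaml
# Sketch only: kill one order-service pod every night at 02:00.
# Assumes Chaos Mesh 2.x; time and namespace are illustrative.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-order-service-pod-kill   # hypothetical schedule name
  namespace: bitvelocity
spec:
  schedule: '0 2 * * *'                  # standard cron expression (placeholder time)
  type: PodChaos
  historyLimit: 5                        # keep the last five runs for review
  concurrencyPolicy: Forbid              # never overlap with a run still in progress
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [bitvelocity]
      labelSelectors:
        app: order-service
```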

Always Have:

  • Rollback plan (pause/delete experiment)
  • Active monitoring during experiment
  • Team notification before running
  • Post-mortem documentation
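
One way to realize the "pause" half of the rollback plan is the Chaos Mesh pause annotation, sketched below on a single-pod latency experiment. Setting the annotation to 'true' pauses injection, removing it resumes the experiment, and deleting the resource ends it. The manifest values are illustrative.

```yaml
# Sketch only: a latency experiment carrying the pause annotation.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: inventory-latency
  namespace: bitvelocity
  annotations:
    experiment.chaos-mesh.org/pause: 'true'   # pauses injection while present
spec:
  action: delay
  mode: one
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: inventory-service
  delay:
    latency: '200ms'
    jitter: '50ms'
  duration: '5m'
```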

Game Days

Structured chaos exercises with team participation:

Format:

  1. Preparation (15min): Set baseline, verify monitoring
  2. Chaos Injection (30min): Run experiments, observe
  3. Incident Response (30min): Team responds as in production
  4. Recovery (15min): Clean up, verify restoration
  5. Debrief (30min): Discuss learnings, action items

Frequency:

  • Monthly game days for primary domains
  • Quarterly cross-domain scenarios
  • Ad-hoc for new features

Experiment Workflow

# Standard experiment template
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: <service>-<type>-<scenario>
  namespace: bitvelocity
spec:
  action: <pod-kill | pod-failure>
  mode: <one | fixed | all>      # 'fixed' also requires value: '<count>'
  selector:
    namespaces: [bitvelocity]
    labelSelectors:
      app: <service-name>
  duration: '<duration>'
  scheduler:                     # optional; Chaos Mesh 1.x only (2.x uses a separate Schedule resource)
    cron: '@every <interval>'

Validation Criteria

Every experiment must define:

  • Hypothesis: Expected behavior
  • Blast radius: What can be affected
  • Success criteria: Metrics/logs to check
  • Failure indicators: What indicates the experiment failed
  • Recovery validation: How to confirm the system recovered
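
The five criteria could be captured in a small record stored next to each manifest. The schema below is purely hypothetical (it is not a Chaos Mesh resource), and the thresholds are deliberately left for the team to fill in.

```yaml
# Hypothetical experiment record; field names are illustrative, not a Chaos Mesh schema.
experiment: order-service-pod-kill
hypothesis: >
  Killing one order-service pod causes no user-visible errors because
  the remaining replicas absorb traffic while the pod is rescheduled.
blastRadius:
  namespace: bitvelocity
  services: [order-service]
  mode: one
successCriteria:
  - error rate stays below the agreed threshold      # threshold left to the team
  - replacement pod becomes Ready
failureIndicators:
  - sustained errors on order endpoints
  - circuit breakers stay open after the pod is back
recoveryValidation:
  - all replicas Ready
  - latency percentiles return to the pre-experiment baseline
```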

Integration with Observability

Required instrumentation for chaos engineering:

  • Circuit breaker state metrics
  • Retry attempt counters
  • Error rate by type
  • Latency percentiles
  • Resource utilization
  • Distributed tracing for failure propagation

Documentation

All experiments must be documented with:

  • Purpose and learning objectives
  • Expected vs actual behavior
  • Metrics collected
  • Findings and improvements identified
  • Action items created

Consequences

Positive

  • Hands-on resilience pattern learning
  • Identifies weak points before production
  • Builds team confidence in failure handling
  • Improves observability and alerting
  • Documents failure modes and responses
  • Validates architecture decisions

Negative

  • Requires stable observability stack
  • Can be disruptive if not carefully scoped
  • Requires discipline and documentation
  • Initial setup and learning curve
  • May expose uncomfortable truths about the system

Trade-offs

  • Chose Chaos Mesh over Litmus for better K8s integration
  • Manual game days over automated chaos (learning > automation)
  • Structured experiments over random chaos (educational value)

Implementation Plan

  1. Phase 1 (Week 1): Setup

       • Install Chaos Mesh on dev cluster
       • Create experiment templates
       • Document safety procedures

  2. Phase 2 (Week 2): Basic Experiments

       • Pod kill experiments for 3 services
       • Network latency experiments
       • Document baseline behavior

  3. Phase 3 (Week 3): Game Day Preparation

       • Create first game day runbook
       • Practice with team
       • Set up communication channels

  4. Phase 4 (Week 4): First Game Day

       • Run inventory service failure scenario
       • Document learnings
       • Create improvement backlog

  5. Phase 5 (Ongoing): Expansion

       • Add more experiment types
       • Increase complexity
       • Introduce cross-service scenarios
       • Consider CI integration (staging)

Examples

Simple Pod Kill

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: order-service
  scheduler:
    cron: '@every 2m'

Network Latency

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: inventory-latency
spec:
  action: delay
  mode: all  # affects every inventory-service pod; dev only, per Blast Radius Control
  selector:
    labelSelectors:
      app: inventory-service
  delay:
    latency: '200ms'
    jitter: '50ms'
  duration: '5m'

References