ADR-016: Chaos Engineering Framework
Status
Accepted
Context
To build confidence in system reliability and validate resilience patterns, BitVelocity needs systematic chaos engineering practices. This provides hands-on learning of: - Failure mode analysis - Incident response - Observability under stress - Circuit breaker and retry patterns - Distributed system failure scenarios
Decision
Chaos Engineering Platform
Chaos Mesh - Selected for: - Native Kubernetes integration - Rich experiment types - Easy YAML-based configuration - Good documentation - Active community - Free and open-source
Experiment Categories
1. Pod Chaos
- Pod kill
- Pod failure
- Container kill
- Specific container restart
2. Network Chaos
- Network delay/latency
- Packet loss
- Network partition
- Bandwidth limitation
- DNS failure
3. Stress Chaos
- CPU stress
- Memory stress
- I/O stress
4. Time Chaos
- Clock skew
- Time offset
5. Application Chaos
- JVM chaos (future)
- HTTP chaos
- Kernel chaos
Safety Guidelines
Blast Radius Control:
1. Start with mode: one (single pod)
2. Progress to mode: fixed with specific count
3. Never use mode: all in shared environments
4. Use label selectors carefully
Environment Strategy:
- Dev: Open experimentation
- Staging: Scheduled chaos (nightly)
- Production: Manual game days only (future)
Always Have:
- Rollback plan (pause/delete experiment)
- Active monitoring during experiment
- Team notification before running
- Post-mortem documentation
Game Days
Structured chaos exercises with team participation:
Format: 1. Preparation (15min): Set baseline, verify monitoring 2. Chaos Injection (30min): Run experiments, observe 3. Incident Response (30min): Team responds as in production 4. Recovery (15min): Clean up, verify restoration 5. Debrief (30min): Discuss learnings, action items
Frequency:
- Monthly game days for primary domains
- Quarterly cross-domain scenarios
- Ad-hoc for new features
Experiment Workflow
# Standard experiment template
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: <service>-<type>-<scenario>
namespace: bitvelocity
spec:
action: <pod-kill | pod-failure>
mode: <one | fixed | all>
selector:
namespaces: [bitvelocity]
labelSelectors:
app: <service-name>
duration: '<duration>'
scheduler:
cron: '@every <interval>' # optional
Validation Criteria
Every experiment must define: - Hypothesis: Expected behavior - Blast radius: What can be affected - Success criteria: Metrics/logs to check - Failure indicators: What means experiment failed - Recovery validation: How to confirm system recovered
Integration with Observability
Required instrumentation for chaos engineering: - Circuit breaker state metrics - Retry attempt counters - Error rate by type - Latency percentiles - Resource utilization - Distributed tracing for failure propagation
Documentation
All experiments must be documented with: - Purpose and learning objectives - Expected vs actual behavior - Metrics collected - Findings and improvements identified - Action items created
Consequences
Positive
- Hands-on resilience pattern learning
- Identifies weak points before production
- Builds team confidence in failure handling
- Improves observability and alerting
- Documents failure modes and responses
- Validates architecture decisions
Negative
- Requires stable observability stack
- Can be disruptive if not carefully scoped
- Requires discipline and documentation
- Initial setup and learning curve
- May expose uncomfortable truths about system
Trade-offs
- Chose Chaos Mesh over Litmus for better K8s integration
- Manual game days over automated chaos (learning > automation)
- Structured experiments over random chaos (educational value)
Implementation Plan
- Phase 1 (Week 1): Setup
- Install Chaos Mesh on dev cluster
- Create experiment templates
-
Document safety procedures
-
Phase 2 (Week 2): Basic Experiments
- Pod kill experiments for 3 services
- Network latency experiments
-
Document baseline behavior
-
Phase 3 (Week 3): Game Day Preparation
- Create first game day runbook
- Practice with team
-
Set up communication channels
-
Phase 4 (Week 4): First Game Day
- Run inventory service failure scenario
- Document learnings
-
Create improvement backlog
-
Phase 5 (Ongoing): Expansion
- Add more experiment types
- Increase complexity
- Introduce cross-service scenarios
- Consider CI integration (staging)
Examples
Simple Pod Kill
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: order-service-pod-kill
spec:
action: pod-kill
mode: one
selector:
labelSelectors:
app: order-service
scheduler:
cron: '@every 2m'
Network Latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: inventory-latency
spec:
action: delay
mode: all
selector:
labelSelectors:
app: inventory-service
delay:
latency: '200ms'
jitter: '50ms'
duration: '5m'
References
bv-chaos-experiments/README.mdbv-chaos-experiments/game-days/runbook-inventory-failure.md- Chaos Mesh Documentation
- Principles of Chaos Engineering
- ADR-007: Observability Baseline (monitoring requirements)
- ADR-006: Retry & Backoff Policies (patterns to validate)