Recipe
Chaos Testing
Inject failure deliberately to discover where your system breaks before your users do.
Core Loop
- Define a steady-state hypothesis — what does “healthy” look like?
- Introduce a fault: kill a pod, spike latency, corrupt a response.
- Measure blast radius. Did the circuit breaker trip? Did the fallback serve stale data?
- Roll back the fault and verify the system returns to steady state.
Fault Menu
- network-latency — add 500ms to egress
- dns-failure — resolve upstream to 127.0.0.1
- disk-fill — write 1GB junk to /tmp
- process-kill — SIGKILL a random worker
Guardrails
Never run chaos experiments in production without a kill-switch. Start in staging, limit blast radius with a canary deployment, and schedule experiments during business hours so on-call is awake.