Recipe

Chaos Testing

Inject failure deliberately to discover where your system breaks before your users do.

Core Loop

  1. Define a steady-state hypothesis — what does “healthy” look like?
  2. Introduce a fault: kill a pod, spike latency, corrupt a response.
  3. Measure the blast radius. Did the circuit breaker trip? Did the fallback serve stale data?
  4. Roll back the fault and verify the system returns to steady state.
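
The four steps above can be sketched as a small harness. This is a minimal illustration, not a reference implementation: run_experiment, ExperimentResult, and the toy in-memory state are hypothetical names, and a real steady-state check would query your metrics and wait for the fault to take effect before measuring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_during_fault: bool  # did the hypothesis hold while the fault was active?
    recovered: bool            # did the system return to steady state after rollback?


def run_experiment(
    is_steady: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Step 1: only start from a known-healthy baseline.
    if not is_steady():
        raise RuntimeError("not in steady state; refusing to inject a fault")

    steady_during_fault = False
    try:
        inject()                           # Step 2: introduce the fault
        steady_during_fault = is_steady()  # Step 3: observe the impact
    finally:
        rollback()                         # Step 4: runs even if a step above raises

    return ExperimentResult(steady_during_fault, recovered=is_steady())


# Toy in-memory "system" standing in for real metrics (an assumption for the demo).
state = {"latency_ms": 40}
result = run_experiment(
    is_steady=lambda: state["latency_ms"] < 200,
    inject=lambda: state.update(latency_ms=540),
    rollback=lambda: state.update(latency_ms=40),
)
print(result)
```

Putting the rollback in a finally block is the main design choice here: the fault is removed even when the measurement step fails.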

Fault Menu

  • network-latency — add 500ms to egress
  • dns-failure — resolve upstream to 127.0.0.1
  • disk-fill — write 1GB junk to /tmp
  • process-kill — SIGKILL a random worker
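
One way to encode this menu is a table that pairs each fault with an inject command and a rollback command. The sketch below only builds the command strings and never executes them. It assumes a Linux host with tc (netem), dd, and a writable /etc/hosts. The interface name eth0, the hostname upstream.internal, the junk file path, and the backup file name are all placeholders, not values from this recipe.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Fault:
    inject: str              # shell command that starts the fault
    rollback: Optional[str]  # shell command that undoes it, or None


def build_menu(
    worker_pids: List[int],
    iface: str = "eth0",                  # assumed interface name
    upstream: str = "upstream.internal",  # assumed upstream hostname
    junk: str = "/tmp/chaos-junk",        # assumed junk file path
) -> Dict[str, Fault]:
    victim = random.choice(worker_pids)   # "a random worker"
    return {
        "network-latency": Fault(
            inject=f"tc qdisc add dev {iface} root netem delay 500ms",
            rollback=f"tc qdisc del dev {iface} root netem",
        ),
        "dns-failure": Fault(
            # Back up the hosts file, then pin the upstream name to loopback.
            inject=(
                "cp /etc/hosts /etc/hosts.chaos-bak && "
                f"echo '127.0.0.1 {upstream}' >> /etc/hosts"
            ),
            rollback="cp /etc/hosts.chaos-bak /etc/hosts",
        ),
        "disk-fill": Fault(
            # 1024 blocks of 1 MiB of zeros.
            inject=f"dd if=/dev/zero of={junk} bs=1M count=1024",
            rollback=f"rm -f {junk}",
        ),
        "process-kill": Fault(
            inject=f"kill -KILL {victim}",
            rollback=None,  # nothing to undo; a supervisor is expected to restart it
        ),
    }


# Dry run: print what would be executed, without touching the host.
for name, fault in build_menu(worker_pids=[4242]).items():
    print(f"{name}: {fault.inject}  (rollback: {fault.rollback})")
```

Keeping the rollback next to the inject command makes it harder to add a fault that nobody knows how to undo.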

Guardrails

Never run chaos experiments in production without a kill-switch that aborts the experiment and rolls back the fault. Start in staging, limit the blast radius with a canary deployment, and schedule experiments during business hours so the on-call engineer is awake and available.
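
These guardrails can be enforced with a pre-flight check before any fault is injected. The sketch below is one possible shape, under stated assumptions: the kill-switch is modeled as a file whose presence means "stop", business hours are taken to be Monday to Friday, 09:00 to 17:00 local time, and may_run and in_business_hours are hypothetical names.

```python
from datetime import datetime, time
from pathlib import Path
from typing import Optional


def in_business_hours(now: datetime) -> bool:
    # Assumed window: Monday to Friday, 09:00 (inclusive) to 17:00 (exclusive).
    return now.weekday() < 5 and time(9) <= now.time() < time(17)


def may_run(now: datetime, environment: str, kill_switch: Optional[Path]) -> bool:
    if environment == "production" and kill_switch is None:
        return False  # production requires a configured kill-switch
    if kill_switch is not None and kill_switch.exists():
        return False  # the kill-switch file is present: stop
    return in_business_hours(now)


# Example: a Wednesday morning in staging, with no kill-switch configured.
print(may_run(datetime(2024, 1, 10, 10, 0), "staging", None))
```

A long-running experiment would call the same check repeatedly, not only at the start, so that creating the kill-switch file stops it mid-run.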