Recipe

Chaos Testing

Inject failure deliberately to discover where your system breaks before your users do.

Core Loop

Define a steady-state hypothesis — what does “healthy” look like?
Introduce a fault: kill a pod, spike latency, corrupt a response.
Measure blast radius. Did the circuit breaker trip? Did the fallback serve stale data?
Roll back the fault and verify the system returns to steady state.

Fault Menu

network-latency — add 500ms to egress
dns-failure — resolve upstream to 127.0.0.1
disk-fill — write 1GB junk to /tmp
process-kill — SIGKILL a random worker

Guardrails

Never run chaos experiments in production without a kill-switch. Start in staging, limit blast radius with a canary deployment, and schedule experiments during business hours so on-call is awake.