Recipe
GameDay Design
Structured chaos engineering sessions that prove your system survives real-world failure modes — without the panic.
What It Is
A GameDay is a scheduled event where you deliberately inject failures into a production-like environment and observe how the system — and the team — responds. Think fire drill, not fire.
Core Principles
- Hypothesis-driven. Every injection starts with “we expect X to happen.”
- Blast-radius control. Limit scope so you don’t take down prod.
- Observability first. If you can’t see it, you can’t learn from it.
- Blameless postmortem. Findings go into action items, not personnel files.
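One way to make the first two principles concrete is to write each experiment down as data before anything is broken. The sketch below is illustrative only: the `Experiment` class, the `should_abort` helper, and the service and metric names are assumptions made for this example, not part of any chaos tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """One GameDay injection, recorded before the fault is injected."""
    service: str            # the single service in scope
    hypothesis: str         # "we expect X to happen"
    metric: str             # the one metric to watch
    abort_threshold: float  # stop the injection if the metric exceeds this


def should_abort(experiment: Experiment, observed: float) -> bool:
    """Blast-radius guard: True once the watched metric crosses the limit."""
    return observed > experiment.abort_threshold


exp = Experiment(
    service="checkout",               # hypothetical service name
    hypothesis="Killing one pod keeps the 5xx ratio below 1%.",
    metric="http_5xx_ratio",          # hypothetical metric name
    abort_threshold=0.05,             # assumed limit for this example
)

print(should_abort(exp, 0.02))  # False: within the agreed blast radius
print(should_abort(exp, 0.20))  # True: roll the injection back
```

Agreeing on the abort threshold up front means nobody has to debate whether to stop while the fault is live.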
Runbook
- Define the scope. Pick one service, one failure mode, one metric to watch.
- Write the hypothesis. “If we kill the primary DB, the read replica takes over in under 5 seconds.”
- Instrument everything. Dashboards, alerts, logs — all up before you start.
- Inject the fault. Use tools like Chaos Mesh, Gremlin, or a simple iptables rule.
- Observe and record. Did the system behave as expected? What surprised you?
- Debrief. Within 24 hours, document findings and assign remediation owners.
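The example hypothesis in the runbook ("takes over in under 5 seconds") can be checked with a stopwatch around a health probe. This is a minimal sketch, assuming you supply your own `is_healthy` callable, for instance one that attempts a query against the database endpoint. The function names and the 0.1 second poll interval are assumptions for illustration.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll is_healthy until it returns True.

    Returns the elapsed seconds at the first healthy probe,
    or None if the timeout passes without recovery.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_interval_s)


def hypothesis_holds(recovery_s: Optional[float], limit_s: float) -> bool:
    """True only if the system recovered, and did so within the limit."""
    return recovery_s is not None and recovery_s < limit_s
```

Call `measure_recovery` immediately after injecting the fault, then record both the raw number and the verdict from `hypothesis_holds(recovery, limit_s=5.0)` for the debrief. The measured time is only as precise as the poll interval, so keep the interval well below the limit you are testing.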
Common Injections
- Network latency: Add 200 ms to inter-service calls.
- Pod kill: Terminate a random pod mid-request.
- DNS failure: Drop DNS responses for 30 seconds.
- Disk fill: Write until the volume hits 95%.
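For teams going the "simple iptables rule" route rather than a dedicated chaos tool, the four injections above might map to standard Linux and Kubernetes commands roughly as follows. This sketch only builds the command lines and does not run them. The interface name, pod name, namespace, and file path are placeholders, and the choice of `tc`, `kubectl`, `iptables`, and `fallocate` is an assumption about your environment.

```python
import shlex


def latency_cmd(interface: str, delay_ms: int) -> list[str]:
    """Network latency: add a fixed delay to egress traffic with tc netem."""
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


def latency_rollback_cmd(interface: str) -> list[str]:
    """Remove the netem qdisc again to end the injection."""
    return ["tc", "qdisc", "del", "dev", interface, "root", "netem"]


def pod_kill_cmd(pod: str, namespace: str) -> list[str]:
    """Pod kill: delete one pod; picking it at random is left to the caller."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


def dns_drop_cmd(remove: bool = False) -> list[str]:
    """DNS failure: drop inbound UDP replies from port 53.

    Pass remove=True after the 30 seconds to delete the rule.
    DNS over TCP is not covered by this rule.
    """
    action = "-D" if remove else "-A"
    return ["iptables", action, "INPUT", "-p", "udp",
            "--sport", "53", "-j", "DROP"]


def bytes_to_fill(total_bytes: int, used_bytes: int, target_percent: int = 95) -> int:
    """Disk fill: bytes still to write before usage reaches target_percent."""
    return max(0, total_bytes * target_percent // 100 - used_bytes)


def disk_fill_cmd(path: str, size_bytes: int) -> list[str]:
    """Allocate a filler file of the given size (needs filesystem support)."""
    return ["fallocate", "-l", str(size_bytes), path]


print(shlex.join(latency_cmd("eth0", 200)))
# tc qdisc add dev eth0 root netem delay 200ms
print(bytes_to_fill(total_bytes=1000, used_bytes=500))
# 450
```

Pairing every injection with its rollback command before the session starts keeps the blast radius bounded: the DNS rule is removed with `dns_drop_cmd(remove=True)`, and the disk fill is undone by deleting the filler file.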
Next Steps
Start small — one team, one hour, one failure mode. Build the muscle before you scale. When you’re ready, explore the Incident Response recipe to close the loop.