Recipe

GameDay Design

Structured chaos engineering sessions that prove your system survives real-world failure modes — without the panic.

What It Is

A GameDay is a scheduled event where you deliberately inject failures into a production-like environment and observe how the system — and the team — responds. Think fire drill, not fire.

Core Principles

▸ Hypothesis-driven. Every injection starts with “we expect X to happen.”
▸ Blast-radius control. Limit scope so you don’t take down prod.
▸ Observability first. If you can’t see it, you can’t learn from it.
▸ Blameless postmortem. Findings go into action items, not personnel files.
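
One way to make the first two principles hard to skip is to write each experiment down as data before anything gets injected. The sketch below is one possible shape, not a prescribed format: the `Experiment` class, its field names, and the 25% default cap are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Experiment:
    """A GameDay experiment record; every field name here is illustrative."""

    hypothesis: str              # the "we expect X to happen" statement
    targets: Tuple[str, ...]     # instances the injection is allowed to touch
    fleet_size: int              # total instances of the service
    max_fraction: float = 0.25   # assumed cap on the share of the fleet at risk

    def __post_init__(self) -> None:
        # Hypothesis-driven: refuse to exist without a written expectation.
        if not self.hypothesis.strip():
            raise ValueError("every injection needs a written hypothesis")
        if self.fleet_size <= 0:
            raise ValueError("fleet_size must be positive")
        # Blast-radius control: refuse target lists above the agreed cap.
        if len(self.targets) > self.fleet_size * self.max_fraction:
            raise ValueError("target list exceeds the allowed blast radius")
```

With these defaults, `Experiment("p99 stays under 500 ms", ("web-1",), fleet_size=8)` is accepted, while naming three of the eight instances raises `ValueError`.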

Runbook

1. Define the scope. Pick one service, one failure mode, one metric to watch.
2. Write the hypothesis. “If we kill the primary DB, the read replica takes over in under 5 seconds.”
3. Instrument everything. Dashboards, alerts, logs: all up before you start.
4. Inject the fault. Use tools like Chaos Mesh, Gremlin, or a simple iptables rule.
5. Observe and record. Did the system behave as expected? What surprised you?
6. Debrief. Within 24 hours, document findings and assign remediation owners.
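
The "write the hypothesis" and "observe and record" steps meet when the recorded observations are checked against the expectation. Here is a minimal sketch, assuming health probes were logged during the injection as `(timestamp_seconds, healthy)` pairs. The probe format and both function names are hypothetical; the 5-second limit comes from the example hypothesis above.

```python
from typing import Optional, Sequence, Tuple

Probe = Tuple[float, bool]  # (timestamp in seconds, did the probe succeed?)


def recovery_seconds(probes: Sequence[Probe]) -> Optional[float]:
    """Time from the first failed probe to the next successful one.

    Returns None if no failure was observed or the service never recovered.
    """
    first_failure = None
    for timestamp, healthy in probes:
        if not healthy and first_failure is None:
            first_failure = timestamp
        elif healthy and first_failure is not None:
            return timestamp - first_failure
    return None


def hypothesis_holds(probes: Sequence[Probe], limit_seconds: float = 5.0) -> bool:
    """True only if an outage was observed and it ended in under limit_seconds.

    A run with no observed failure, or no recovery, counts as not confirmed.
    """
    elapsed = recovery_seconds(probes)
    return elapsed is not None and elapsed < limit_seconds
```

For probes `[(0.0, True), (1.0, False), (2.0, False), (4.5, True)]` the measured recovery is 3.5 seconds, so the hypothesis holds. The resolution of the answer is only as good as the probe interval, so probe more often than the limit you are testing.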

Common Injections

Network latency

Add 200 ms of latency to inter-service calls.
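
On Linux hosts, one common way to do this is the `netem` queueing discipline from `tc`. The sketch below only builds the inject and rollback commands so they can be reviewed before anyone runs them. The helper name and the `eth0` interface are assumptions, and running the commands needs root privileges.

```python
import shlex
from typing import List


def netem_delay_commands(interface: str, delay_ms: int = 200) -> List[List[str]]:
    """Return [inject, rollback] tc commands for a fixed egress delay."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    inject = [
        "tc", "qdisc", "add", "dev", interface,
        "root", "netem", "delay", f"{delay_ms}ms",
    ]
    # Deleting the root qdisc restores the interface's default behaviour.
    rollback = ["tc", "qdisc", "del", "dev", interface, "root"]
    return [inject, rollback]


def as_shell(command: List[str]) -> str:
    """Render a command list as a copy-pasteable shell line."""
    return shlex.join(command)
```

`as_shell(netem_delay_commands("eth0")[0])` gives `tc qdisc add dev eth0 root netem delay 200ms`. Note that a root `netem` qdisc delays all outgoing traffic on that interface, not only inter-service calls, so apply it to a host or pod where that is acceptable.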

Pod kill

Terminate a random pod mid-request.
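
On Kubernetes, a pod kill can be as small as choosing a name at random and deleting it with `kubectl`. A minimal sketch, assuming you already have the list of candidate pod names; the function name and the `rng` parameter (there so a run can be reproduced) are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def pod_kill_command(
    pods: Sequence[str],
    namespace: str,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick one pod at random and return the kubectl command that deletes it."""
    if not pods:
        raise ValueError("no candidate pods to choose from")
    chooser = rng if rng is not None else random.Random()
    victim = chooser.choice(list(pods))
    return ["kubectl", "delete", "pod", victim, "--namespace", namespace]
```

The command is returned rather than executed, so it can be logged alongside the hypothesis first. To hit a pod mid-request, keep real or synthetic traffic flowing while it runs. A plain `kubectl delete` triggers graceful termination; an abrupt kill is a different experiment with a different hypothesis.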

DNS failure

Drop DNS responses for 30 seconds.
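
With plain `iptables`, DNS responses can be dropped by matching inbound UDP packets whose source port is 53. The sketch below is one way to wrap that in a timed injection with a guaranteed rollback. The function name and the injectable `run` and `sleep` callables are assumptions, added so the logic can be exercised without root.

```python
import time
from typing import Callable, Sequence

# Matches inbound UDP packets with source port 53, i.e. DNS responses.
DNS_RESPONSE_RULE = ["INPUT", "-p", "udp", "--sport", "53", "-j", "DROP"]


def drop_dns_responses(
    run: Callable[[Sequence[str]], object],
    duration_seconds: float = 30.0,
    sleep: Callable[[float], object] = time.sleep,
) -> None:
    """Append the DROP rule, wait, then always delete it again."""
    run(["iptables", "-A", *DNS_RESPONSE_RULE])
    try:
        sleep(duration_seconds)
    finally:
        # Roll back even if the wait is interrupted, for example by Ctrl+C.
        run(["iptables", "-D", *DNS_RESPONSE_RULE])
```

On a real host, `run` could be `lambda cmd: subprocess.run(cmd, check=True)` executed as root. This rule does not touch DNS over TCP or answers the application has already cached, so expect failures to appear gradually rather than all at once.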

Disk fill

Write data until the volume reaches 95% of capacity.
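
A disk fill needs two things: knowing how many bytes are left before the target, and a single filler file that is easy to delete afterwards. A minimal sketch using only the standard library; the function names and the `gameday.fill` file name are made up for illustration.

```python
import os
import shutil


def bytes_until_percent(total: int, used: int, target_percent: int = 95) -> int:
    """Bytes still to be written before usage reaches target_percent."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding on large volumes.
    return max(0, total * target_percent // 100 - used)


def fill_volume(mount_point: str, filler_name: str = "gameday.fill",
                chunk: int = 1 << 20) -> str:
    """Write zero bytes into one file until the volume reaches the target."""
    usage = shutil.disk_usage(mount_point)
    remaining = bytes_until_percent(usage.total, usage.used)
    path = os.path.join(mount_point, filler_name)
    with open(path, "wb") as filler:
        while remaining > 0:
            size = min(chunk, remaining)
            filler.write(b"\0" * size)
            remaining -= size
    return path  # delete this file to roll back
```

For a volume of 1000 bytes with 600 used, `bytes_until_percent(1000, 600)` returns 350. Treat the final figure as approximate: other writers, filesystem reserved blocks, and compressing filesystems (where zeros take almost no space) can all move it.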

Next Steps

Start small — one team, one hour, one failure mode. Build the muscle before you scale. When you’re ready, explore the Incident Response recipe to close the loop.