Recipe

Runbook Design

Build structured, repeatable incident response runbooks that your team can execute under pressure without second-guessing.

Why Runbooks Matter

A runbook turns tribal knowledge into a shared asset. When an alert fires at 3 AM, the on-call engineer follows a tested path instead of digging through Slack scrollback. Good runbooks reduce mean time to resolution and keep a recurring incident from escalating a second time.

Core Structure

Trigger

What alert, metric, or page starts this runbook.

Triage

Quick checks to confirm severity and scope.

Mitigate

Stop the bleeding: roll back, drain traffic, kill pods.

Resolve

Root-cause fix with verification steps.

Postmortem

Timeline, impact, and action items.

Automate

Turn manual steps into code or alerts.
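
The six stages above can be treated as a fixed template that every runbook fills in. The following is a minimal sketch, assuming a hypothetical Runbook class and an illustrative scenario; neither comes from a specific tool. It reports which stages a draft runbook has not covered yet.

```python
from dataclasses import dataclass, field

# Stage names taken from the structure above, in execution order.
STAGES = ("trigger", "triage", "mitigate", "resolve", "postmortem", "automate")


@dataclass
class Runbook:
    """Hypothetical container: one list of plain-text steps per stage."""
    title: str
    steps: dict[str, list[str]] = field(default_factory=dict)

    def missing_stages(self) -> list[str]:
        # A stage counts as missing if it is absent or has no steps.
        return [s for s in STAGES if not self.steps.get(s)]


# Illustrative draft runbook (assumed scenario and thresholds).
runbook = Runbook(
    title="API 5xx rate above threshold",
    steps={
        "trigger": ["Alert: 5xx rate above 5% for 5 minutes"],
        "triage": ["Check which regions are affected"],
        "mitigate": ["Roll back the latest deploy"],
    },
)

print(runbook.missing_stages())  # ['resolve', 'postmortem', 'automate']
```

A check like this could run in review or CI so that a runbook with an empty Mitigate or Resolve stage is caught before someone needs it at 3 AM.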

Design Principles

Assume stress. Write for someone who just woke up. Use bold keywords, not paragraphs.
One path per scenario. Branching logic lives in separate runbooks, not inline if-trees.
Every step is verifiable. Include expected output or a check command so the operator knows it worked.
Version and review. Runbooks rot. Schedule quarterly drills and update stale steps.
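
To illustrate "every step is verifiable", here is a rough sketch that pairs each step with a check command and the output the operator should expect. The Step class, the verify helper, and the stand-in check command are all assumptions made for illustration; a real step would call your own tooling.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical runbook step: an instruction plus a way to verify it."""
    instruction: str
    check_cmd: list[str]  # command whose output shows the step worked
    expected: str         # substring that must appear in stdout


def verify(step: Step, timeout: float = 10.0) -> bool:
    # Run the check command and compare its output to the expectation.
    result = subprocess.run(
        step.check_cmd, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode == 0 and step.expected in result.stdout


# Stand-in check so the example runs anywhere; in practice this would be
# something like a health-check script owned by your team.
step = Step(
    instruction="Roll back the latest deploy",
    check_cmd=[sys.executable, "-c", "print('status: healthy')"],
    expected="status: healthy",
)

print(verify(step))  # True
```

Keeping the expected output next to the instruction means the operator does not have to judge whether a command "looked fine"; the step either matches or it does not.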

Pro tip

Store runbooks alongside your alert definitions so each alert carries its runbook link. When PagerDuty fires, the link is already in the notification, and the first triage step is one click away.
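
One way to enforce this is a small lint check over the alert definitions. The sketch below assumes alerts are available as plain dictionaries with a hypothetical runbook_url field; the field names and URLs are illustrative and not taken from PagerDuty or any specific monitoring tool.

```python
# Hypothetical alert definitions; field names and URLs are illustrative.
ALERTS = [
    {"name": "api-5xx-rate", "runbook_url": "https://runbooks.example.com/api-5xx-rate"},
    {"name": "queue-backlog", "runbook_url": ""},
    {"name": "disk-usage-high"},
]


def alerts_without_runbook(alerts: list[dict]) -> list[str]:
    # Flag alerts whose runbook link is missing or blank.
    return [a["name"] for a in alerts if not a.get("runbook_url", "").strip()]


print(alerts_without_runbook(ALERTS))  # ['queue-backlog', 'disk-usage-high']
```

Run against the real definitions in CI, a check like this keeps new alerts from shipping without a runbook attached.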