
On-call rotation design

A repeatable blueprint for building an on-call rotation that engineers don't hate. Covers scheduling, escalation, compensation, and the cultural norms that keep burnout at bay.

1. Define the tiers

Start with two tiers: primary (first responder) and secondary (escalation). The primary acknowledges pages within 5 minutes; the secondary is paged if the primary hasn't acknowledged within 10 minutes. Add a tertiary tier only when the team exceeds 8 engineers.
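
The tier rules above can be sketched as a small escalation check. This is a minimal illustration, not a real paging tool's API: the function names and constants are hypothetical, and it assumes escalation happens at exactly the 10-minute mark.

```python
# Hypothetical constants taken from the rules in this section.
PRIMARY_ACK_TARGET_MIN = 5      # primary is expected to ack within 5 minutes
SECONDARY_ESCALATION_MIN = 10   # secondary is paged if still unacked at 10 minutes
TERTIARY_TEAM_SIZE = 8          # add a tertiary tier only above this team size


def tiers_for_team(team_size: int) -> list[str]:
    """Return the tiers a team of this size should run."""
    tiers = ["primary", "secondary"]
    if team_size > TERTIARY_TEAM_SIZE:
        tiers.append("tertiary")
    return tiers


def who_to_page(minutes_since_page: float, acked: bool) -> list[str]:
    """Return the tiers that should be paged for an alert right now."""
    if acked:
        return []  # someone owns the alert; stop escalating
    if minutes_since_page < SECONDARY_ESCALATION_MIN:
        return ["primary"]
    return ["primary", "secondary"]


def primary_missed_target(minutes_to_ack: float) -> bool:
    """True if the primary took longer than the 5-minute ack target."""
    return minutes_to_ack > PRIMARY_ACK_TARGET_MIN
```

For example, who_to_page(3, acked=False) pages only the primary, while who_to_page(12, acked=False) pages both tiers.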

2. Pick a cadence

Weekly rotations are the sweet spot. Daily rotations destroy context; monthly rotations burn people out. Run each shift from Monday 10:00 to the following Monday 10:00 so handoffs happen during working hours. If you have engineers spread across three timezones, use a follow-the-sun model so each region covers its own daytime.
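
A weekly Monday-to-Monday schedule is simple enough to generate. The sketch below makes several assumptions: it uses naive local datetimes (so it ignores daylight-saving shifts), the names Shift, next_monday_10, and weekly_schedule are hypothetical, and it fills the secondary slot with the next person in the list, which is one possible pairing rather than something this recipe prescribes.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Shift(NamedTuple):
    start: datetime
    end: datetime
    primary: str
    secondary: str


def next_monday_10(after: datetime) -> datetime:
    """Return the first Monday 10:00 at or after `after` (naive local time)."""
    days_ahead = (0 - after.weekday()) % 7  # Monday is weekday 0
    candidate = (after + timedelta(days=days_ahead)).replace(
        hour=10, minute=0, second=0, microsecond=0
    )
    if candidate < after:
        # `after` is a Monday later than 10:00, so use next week's handoff.
        candidate += timedelta(weeks=1)
    return candidate


def weekly_schedule(engineers: list[str], after: datetime, weeks: int) -> list[Shift]:
    """Build `weeks` consecutive Monday 10:00 to Monday 10:00 shifts."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to fill both tiers")
    first = next_monday_10(after)
    n = len(engineers)
    shifts = []
    for i in range(weeks):
        start = first + timedelta(weeks=i)
        # Assumption: the secondary is the next engineer in the rotation.
        shifts.append(
            Shift(start, start + timedelta(weeks=1), engineers[i % n], engineers[(i + 1) % n])
        )
    return shifts
```

Each shift's end equals the next shift's start, so there is never a gap or an overlap at handoff.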

3. Write the runbook

Every alert must link to a runbook. The runbook answers three questions: what broke, how to confirm it, and how to fix it. If the fix takes more than 15 minutes, the runbook must include an escalation path. Store runbooks in a repo next to the code.
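
These rules are easy to enforce with a check in CI. The following is a rough sketch under stated assumptions: the Runbook fields, the alert-to-link mapping, and both function names are hypothetical and do not correspond to any particular alerting tool's format.

```python
from dataclasses import dataclass
from typing import Optional

MAX_FIX_MINUTES_WITHOUT_ESCALATION = 15


@dataclass
class Runbook:
    what_broke: str
    how_to_confirm: str
    how_to_fix: str
    expected_fix_minutes: int
    escalation_path: Optional[str] = None


def runbook_problems(rb: Runbook) -> list[str]:
    """Return a list of rule violations; an empty list means the runbook passes."""
    problems = []
    # The runbook must answer all three questions.
    for field_name in ("what_broke", "how_to_confirm", "how_to_fix"):
        if not getattr(rb, field_name).strip():
            problems.append(f"missing section: {field_name}")
    # Long fixes need an escalation path.
    needs_escalation = rb.expected_fix_minutes > MAX_FIX_MINUTES_WITHOUT_ESCALATION
    if needs_escalation and not (rb.escalation_path or "").strip():
        problems.append("fix exceeds 15 minutes but no escalation path")
    return problems


def alerts_without_runbook(alerts: dict[str, Optional[str]]) -> list[str]:
    """Given alert name -> runbook link, return the alerts that have no link."""
    return sorted(name for name, link in alerts.items() if not link)
```

Failing the build when either function returns a non-empty list keeps unlinked alerts from reaching the pager.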

4. Compensate fairly

Pay a flat weekly stipend for being on-call, plus an hourly rate for time spent responding to pages. If a page fires after midnight, the hourly rate doubles. Time off in lieu is the minimum; cash is better. Never make on-call "part of the job" without extra pay.
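
The pay formula can be worked through in a few lines. This sketch rests on assumptions the recipe does not specify: "after midnight" is treated as a window from 00:00 to 06:00 local time, the doubled rate applies to the whole page based on when it fired, and the amounts in the usage note are placeholders, not recommendations.

```python
from datetime import datetime, time

NIGHT_START = time(0, 0)
NIGHT_END = time(6, 0)  # assumption: the recipe does not say when "after midnight" ends


def is_night_page(fired_at: datetime) -> bool:
    """True if the page fired inside the assumed night window."""
    return NIGHT_START <= fired_at.time() < NIGHT_END


def weekly_pay(stipend: float, hourly_rate: float,
               pages: list[tuple[datetime, float]]) -> float:
    """Flat stipend plus hourly pay per page, doubled for night pages.

    `pages` is a list of (time the page fired, hours spent responding).
    """
    total = stipend
    for fired_at, hours in pages:
        multiplier = 2 if is_night_page(fired_at) else 1
        total += hourly_rate * hours * multiplier
    return total
```

With a placeholder stipend of 300 and rate of 50, one hour on a 14:00 page plus half an hour on a 02:30 page comes to 300 + 50 + 50 = 400.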

5. Guard the culture

Blameless postmortems are non-negotiable. If an alert fires and the runbook is wrong, the person who got paged is never at fault. Track the false-positive rate weekly; if it exceeds 30%, declare a "pager freeze" and fix the noise before rotating anyone back in.
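
The weekly false-positive check is a single ratio. A minimal sketch, assuming you already count total pages and false positives per week; the function names are hypothetical, and a week with no pages is treated as a 0% rate.

```python
FREEZE_THRESHOLD_PERCENT = 30


def false_positive_rate(total_pages: int, false_positives: int) -> float:
    """Fraction of this week's pages that were false positives."""
    if total_pages == 0:
        return 0.0  # assumption: a quiet week counts as 0% noise
    return false_positives / total_pages


def should_freeze(total_pages: int, false_positives: int) -> bool:
    """True if the rate strictly exceeds 30%, which triggers a pager freeze."""
    # Compare in integers to avoid floating-point edge cases at exactly 30%.
    return false_positives * 100 > total_pages * FREEZE_THRESHOLD_PERCENT
```

Three false positives out of ten pages sits exactly at 30% and does not trigger a freeze; four out of ten does.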

TL;DR

  • Two tiers, weekly rotation, Monday handoff
  • Every alert links to a runbook in the repo
  • Flat stipend + hourly pay, double after midnight
  • Blameless postmortems, pager freeze above 30% false-positive