On-call rotation design
A repeatable blueprint for building an on-call rotation that engineers don't hate. Covers scheduling, escalation, compensation, and the cultural norms that keep burnout at bay.
1. Define the tiers
Start with two tiers: primary (first responder) and secondary (escalation). The primary should acknowledge each page within 5 minutes; if there is no ack within 10 minutes, the page escalates to the secondary. Add a tertiary tier only when the team exceeds 8 engineers.
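The timing rules above can be sketched in a few lines of Python. This is a minimal illustration, not a paging tool's real API: the function names and status strings are hypothetical, and treating a page that has gone exactly 10 minutes without an ack as escalated is an assumption about the boundary.

```python
from datetime import timedelta

ACK_TARGET = timedelta(minutes=5)       # primary's response target
ESCALATE_AFTER = timedelta(minutes=10)  # secondary steps in after this
TERTIARY_TEAM_SIZE = 8                  # add a third tier above this size


def tiers_for(team_size: int) -> list[str]:
    """Return the tiers a team of this size should run."""
    tiers = ["primary", "secondary"]
    if team_size > TERTIARY_TEAM_SIZE:
        tiers.append("tertiary")
    return tiers


def page_status(unacked_for: timedelta) -> str:
    """Classify a page by how long it has gone unacknowledged."""
    if unacked_for >= ESCALATE_AFTER:  # assumption: exactly 10 min escalates
        return "escalate_to_secondary"
    if unacked_for > ACK_TARGET:
        return "primary_late"
    return "primary_within_target"
```

For example, page_status(timedelta(minutes=7)) reports the primary as late but does not yet escalate.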
2. Pick a cadence
Weekly rotations are the sweet spot. Daily rotations destroy context; monthly rotations burn people out. Run each shift from Monday 10:00 to the following Monday 10:00 so handoffs happen during working hours. If your engineers are spread across three widely separated time zones, use a follow-the-sun model so each region covers pages during its own working hours.
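A weekly schedule with a Monday 10:00 handoff could be generated along these lines. This is a sketch under stated assumptions: the function names are hypothetical, datetimes are naive and taken to be in the team's local time, the team has at least two engineers, and making each week's secondary the following week's primary is one possible pairing rather than a rule from this guide.

```python
from datetime import datetime, timedelta

HANDOFF_WEEKDAY = 0  # Monday, per datetime.weekday()
HANDOFF_HOUR = 10    # 10:00 local time


def next_handoff(now: datetime) -> datetime:
    """Return the first Monday 10:00 at or after `now`."""
    candidate = now.replace(hour=HANDOFF_HOUR, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(HANDOFF_WEEKDAY - now.weekday()) % 7)
    if candidate < now:  # it is Monday but already past 10:00
        candidate += timedelta(weeks=1)
    return candidate


def build_schedule(engineers: list[str], start: datetime, weeks: int):
    """Return (shift_start, shift_end, primary, secondary) tuples."""
    shifts = []
    begin = next_handoff(start)
    for i in range(weeks):
        primary = engineers[i % len(engineers)]
        # Assumption: the secondary is whoever is primary next week.
        secondary = engineers[(i + 1) % len(engineers)]
        shifts.append((begin, begin + timedelta(weeks=1), primary, secondary))
        begin += timedelta(weeks=1)
    return shifts
```

Each shift ends exactly when the next one starts, so there is no gap or overlap at the handoff.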
3. Write the runbook
Every alert must link to a runbook. The runbook answers three questions: what broke, how to confirm it, and how to fix it. If the fix takes more than 15 minutes, the runbook must include an escalation path. Store runbooks in the repo, next to the code they cover.
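These rules are simple enough to check automatically, for instance in CI. The sketch below is illustrative only: the alert dictionary shape (a "name" key and an optional "runbook" key), the section phrases it searches for, and the function names are all assumptions, not a format this guide prescribes.

```python
# Hypothetical phrases a runbook is expected to contain, one per question.
REQUIRED_SECTIONS = ("what broke", "how to confirm", "how to fix")
MAX_FIX_MINUTES = 15  # longer fixes need an escalation path


def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that have no runbook link."""
    return [a["name"] for a in alerts if not a.get("runbook")]


def runbook_problems(text: str, fix_minutes: int) -> list[str]:
    """Return a list of rule violations for one runbook's text."""
    lowered = text.lower()
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS
                if s not in lowered]
    if fix_minutes > MAX_FIX_MINUTES and "escalation" not in lowered:
        problems.append("fix exceeds 15 minutes but no escalation path")
    return problems
```

A plain substring match is crude; a real check would probably parse headings, but the idea is the same.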
4. Compensate fairly
Pay a flat weekly stipend for being on-call plus an hourly rate for time spent responding to pages. If a page fires after midnight, the hourly rate doubles. Time in lieu is the minimum; cash is better. Never make on-call "part of the job" without extra pay.
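As a worked example of the pay formula, here is a small sketch. The stipend and hourly figures are placeholders, not recommendations, and the guide does not say when "after midnight" ends, so the 00:00 to 05:59 window below is an assumption to adjust to your own policy.

```python
from datetime import datetime

WEEKLY_STIPEND = 500.0  # placeholder amount
HOURLY_RATE = 60.0      # placeholder amount
NIGHT_END_HOUR = 6      # assumption: "after midnight" means 00:00-05:59 local


def page_pay(fired_at: datetime, hours_worked: float) -> float:
    """Pay for one page; the hourly rate doubles for pages fired at night."""
    multiplier = 2.0 if fired_at.hour < NIGHT_END_HOUR else 1.0
    return HOURLY_RATE * multiplier * hours_worked


def weekly_pay(pages: list[tuple[datetime, float]]) -> float:
    """Flat stipend plus hourly pay for every (fired_at, hours_worked) page."""
    return WEEKLY_STIPEND + sum(page_pay(t, h) for t, h in pages)
```

With these placeholder rates, a week with a one-hour page at 14:30 and a half-hour page at 02:15 pays 500 + 60 + 60 = 620.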
5. Guard the culture
Blameless postmortems are non-negotiable. If an alert fires and the runbook is wrong, the person who got paged is never at fault. Track the false-positive rate weekly; if it exceeds 30%, declare a "pager freeze" and fix the noise before rotating anyone back in.
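The weekly freeze check is a single comparison. In this sketch the function name is hypothetical, and treating a week with no pages as "no freeze" is an assumption. Integer arithmetic is used so that a rate of exactly 30% does not trip the freeze, matching "exceeds".

```python
FREEZE_THRESHOLD_PERCENT = 30  # freeze when the rate exceeds this


def should_freeze(total_pages: int, false_positives: int) -> bool:
    """Return True when this week's false-positive rate exceeds 30%."""
    if total_pages == 0:
        return False  # assumption: a quiet week never triggers a freeze
    # Compare false_positives / total_pages > 30 / 100 without floats.
    return false_positives * 100 > total_pages * FREEZE_THRESHOLD_PERCENT
```

For example, 4 false positives out of 10 pages (40%) triggers a freeze, while 3 out of 10 (exactly 30%) does not.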
TL;DR
- Two tiers, weekly rotation, Monday handoff
- Every alert links to a runbook in the repo
- Flat stipend + hourly pay, double after midnight
- Blameless postmortems, pager freeze above 30% false-positive