Alerting strategy
Build a signal-to-noise pipeline that wakes the right person at the right time — without burning out your team.
Why this matters
Every alert that fires and doesn't require human action trains your team to ignore alerts. The goal is not “alert on everything” — it's “alert only when a human must decide or act.”
Severity tiers
P0: Revenue path down. Wakes on-call immediately via phone + Slack. Requires an incident channel.
P1: Degraded but not down. Creates a ticket and notifies Slack. Expect a response within 30 min.
P2: Informational anomaly. Logged to the dashboard and reviewed weekly. No immediate action.
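The tiers above can be written down as data so that routing code and humans read the same policy. This is a minimal sketch, assuming the three tiers correspond, in order, to the P0, P1, and P2 labels used in the routing step; the names `Severity`, `TierPolicy`, `TIER_POLICIES`, and `policy_for`, and the channel strings, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    P0 = "P0"  # revenue path down
    P1 = "P1"  # degraded but not down
    P2 = "P2"  # informational anomaly


@dataclass(frozen=True)
class TierPolicy:
    channels: Tuple[str, ...]        # where the alert is delivered
    pages_on_call: bool              # True only when someone must be woken
    response_minutes: Optional[int]  # None means no immediate response is expected


# Channel names are illustrative labels, not identifiers from a real tool.
TIER_POLICIES = {
    Severity.P0: TierPolicy(("phone", "slack", "incident_channel"), True, 0),
    Severity.P1: TierPolicy(("ticket", "slack"), False, 30),
    Severity.P2: TierPolicy(("dashboard",), False, None),
}


def policy_for(severity: Severity) -> TierPolicy:
    """Look up the delivery policy for a severity tier."""
    return TIER_POLICIES[severity]
```

Keeping the policy in one table means a severity demotion during alert review is a one-line change rather than an edit to routing logic.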
The pipeline
- Instrument: emit structured metrics from your app (request latency, error rate, queue depth).
- Aggregate: roll up into 1-minute windows. Never alert on a single data point.
- Threshold: define static thresholds for known-good ranges; layer dynamic baselines for seasonal patterns.
- Deduplicate: group alerts by fingerprint (service + error class). One notification per incident, not per event.
- Route: map severity to channel. P0 → PagerDuty. P1 → Slack + Linear ticket. P2 → dashboard.
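The aggregate, threshold, deduplicate, and route steps above can be sketched in a few functions. This is an illustration under stated assumptions, not a production pipeline: samples are `(unix_seconds, value)` pairs, only a static threshold is shown (no dynamic baseline), a window needs at least two samples before it can breach, and every name here (`aggregate`, `breaches`, `fingerprint`, `Deduplicator`, `route`, and the channel strings in `ROUTES`) is hypothetical.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 60  # 1-minute aggregation windows
MIN_SAMPLES = 2      # assumption: a window with one sample never alerts

# Severity-to-channel map; the strings are labels, not real API identifiers.
ROUTES = {"P0": ["pagerduty"], "P1": ["slack", "linear"], "P2": ["dashboard"]}


def aggregate(samples, window=WINDOW_SECONDS):
    """Group (timestamp, value) samples into fixed windows keyed by window start."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts // window) * window].append(value)
    return dict(buckets)


def breaches(buckets, threshold, min_samples=MIN_SAMPLES):
    """Return sorted window starts whose mean exceeds a static threshold."""
    return sorted(
        start
        for start, values in buckets.items()
        if len(values) >= min_samples and mean(values) > threshold
    )


def fingerprint(service, error_class):
    """Identify an incident by service + error class."""
    return f"{service}:{error_class}"


class Deduplicator:
    """Allow one notification per open incident fingerprint."""

    def __init__(self):
        self.open_incidents = set()

    def should_notify(self, fp):
        if fp in self.open_incidents:
            return False  # incident already open: suppress the repeat
        self.open_incidents.add(fp)
        return True

    def resolve(self, fp):
        self.open_incidents.discard(fp)


def route(severity):
    """Map a severity label to its delivery channels."""
    return ROUTES[severity]
```

For example, with error-rate samples at seconds 0, 20, 40, 60, 90, and 120, the window starting at 120 holds a single sample and is skipped no matter how high its value is, which is the "never alert on a single data point" rule in practice.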
Runbook links
Every alert payload must include a direct link to its runbook. The runbook answers: what does this alert mean, what should I check first, and how do I mitigate if it's real.
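One way to enforce this is to make the runbook link a required, validated field when the payload is built, so an alert without one fails before it can ship. The sketch below assumes a simple dictionary payload; `build_alert_payload`, its field names, and the example URL are hypothetical and do not reflect any particular alerting tool's schema.

```python
from urllib.parse import urlparse


def build_alert_payload(title, severity, runbook_url):
    """Build an alert payload, refusing to create one without a usable runbook link."""
    parsed = urlparse(runbook_url or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"alert {title!r} has no valid runbook URL")
    return {
        "title": title,
        "severity": severity,
        "runbook_url": runbook_url,  # direct link, not a wiki landing page
    }


# Hypothetical example values.
payload = build_alert_payload(
    "Checkout error rate high",
    "P0",
    "https://runbooks.example.com/checkout-error-rate",
)
```

Running the same check in CI against alert definitions would catch a missing runbook at review time instead of during an incident.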
Review cadence
Hold a weekly 30-minute alert review. For every alert that fired, ask: was it actionable? If not, tune the threshold or demote the severity. Track your signal-to-noise ratio (the share of fired alerts that were actionable) over time, and aim for above 80%.
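The review metric can be computed from a simple log of fired alerts. This is a minimal sketch, assuming each fired alert is recorded with a name and a reviewer's yes/no verdict; `FiredAlert`, `actionable_ratio`, `noisy_alert_names`, and `TARGET_RATIO` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

TARGET_RATIO = 0.80  # aim for above 80% actionable


@dataclass
class FiredAlert:
    name: str
    actionable: bool  # the reviewer's verdict from the weekly review


def actionable_ratio(alerts: List[FiredAlert]) -> Optional[float]:
    """Share of fired alerts judged actionable; None when nothing fired."""
    if not alerts:
        return None
    return sum(a.actionable for a in alerts) / len(alerts)


def noisy_alert_names(alerts: List[FiredAlert]) -> List[str]:
    """Names of non-actionable alerts, most frequent first: the tuning candidates."""
    counts = Counter(a.name for a in alerts if not a.actionable)
    return [name for name, _ in counts.most_common()]


def meets_target(alerts: List[FiredAlert]) -> bool:
    """True only when the ratio is strictly above the target."""
    ratio = actionable_ratio(alerts)
    return ratio is not None and ratio > TARGET_RATIO
```

Sorting the non-actionable alerts by frequency gives the review an agenda: the alert at the top of the list is the first candidate for a threshold change or a demotion.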