Recipe

Alerting strategy

Build a high-signal alerting pipeline that wakes the right person at the right time without burning out your team.

Why this matters

Every alert that fires and doesn't require human action trains your team to ignore alerts. The goal is not “alert on everything” — it's “alert only when a human must decide or act.”

Severity tiers

P0 · Page

Revenue path down. Wakes on-call immediately via phone + Slack. Requires incident channel.

P1 · Ticket

Degraded but not down. Creates ticket, notifies Slack. Expect response within 30 min.

P2 · Log

Informational anomaly. Logged to dashboard. Reviewed weekly. No immediate action.
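
The three tiers above can be written down as data so that routing code and humans read the same definitions. This is a minimal sketch, not a prescribed schema: the names `Severity`, `TierPolicy`, `TIER_POLICIES`, and `policy_for` are hypothetical, the channel strings are illustrative labels rather than real integration identifiers, and encoding "immediately" as a 0-minute response target is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    P0 = "page"
    P1 = "ticket"
    P2 = "log"


@dataclass(frozen=True)
class TierPolicy:
    channels: Tuple[str, ...]        # where the alert is delivered
    wakes_on_call: bool              # True only for pages
    response_minutes: Optional[int]  # None: no immediate response expected


# Channel names are illustrative labels, not real integration identifiers.
# response_minutes=0 for P0 is an assumption standing in for "immediately".
TIER_POLICIES = {
    Severity.P0: TierPolicy(("phone", "slack"), wakes_on_call=True, response_minutes=0),
    Severity.P1: TierPolicy(("ticket", "slack"), wakes_on_call=False, response_minutes=30),
    Severity.P2: TierPolicy(("dashboard",), wakes_on_call=False, response_minutes=None),
}


def policy_for(severity: Severity) -> TierPolicy:
    """Look up the delivery policy for a severity tier."""
    return TIER_POLICIES[severity]
```

Keeping the policy in one table makes a demotion (for example P1 to P2) a one-line change that is easy to review.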

The pipeline

  1. Instrument: emit structured metrics from your app (request latency, error rate, queue depth).
  2. Aggregate: roll up into 1-minute windows. Never alert on a single data point.
  3. Threshold: define static thresholds for known-good ranges; layer dynamic baselines for seasonal patterns.
  4. Deduplicate: group alerts by fingerprint (service + error class). One notification per incident, not per event.
  5. Route: map severity to channel. P0 → PagerDuty. P1 → Slack + Linear ticket. P2 → dashboard.
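
Steps 2 through 5 can be sketched in a few functions. This is an illustration under stated assumptions, not a production pipeline: the function and class names are hypothetical, the window aggregate is a plain mean, only a static threshold is shown (dynamic baselines are left out), and the `ROUTES` values are placeholder labels rather than calls into PagerDuty, Slack, or Linear.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 60

# Placeholder routing table mirroring step 5; values are labels only.
ROUTES = {"P0": ["pagerduty"], "P1": ["slack", "linear"], "P2": ["dashboard"]}


def aggregate(samples: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Roll (unix_timestamp, value) samples up into 1-minute window means."""
    windows = defaultdict(list)
    for ts, value in samples:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start].append(value)
    return {start: mean(values) for start, values in sorted(windows.items())}


def breaches(windowed: Dict[int, float], threshold: float) -> List[int]:
    """Return the window starts whose aggregate exceeds a static threshold."""
    return [start for start, value in windowed.items() if value > threshold]


def fingerprint(service: str, error_class: str) -> str:
    """Group alerts by service + error class."""
    return f"{service}:{error_class}"


class Deduplicator:
    """Allow one notification per open incident, keyed by fingerprint."""

    def __init__(self) -> None:
        self.open_incidents = set()

    def should_notify(self, fp: str) -> bool:
        if fp in self.open_incidents:
            return False  # incident already open: suppress the repeat
        self.open_incidents.add(fp)
        return True

    def resolve(self, fp: str) -> None:
        self.open_incidents.discard(fp)


def route(severity: str) -> List[str]:
    """Map a severity tier to its delivery channels."""
    return ROUTES[severity]


# Error-rate samples every 30 seconds; the last two windows are unhealthy.
samples = [(0, 0.01), (30, 0.03), (60, 0.20), (90, 0.30), (120, 0.25), (150, 0.35)]
dedup = Deduplicator()
fp = fingerprint("checkout", "HTTP5xx")
for window_start in breaches(aggregate(samples), threshold=0.05):
    if dedup.should_notify(fp):
        print(window_start, fp, route("P0"))
```

In this run two consecutive windows breach the threshold, but only the first produces a notification because the fingerprint stays open until `resolve` is called.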

Runbook links

Every alert payload must include a direct link to its runbook. The runbook answers: what does this alert mean, what should I check first, and how do I mitigate if it's real.
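
One way to enforce the runbook rule is to reject alert definitions that lack a usable link before they ship. The sketch below assumes a dictionary payload with `title`, `severity`, and `runbook_url` keys; those field names, the `validate_payload` helper, and the example URL are hypothetical and not taken from any specific alerting tool.

```python
from typing import List
from urllib.parse import urlparse

# Assumed payload fields; adjust to whatever your alerting tool emits.
REQUIRED_FIELDS = ("title", "severity", "runbook_url")


def validate_payload(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the payload is acceptable."""
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if not payload.get(field)
    ]
    url = payload.get("runbook_url")
    if url:
        parsed = urlparse(url)
        # A direct link must be absolute so it works from a phone or chat client.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("runbook_url is not an absolute http(s) URL")
    return problems


good = {
    "title": "Checkout 5xx rate above threshold",
    "severity": "P0",
    "runbook_url": "https://runbooks.example.com/checkout-5xx",  # placeholder URL
}
print(validate_payload(good))
print(validate_payload({"title": "Queue depth high", "severity": "P1"}))
```

Running a check like this in CI against alert definitions catches a missing runbook at review time rather than during an incident.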

Review cadence

Hold a weekly 30-minute alert review. For every alert that fired, ask: was it actionable? If not, tune the threshold or demote the severity. Track the share of fired alerts that were actionable over time, and aim for above 80%.
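
The review metric is simple arithmetic: actionable alerts divided by all alerts that fired. A minimal sketch follows, assuming each fired alert has been labeled actionable or not during the review; the `FiredAlert` record, the helper names, and the sample week are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

TARGET_RATIO = 0.80  # the "above 80% actionable" goal


@dataclass
class FiredAlert:
    name: str
    actionable: bool  # labeled by a human during the weekly review


def actionable_ratio(alerts: List[FiredAlert]) -> Optional[float]:
    """Share of fired alerts that required human action; None if nothing fired."""
    if not alerts:
        return None
    return sum(a.actionable for a in alerts) / len(alerts)


def tuning_candidates(alerts: List[FiredAlert]) -> List[Tuple[str, int]]:
    """Non-actionable alerts, noisiest first: candidates to tune or demote."""
    noise = Counter(a.name for a in alerts if not a.actionable)
    return noise.most_common()


# Made-up sample week for illustration only.
week = [
    FiredAlert("checkout-5xx", True),
    FiredAlert("queue-depth", False),
    FiredAlert("queue-depth", False),
    FiredAlert("latency-p99", True),
    FiredAlert("checkout-5xx", True),
]
ratio = actionable_ratio(week)
print(f"actionable: {ratio:.0%}, meets target: {ratio > TARGET_RATIO}")
print(tuning_candidates(week))
```

Here 3 of 5 alerts were actionable, so the week falls short of the target and `queue-depth` is the first alert to tune or demote.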