Recipe: Production log triage

A repeatable workflow for diagnosing incidents from raw logs in under five minutes.

1. Scope the blast radius

Filter by time window (±2 min around the first alert), then by service name. Discard health-check noise and known transient errors before reading a single line.

2. Find the first anomaly

Scan for the earliest 5xx, timeout, or stack trace. That line is your root-cause candidate. Everything after it is often fallout.

3. Trace the request ID

Grab the correlation ID from the failing line and replay every log entry that shares it. Reconstruct the full lifecycle: ingress → auth → handler → upstream → response.

4. Diff against a healthy baseline

Pull a successful request with the same endpoint and method from five minutes earlier. Compare latency, payload size, and dependency calls side by side.

5. Write the one-line summary

Before touching code, commit a single sentence to the incident channel: what broke, when, and the evidence. This forces clarity and prevents premature fixes.

Pro tip: Keep a terminal alias that tails the last 500 lines of your structured log sink. Muscle memory beats dashboards during an incident.