Recipe: Production log triage
A repeatable workflow for diagnosing incidents from raw logs in under five minutes.
1. Scope the blast radius
Filter by time window (±2 min around the first alert), then by service name. Discard health-check noise and known transient errors before reading a single line.
2. Find the first anomaly
Scan for the earliest 5xx, timeout, or stack trace. That line is your root-cause candidate. Everything after it is often fallout.
3. Trace the request ID
Grab the correlation ID from the failing line and replay every log entry that shares it. Reconstruct the full lifecycle: ingress → auth → handler → upstream → response.
4. Diff against a healthy baseline
Pull a successful request with the same endpoint and method from five minutes earlier. Compare latency, payload size, and dependency calls side by side.
5. Write the one-line summary
Before touching code, commit a single sentence to the incident channel: what broke, when, and the evidence. This forces clarity and prevents premature fixes.
Pro tip: Keep a terminal alias that tails the last 500 lines of your structured log sink. Muscle memory beats dashboards during an incident.