Metric naming + label taxonomy
A consistent schema for naming metrics and structuring labels so dashboards stay readable, queries stay fast, and on-call engineers stop guessing.
Metric name structure
Every metric follows the pattern <domain>_<noun>_<unit>. Domains are short prefixes like http, db, or queue. Nouns describe what is measured (requests, latency, errors). Units are always plural and unabbreviated: seconds not sec.
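The pattern above can be checked mechanically, for instance in a CI step. The following is a minimal sketch, not a definitive implementation: the function name check_metric_name is hypothetical, the domain list is taken from the examples in the text, and the unit allow-list is an assumption that each team would extend.

```python
import re

# Domains from the text; extend as needed.
KNOWN_DOMAINS = {"http", "db", "queue"}
# Assumed allow-list of plural, unabbreviated units.
KNOWN_UNITS = {"seconds", "bytes"}

# <domain>_<noun>_<unit>; the noun may itself contain underscores.
NAME_RE = re.compile(
    r"^(?P<domain>[a-z]+)_(?P<noun>[a-z][a-z0-9_]*)_(?P<unit>[a-z]+)$"
)


def check_metric_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name fits the pattern."""
    match = NAME_RE.match(name)
    if match is None:
        return [f"{name!r} does not match <domain>_<noun>_<unit>"]
    problems = []
    if match["domain"] not in KNOWN_DOMAINS:
        problems.append(f"unknown domain {match['domain']!r}")
    if match["unit"] not in KNOWN_UNITS:
        problems.append(
            f"unit {match['unit']!r} is not a known plural, unabbreviated unit"
        )
    return problems


if __name__ == "__main__":
    for candidate in ("http_latency_seconds", "http_latency_sec", "http.latency"):
        print(candidate, check_metric_name(candidate))
```

Returning a list of problems rather than a boolean lets a linter report every violation in one pass.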
Label cardinality budget
Labels are expensive: every distinct combination of label values creates its own time series. Cap total label combinations per metric at 10,000. Never put user IDs, request IDs, or unbounded strings in labels. Put high-cardinality dimensions in structured log attributes instead, and link back via trace ID.
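One way to reason about the budget is to multiply the number of distinct values each label can take, which gives a worst-case series count for the metric. The sketch below assumes that worst case (every combination actually occurs); the helper names and the per-label counts are hypothetical.

```python
from math import prod

CARDINALITY_BUDGET = 10_000  # cap from the text


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on label combinations: product of distinct values per label."""
    return prod(label_value_counts.values())


def within_budget(
    label_value_counts: dict[str, int], budget: int = CARDINALITY_BUDGET
) -> bool:
    """True if the worst-case combination count stays at or below the budget."""
    return worst_case_series(label_value_counts) <= budget


if __name__ == "__main__":
    # Hypothetical counts: 3 statuses, 4 error types, 7 methods, 40 endpoints.
    labels = {"status": 3, "error_type": 4, "method": 7, "endpoint": 40}
    print(worst_case_series(labels), within_budget(labels))
```

With these illustrative counts the bound is 3 × 4 × 7 × 40 = 3,360, which fits; growing the endpoint label to 200 values would push it to 16,800 and over the cap.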
Standard label set
- status: success | error | degraded
- error_type: timeout | ratelimit | internal | upstream
- method: HTTP verb or RPC name, lowercased
- endpoint: normalized path with parameter placeholders
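The label set above could be enforced at the point where labels are built. This is a sketch under stated assumptions: the helper names, the :id and :uuid placeholder spellings, and the choice to omit error_type when no error occurred are all illustrative and not prescribed by the recipe.

```python
import re

ALLOWED_STATUS = {"success", "error", "degraded"}
ALLOWED_ERROR_TYPES = {"timeout", "ratelimit", "internal", "upstream"}

_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def normalize_endpoint(path: str) -> str:
    """Drop the query string and replace numeric or UUID segments with placeholders."""
    segments = []
    for segment in path.split("?", 1)[0].split("/"):
        if segment.isdigit():
            segments.append(":id")  # assumed placeholder spelling
        elif _UUID_RE.match(segment):
            segments.append(":uuid")  # assumed placeholder spelling
        else:
            segments.append(segment)
    return "/".join(segments)


def build_labels(status, method, path, error_type=None) -> dict[str, str]:
    """Build the standard label set, rejecting values outside the taxonomy."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    labels = {
        "status": status,
        "method": method.lower(),
        "endpoint": normalize_endpoint(path),
    }
    if error_type is not None:
        if error_type not in ALLOWED_ERROR_TYPES:
            raise ValueError(f"unknown error_type: {error_type!r}")
        labels["error_type"] = error_type
    return labels


if __name__ == "__main__":
    print(build_labels("error", "GET", "/users/42/orders/7?page=2", "timeout"))
```

Rejecting unknown values with an exception keeps free-form strings from leaking into labels and inflating cardinality.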
Histogram bucket policy
Use explicit buckets tuned to your SLO, not default Prometheus buckets. For latency SLOs at 100ms, buckets should cluster around 50ms–200ms with a long tail for outliers. Document bucket choices in a comment above the metric definition.
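As an illustration of the policy, the sketch below defines one possible bucket layout for a 100 ms SLO and counts observations the way a cumulative histogram would. The specific boundaries and the bucket_counts helper are assumptions chosen for the example, not values prescribed by the recipe.

```python
from bisect import bisect_left

# Buckets in seconds for a 100 ms latency SLO: dense between 50 ms and
# 200 ms where the SLO decision is made, sparse above for the outlier tail.
# These boundaries are illustrative assumptions.
LATENCY_BUCKETS_SECONDS = (
    0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.2, 0.5, 1.0, 2.5,
)


def bucket_counts(observations, buckets=LATENCY_BUCKETS_SECONDS) -> list[int]:
    """Cumulative counts per upper bound, with a final +Inf bucket."""
    counts = [0] * (len(buckets) + 1)
    for value in observations:
        # Index of the first bucket whose upper bound is >= value;
        # len(buckets) means the value only fits in +Inf.
        counts[bisect_left(buckets, value)] += 1
    cumulative = []
    running = 0
    for count in counts:
        running += count
        cumulative.append(running)
    return cumulative


if __name__ == "__main__":
    print(bucket_counts([0.04, 0.1, 0.11, 3.0]))
```

If the Python prometheus_client library is in use, its Histogram constructor accepts a buckets argument, so a tuple like this one could be passed in directly.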
Naming anti-patterns
- No dots in metric names; use underscores
- No units baked into the name suffix when a unit label exists
- No _total suffix on counters; Prometheus appends it
- No service name in the metric; that is a label
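The anti-patterns above lend themselves to a small lint function. The following is a rough sketch: lint_metric_name, the unit suffix list, and the service names in the usage example are all hypothetical, and the unit check only fires when a label literally named unit is present.

```python
import re

# Assumed list of unit suffixes to look for when a unit label exists.
UNIT_SUFFIXES = {"seconds", "bytes"}


def lint_metric_name(
    name: str, label_names: set[str], service_names: set[str]
) -> list[str]:
    """Return one message per anti-pattern found in the metric name."""
    problems = []
    if "." in name:
        problems.append("contains dots; use underscores")
    if "unit" in label_names and name.rsplit("_", 1)[-1] in UNIT_SUFFIXES:
        problems.append("unit in name suffix although a unit label exists")
    if name.endswith("_total"):
        problems.append("ends with _total")
    # Split on both separators so dotted names are still inspected.
    if any(part in service_names for part in re.split(r"[._]", name)):
        problems.append("contains a service name; use a label instead")
    return problems


if __name__ == "__main__":
    services = {"checkout", "billing"}  # hypothetical service names
    print(lint_metric_name("checkout.http_requests_total", set(), services))
    print(lint_metric_name("http_latency_seconds", {"status"}, services))
```

Keeping each rule as an independent check makes it easy to add or retire rules as the taxonomy evolves.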