Recipe: API rate-limit strategy for LLM apps

Protect your LLM endpoints from abuse without blocking legitimate users. This recipe covers token-bucket, sliding-window, and circuit-breaker patterns tuned for high-latency inference workloads.

Token-bucket core

Each API key gets a bucket of tokens (rate-limit credits, not LLM tokens) that refills at a steady rate. Each LLM request consumes one token. Bucket depth absorbs bursts; sustained traffic above the refill rate gets a 429 (Too Many Requests).

bucket:
  capacity: 60
  refill_rate: 2/sec
  per: api_key
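
A minimal in-process sketch of the bucket described above, using the same capacity and refill rate as the config. The names `TokenBucket` and `allow_request` and the injectable `now` argument are illustrative assumptions, and a multi-instance deployment would need shared storage rather than a local dict.

```python
import time


class TokenBucket:
    """Per-key bucket: starts full, refills continuously, one token per request."""

    def __init__(self, capacity=60, refill_rate=2.0, now=None):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)     # start with a full bucket
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now=None, cost=1):
        now = time.monotonic() if now is None else now
        # Credit the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                      # caller should respond with 429


# One bucket per API key (hypothetical in-memory store).
buckets = {}


def allow_request(api_key, now=None):
    bucket = buckets.get(api_key)
    if bucket is None:
        bucket = buckets[api_key] = TokenBucket(now=now)
    return bucket.allow(now=now)
```

With these numbers a client can burst 60 requests at once and then sustain 2 requests per second.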

Sliding-window accuracy

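For billing-tier enforcement, use a sliding-window counter in Redis. Track request timestamps per key and evict entries older than the window before counting. This avoids the boundary-burst problem of fixed windows, where a client can send up to twice the limit by straddling a window reset.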

window: 60s
max_requests: 100
key: ratelimit:{api_key}:llm
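
The evict-then-count logic can be sketched in memory as follows. This is an illustration under assumptions, not a Redis client: a `deque` stands in for the per-key store, and the function name `sliding_window_allow` is hypothetical. In Redis the same steps map onto sorted-set commands such as ZREMRANGEBYSCORE, ZCARD, and ZADD.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100

# Stand-in for Redis: one timestamp queue per rate-limit key.
_hits = defaultdict(deque)


def sliding_window_allow(api_key, now=None):
    now = time.time() if now is None else now
    hits = _hits[f"ratelimit:{api_key}:llm"]
    # Evict timestamps that have fallen out of the window.
    while hits and hits[0] <= now - WINDOW_SECONDS:
        hits.popleft()
    if len(hits) >= MAX_REQUESTS:
        return False          # over the limit; rejected calls are not recorded
    hits.append(now)
    return True
```

Because rejected requests are not recorded, a client that keeps retrying while limited does not extend its own lockout. Recording them instead is a stricter policy choice.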

Circuit breaker

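When more than 50% of calls to your upstream LLM provider fail within a 30-second window, trip the circuit. While it is open, reject requests immediately with a 503 instead of piling on retries. After 15 seconds, move to half-open and let a probe request through: close the circuit if it succeeds, reopen it if it fails.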
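
A small sketch of that state machine, using the thresholds from the text. The class name, the single-probe policy, and the `min_calls` floor (which keeps one early failure from tripping the circuit) are assumptions added for illustration.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, error_threshold=0.5, window=30.0, cooldown=15.0, min_calls=10):
        self.error_threshold = error_threshold  # trip above this error ratio
        self.window = window                    # seconds of history considered
        self.cooldown = cooldown                # seconds open before probing
        self.min_calls = min_calls              # assumption: minimum sample size
        self.results = deque()                  # (timestamp, ok) pairs
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self, now=None):
        """Return True if the call may go upstream, False to answer 503."""
        now = time.monotonic() if now is None else now
        if self.state == "open":
            if now - self.opened_at >= self.cooldown:
                self.state = "half_open"        # let exactly one probe through
                return True
            return False
        if self.state == "half_open":
            return False                        # a probe is already in flight
        return True

    def record(self, ok, now=None):
        """Report the outcome of an upstream call."""
        now = time.monotonic() if now is None else now
        if self.state == "open":
            return                              # ignore stragglers while open
        if self.state == "half_open":
            if ok:
                self.state = "closed"
                self.results.clear()
            else:
                self.state, self.opened_at = "open", now
            return
        self.results.append((now, ok))
        while self.results and self.results[0][0] <= now - self.window:
            self.results.popleft()
        errors = sum(1 for _, success in self.results if not success)
        total = len(self.results)
        if total >= self.min_calls and errors / total > self.error_threshold:
            self.state, self.opened_at = "open", now
```

A production version would also time out a probe that never reports back, so the breaker cannot stay half-open indefinitely.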

Response headers

Always return rate-limit headers so clients can self-throttle:

X-RateLimit-Limit: bucket capacity
X-RateLimit-Remaining: tokens left
X-RateLimit-Reset: Unix time (epoch seconds) when the bucket is fully refilled
Retry-After: seconds to wait before retrying after a 429
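
One way to derive these headers from token-bucket state is sketched below. The function name and parameters are hypothetical, and the sketch assumes X-RateLimit-Reset is an absolute Unix timestamp; some APIs send a relative number of seconds instead, so document whichever you choose.

```python
import math
import time


def rate_limit_headers(capacity, tokens, refill_rate, now=None):
    """Build rate-limit headers from the current bucket state."""
    now = time.time() if now is None else now
    seconds_to_full = (capacity - tokens) / refill_rate
    headers = {
        "X-RateLimit-Limit": str(capacity),
        "X-RateLimit-Remaining": str(max(0, math.floor(tokens))),
        # Assumption: absolute epoch seconds at which the bucket is full again.
        "X-RateLimit-Reset": str(math.ceil(now + seconds_to_full)),
    }
    if tokens < 1:
        # Time until one whole token is available, rounded up to at least 1 s.
        wait = math.ceil((1 - tokens) / refill_rate)
        headers["Retry-After"] = str(max(1, wait))
    return headers
```

Retry-After is only attached when the bucket cannot cover one more request, which is when the caller would be sending a 429.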

Pro tip: Tier your limits by plan. Free: 10 req/min. Pro: 100 req/min. Enterprise: custom. Store limits in your billing metadata and resolve at the edge.
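
Resolving a limit from billing metadata might look like the sketch below. The metadata field names (`plan`, `custom_limit_per_min`) are hypothetical, and falling back to the free tier for unknown or incomplete metadata is an assumed conservative default.

```python
# Requests per minute for the fixed tiers named above.
PLAN_LIMITS = {"free": 10, "pro": 100}


def resolve_limit(billing_metadata):
    """Return the requests-per-minute limit for a customer's billing record."""
    plan = billing_metadata.get("plan")
    custom = billing_metadata.get("custom_limit_per_min")  # hypothetical field
    if plan == "enterprise" and custom is not None:
        return int(custom)
    # Assumption: anything unrecognized gets the most restrictive tier.
    return PLAN_LIMITS.get(plan, PLAN_LIMITS["free"])
```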