Recipe: API rate-limit strategy for LLM apps
Protect your LLM endpoints from abuse while keeping legitimate users unblocked. Token-bucket, sliding-window, and circuit-breaker patterns tuned for high-latency inference workloads.
Token-bucket core
Each API key gets a bucket of tokens that refills at a steady rate. A single LLM request consumes one token (a rate-limit token, not a model token). Bucket depth absorbs bursts; sustained traffic above the refill rate gets a 429 (Too Many Requests).
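The refill-and-spend logic can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production limiter: the class name, the `allow` method, and the injectable `clock` are hypothetical, and the defaults simply mirror this recipe's capacity of 60 and refill rate of 2 per second.

```python
import time


class TokenBucket:
    """Token bucket with lazy refill: tokens are topped up on each call."""

    def __init__(self, capacity=60, refill_rate=2.0, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock              # injectable so tests can fake time
        self.tokens = float(capacity)   # a new bucket starts full
        self.updated = clock()

    def allow(self, cost=1):
        """Spend `cost` tokens and return True, or return False (send a 429)."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Credit the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


buckets = {}  # one bucket per API key (in-memory stand-in for shared storage)


def check(api_key):
    """Look up (or create) the bucket for this key and try to spend a token."""
    if api_key not in buckets:
        buckets[api_key] = TokenBucket()
    return buckets[api_key].allow()
```

Refilling lazily on each call avoids a background timer. A deployment with more than one server process would need the bucket state in shared storage rather than a local dict.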
bucket:
  capacity: 60
  refill_rate: 2/sec
  per: api_key
Sliding-window accuracy
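For billing-tier enforcement, use a sliding-window counter in Redis. Track request timestamps per key and evict entries older than the window. This avoids the boundary-burst problem of fixed windows, where a client can send up to twice the limit by clustering requests on either side of a window reset.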
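To make the evict-then-count logic concrete, here is a minimal sketch that keeps the timestamps in an in-memory deque instead of Redis, so it runs without a server. The class and method names are hypothetical; the 60-second window, the limit of 100, and the key format come from this recipe's configuration. In Redis, one common way to get the same behavior is a sorted set per key: `ZREMRANGEBYSCORE` to evict, `ZCARD` to count, `ZADD` to record.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Sliding-window log: one timestamp per accepted request."""

    def __init__(self, window=60.0, max_requests=100, clock=time.time):
        self.window = window
        self.max_requests = max_requests
        self.clock = clock                # injectable so tests can fake time
        self.hits = defaultdict(deque)    # key -> timestamps, oldest first

    def allow(self, api_key):
        """Return True if the request fits in the window, else False."""
        now = self.clock()
        key = f"ratelimit:{api_key}:llm"
        timestamps = self.hits[key]
        # Evict entries that have aged out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True
```

Rejected requests are not recorded here, so a client that keeps hammering the endpoint does not extend its own lockout. Whether that is the right policy is a design choice.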
window: 60s
max_requests: 100
key: ratelimit:{api_key}:llm
Circuit breaker
When more than 50% of requests to your upstream LLM provider fail within a 30-second window, trip the circuit. While it is open, reject fast with a 503 instead of piling on retries. After 15 seconds, go half-open and let a probe request through to test recovery.
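The closed, open, and half-open states can be sketched as a small single-threaded class. Everything named here is hypothetical. The `min_calls` guard in particular is an assumption that is not part of the recipe: it keeps one or two early failures from tripping the breaker on a tiny sample.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, error_threshold=0.5, window=30.0, cooldown=15.0,
                 min_calls=10, clock=time.monotonic):
        self.error_threshold = error_threshold  # trip above this error ratio
        self.window = window                    # seconds of history to keep
        self.cooldown = cooldown                # seconds open before probing
        self.min_calls = min_calls              # assumed guard, not in the recipe
        self.clock = clock
        self.calls = deque()    # (timestamp, ok) pairs inside the window
        self.opened_at = None   # None means the circuit is closed
        self.probing = False    # True while a half-open probe is in flight

    def allow(self):
        """Return False when the caller should reject fast with a 503."""
        if self.opened_at is None:
            return True
        cooled = self.clock() - self.opened_at >= self.cooldown
        if cooled and not self.probing:
            self.probing = True  # half-open: let exactly one probe through
            return True
        return False

    def record(self, ok):
        """Report the outcome of an upstream call that allow() permitted."""
        now = self.clock()
        if self.probing:
            self.probing = False
            if ok:
                self.opened_at = None  # probe succeeded: close the circuit
                self.calls.clear()
            else:
                self.opened_at = now   # probe failed: restart the cooldown
            return
        if self.opened_at is not None:
            return  # late result from before the trip; ignore it
        self.calls.append((now, ok))
        while self.calls and self.calls[0][0] <= now - self.window:
            self.calls.popleft()
        errors = sum(1 for _, good in self.calls if not good)
        if (len(self.calls) >= self.min_calls
                and errors / len(self.calls) > self.error_threshold):
            self.opened_at = now
```

A caller would check `allow()` before each upstream request, return a 503 when it is False, and otherwise call `record(ok=...)` with the result.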
Response headers
Always return rate-limit headers so clients can self-throttle:
X-RateLimit-Limit: bucket capacity
X-RateLimit-Remaining: tokens left
X-RateLimit-Reset: Unix epoch time, in seconds, at which the bucket is fully refilled
Retry-After: seconds to wait on 429
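One way to derive these four values from token-bucket state is sketched below. The function name and parameters are hypothetical. The sketch assumes X-RateLimit-Reset is an absolute Unix timestamp and that Retry-After is only sent when less than one token is available.

```python
import math
import time


def rate_limit_headers(capacity, tokens, refill_rate, now=None):
    """Build rate-limit response headers from the current bucket state."""
    now = time.time() if now is None else now
    seconds_to_full = (capacity - tokens) / refill_rate
    headers = {
        "X-RateLimit-Limit": str(capacity),
        "X-RateLimit-Remaining": str(int(tokens)),  # whole tokens only
        "X-RateLimit-Reset": str(math.ceil(now + seconds_to_full)),
    }
    if tokens < 1:
        # Time until one full token is available again, rounded up.
        seconds_to_next = (1 - tokens) / refill_rate
        headers["Retry-After"] = str(math.ceil(seconds_to_next))
    return headers
```

Rounding up keeps clients from retrying a fraction of a second too early.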
Pro tip: Tier your limits by plan. Free: 10 req/min. Pro: 100 req/min. Enterprise: custom. Store limits in your billing metadata and resolve at the edge.
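Resolving a limit from billing metadata might look like the sketch below. The field names `plan` and `custom_limit_per_min` are hypothetical, and falling back to the Free tier for a missing or unknown plan is an assumption, not something the recipe prescribes.

```python
# Requests per minute for the fixed tiers named in the tip.
PLAN_LIMITS = {"free": 10, "pro": 100}


def resolve_limit(billing_metadata):
    """Return the per-minute request limit for one customer record."""
    plan = billing_metadata.get("plan", "free")
    if plan == "enterprise":
        # Enterprise limits are custom, so they are stored per customer.
        return billing_metadata["custom_limit_per_min"]
    # Assumed policy: unknown plans get the most restrictive tier.
    return PLAN_LIMITS.get(plan, PLAN_LIMITS["free"])
```

For example, `resolve_limit({"plan": "pro"})` returns 100, and the result can be passed as the request limit to whichever limiter you run at the edge.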