LLM Quota Design
Quotas are the difference between a healthy LLM gateway and a bankrupt one. This recipe walks through the three-layer quota model Meridian uses to keep costs predictable while still letting tenants burst when they need to.
1. Pick the right unit
Requests-per-minute alone lies: a single completion can burn 4k tokens or 400k. Meter quotas in tokens-per-minute (TPM) and requests-per-minute (RPM) simultaneously; whichever bucket runs out first rejects the request. For reasoning models, track reasoning_tokens separately, since they consume budget before any visible output lands.
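As a rough illustration of metering both units at once, the sketch below admits a request only when it fits in both the RPM and the TPM budget. It assumes a simple fixed 60-second window and a token estimate supplied at admission time; MinuteCounter, admitDual, and the promptTokens/reasoningTokens/outputTokens field names are hypothetical, not Meridian's actual API.

```javascript
// Fixed 60-second window counter. A hypothetical helper for illustration only.
class MinuteCounter {
  constructor(limit) {
    this.limit = limit;     // units allowed per window
    this.used = 0;          // units consumed in the current window
    this.windowStart = 0;   // ms timestamp at which the current window began
  }

  // Units still available, after rolling to a fresh window if 60 s have passed.
  remaining(nowMs) {
    if (nowMs - this.windowStart >= 60000) {
      this.windowStart = nowMs;
      this.used = 0;
    }
    return this.limit - this.used;
  }

  consume(amount) {
    this.used += amount;
  }
}

// Admit only if the request fits in BOTH budgets. Nothing is debited
// unless both checks pass, so a denial never leaks budget.
function admitDual(rpm, tpm, estimate, nowMs) {
  // Reasoning tokens count against the budget like any other token,
  // but are reported separately so they can be tracked on their own.
  const tokens =
    estimate.promptTokens + estimate.reasoningTokens + estimate.outputTokens;
  if (rpm.remaining(nowMs) < 1) return { ok: false, reason: "rpm" };
  if (tpm.remaining(nowMs) < tokens) return { ok: false, reason: "tpm" };
  rpm.consume(1);
  tpm.consume(tokens);
  return { ok: true, tokens, reasoningTokens: estimate.reasoningTokens };
}
```

Checking both budgets before debiting either keeps the two counters consistent: a request denied on tokens does not use up a request slot.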
2. Layer the limits
Run three concentric buckets: per-key, per-tenant, per-deployment. A noisy key cannot starve its tenant; a noisy tenant cannot starve the deployment. Check from the narrowest bucket outward, reject at the first one that is exhausted, and emit a 429 with a Retry-After header so clients back off cleanly.
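One possible shape for the rejection itself is sketched here, assuming a framework-agnostic response object. The retryAfterSeconds parameter, the JSON body layout, and secondsUntilWindowReset are illustrative assumptions; only the 429 status and the Retry-After header come from the text above.

```javascript
// Build a 429 response for a denied request.
// `reason` names the bucket that rejected it, e.g. "key_tpm".
// The second parameter and the body shape are assumptions for this sketch.
function deny(reason, retryAfterSeconds = 1) {
  return {
    status: 429,
    headers: {
      // Retry-After takes a whole number of seconds; round up so clients
      // never retry before the budget has actually recovered.
      "Retry-After": String(Math.max(1, Math.ceil(retryAfterSeconds))),
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ error: "rate_limited", reason }),
  };
}

// Seconds until a fixed window aligned to the epoch ends.
// Hypothetical helper: a sliding window would need a different calculation.
function secondsUntilWindowReset(nowMs, windowMs = 60000) {
  return (windowMs - (nowMs % windowMs)) / 1000;
}
```

Including the rejecting bucket in the body lets a client tell whether its own key is over budget or a shared tenant or deployment limit was hit.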
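    // Pseudocode: three-layer check, narrowest bucket first.
    // key, tenant, and deploy are the limiters resolved for this request.
    // Caveat: if allow() debits on success, a denial at a later layer
    // should refund the layers that already passed.
    function admit(req) {
        if (!key.tpm.allow(req.tokens)) return deny("key_tpm");
        if (!tenant.tpm.allow(req.tokens)) return deny("tenant_tpm");
        if (!deploy.tpm.allow(req.tokens)) return deny("deploy_tpm");
        return ok();
    }

3. Make burst legible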
Sliding-window counters in Redis with a 60-second window and 10-second granularity (six sub-buckets summed on each check) give tenants room to burst without letting a runaway loop drain a month of credits in an hour. Expose the remaining budget as response headers so customers can self-throttle before you have to do it for them.
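The window arithmetic can be sketched with an in-memory Map standing in for Redis. This is a minimal illustration under that assumption, not a production limiter: SlidingWindowCounter and the X-RateLimit-* header names are hypothetical, and a real deployment would keep the per-slice counters in Redis with an expiry on each.

```javascript
// Sliding-window counter: a 60 s window split into 10 s slices.
// In-memory stand-in for the Redis counters; all names are hypothetical.
class SlidingWindowCounter {
  constructor(limit, windowMs = 60000, bucketMs = 10000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.bucketMs = bucketMs;
    this.buckets = new Map(); // slice index -> tokens used in that slice
  }

  // Sum usage over the slices still inside the window; drop older slices.
  used(nowMs) {
    const current = Math.floor(nowMs / this.bucketMs);
    const oldest = current - this.windowMs / this.bucketMs + 1;
    let total = 0;
    for (const [index, count] of this.buckets) {
      if (index < oldest) this.buckets.delete(index);
      else total += count;
    }
    return total;
  }

  // Debit `tokens` from the current slice if they fit in the window budget.
  allow(tokens, nowMs) {
    if (this.used(nowMs) + tokens > this.limit) return false;
    const current = Math.floor(nowMs / this.bucketMs);
    this.buckets.set(current, (this.buckets.get(current) || 0) + tokens);
    return true;
  }

  // Remaining budget as response headers (illustrative names, not a standard).
  headers(nowMs) {
    const remaining = Math.max(0, this.limit - this.used(nowMs));
    return {
      "X-RateLimit-Limit-Tokens": String(this.limit),
      "X-RateLimit-Remaining-Tokens": String(remaining),
    };
  }
}
```

Because budget is released one 10-second slice at a time, a tenant that bursts early in the window gets capacity back gradually as each slice ages out, rather than all at once at a minute boundary.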