Recipe

LLM Quota Design

Quotas are the difference between a healthy LLM gateway and a bankrupt one. This recipe walks through the three-layer quota model Meridian uses to keep costs predictable while still letting tenants burst when they need to.

1. Pick the right unit

Requests-per-minute alone lies. A single completion can burn 4k tokens or 400k. Meter quotas in tokens-per-minute (TPM) and requests-per-minute (RPM) simultaneously; whichever limit is exhausted first denies the request. For reasoning models, track reasoning_tokens separately, since they consume budget before any visible output lands.
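
A minimal sketch of dual metering, to show why the token count has to sit next to the request count. The usage shape (prompt_tokens, completion_tokens, reasoning_tokens as separate fields) and the helper names billableTokens and limitingUnit are assumptions for illustration, not Meridian's API. If your provider already folds reasoning tokens into the completion count, do not add them a second time.

```javascript
// Hypothetical usage shape: the three token counts arrive as separate fields.
function billableTokens(usage) {
  const reasoning = usage.reasoning_tokens ?? 0;
  return {
    reasoning, // kept separately so it can be reported on its own
    total: usage.prompt_tokens + usage.completion_tokens + reasoning,
  };
}

// Returns the unit that blocks the request, or null if both limits have room.
function limitingUnit(used, limits, cost) {
  if (used.requests + 1 > limits.rpm) return "rpm";
  if (used.tokens + cost > limits.tpm) return "tpm";
  return null;
}

// Illustrative limits and requests, not recommended values.
const limits = { rpm: 60, tpm: 100000 };
const small = billableTokens({ prompt_tokens: 3000, completion_tokens: 1000 });
const large = billableTokens({
  prompt_tokens: 200000,
  completion_tokens: 50000,
  reasoning_tokens: 150000,
});
```

Both requests count as one against RPM, but only the token meter sees that the second one costs a hundred times more than the first.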

2. Layer the limits

Run three concentric buckets: per-key, per-tenant, per-deployment. A noisy key cannot starve its tenant; a noisy tenant cannot starve the deployment. Reject at the innermost bucket that is out of budget and return a 429 with a Retry-After header so clients back off cleanly.

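// Pseudocode: three-layer check, innermost first. hasRoom() only peeks; consume()
// runs once every layer has room, so an outer denial never charges an inner bucket.
function admit(req) {
  const layers = [["key", key], ["tenant", tenant], ["deploy", deploy]];
  for (const [name, scope] of layers) {
    if (!scope.rpm.hasRoom(1))          return deny(name + "_rpm");
    if (!scope.tpm.hasRoom(req.tokens)) return deny(name + "_tpm");
  }
  for (const [, scope] of layers) {
    scope.rpm.consume(1);
    scope.tpm.consume(req.tokens);
  }
  return ok();
}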
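
The denial path can be sketched in the same spirit. This assumes the limiter knows when the exhausted bucket will next free capacity (resetAtMs, a hypothetical input) and returns a framework-agnostic response object. The body fields are illustrative; only the 429 status and the Retry-After header come from the recipe.

```javascript
// Builds a 429 response for a denied request.
// `reason` is the label of the layer that denied, e.g. "tenant_tpm".
// `resetAtMs` is when the exhausted bucket frees capacity (epoch milliseconds).
function denyResponse(reason, resetAtMs, nowMs = Date.now()) {
  // Retry-After carries whole seconds; round up and never advertise less than 1.
  const retryAfterSec = Math.max(1, Math.ceil((resetAtMs - nowMs) / 1000));
  return {
    status: 429,
    headers: { "Retry-After": String(retryAfterSec) },
    body: { error: "rate_limited", reason, retry_after_seconds: retryAfterSec },
  };
}

// A tenant bucket that frees up 12.5 seconds from "now" (now = 0 for the example).
const res = denyResponse("tenant_tpm", 12500, 0);
```

Rounding up matters here: a client that retries half a second early just earns another 429.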

3. Make burst legible

Sliding-window counters in Redis, with a 60-second window tracked at 10-second granularity (six sub-buckets per window), give tenants room to burst without letting a runaway loop drain a month of credits in an hour. Expose remaining budget as response headers so customers can self-throttle before you have to do it for them.
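
To make the window arithmetic concrete, here is a rough in-memory stand-in for the Redis counters: one slot per 10-second slice, summed over the most recent 60 seconds. The class name SlidingWindow, the helper budgetHeaders, and the X-RateLimit-* header names are all assumptions for illustration. A Redis version could keep one key per slot using INCRBY and EXPIRE, and would need a Lua script or transaction to make check-and-consume atomic.

```javascript
class SlidingWindow {
  constructor(limit, windowMs = 60000, slotMs = 10000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.slotMs = slotMs;
    this.slots = new Map(); // slot index -> tokens used in that slice
  }

  // Drops slots that have fully left the window, then sums the rest.
  used(nowMs) {
    const current = Math.floor(nowMs / this.slotMs);
    const oldest = current - this.windowMs / this.slotMs + 1;
    let total = 0;
    for (const [slot, count] of this.slots) {
      if (slot < oldest) this.slots.delete(slot);
      else total += count;
    }
    return total;
  }

  remaining(nowMs) {
    return Math.max(0, this.limit - this.used(nowMs));
  }

  // Records the tokens and returns true if they fit; otherwise changes nothing.
  tryConsume(tokens, nowMs) {
    if (tokens > this.remaining(nowMs)) return false;
    const slot = Math.floor(nowMs / this.slotMs);
    this.slots.set(slot, (this.slots.get(slot) ?? 0) + tokens);
    return true;
  }
}

// Hypothetical header names; use whatever your API already documents.
function budgetHeaders(window, nowMs) {
  return {
    "X-RateLimit-Limit-Tokens": String(window.limit),
    "X-RateLimit-Remaining-Tokens": String(window.remaining(nowMs)),
  };
}

const tpm = new SlidingWindow(1000);
const burst = tpm.tryConsume(800, 5000);     // fits: the window is empty
const overflow = tpm.tryConsume(300, 15000); // denied: 800 is still in the window
const later = tpm.tryConsume(300, 65000);    // fits: the first slice has aged out
const headers = budgetHeaders(tpm, 65000);
```

The 10-second slots are the trade-off knob: finer slots release budget more smoothly as old usage ages out, at the cost of more keys per tenant.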