Token budget design

Every Meridian request rides on a finite token envelope. A disciplined budget keeps latency predictable, costs bounded, and quality high. This recipe shows how to carve up the envelope between system prompt, retrieved context, conversation history, and model output so nothing gets silently truncated at the edge.

1. Reserve output first

Pin max_tokens before you spend a single token on input. Reasoning models burn hidden chain-of-thought against this same ceiling, so set at least 2048 for any model whose id starts with gpt-5, o4, or grok-reasoning.
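
The floor described above can be sketched as a small helper. This is only an illustration: reserveOutput, REASONING_PREFIXES, and REASONING_MIN_OUTPUT are hypothetical names, not part of Meridian, and the prefix list is copied from the paragraph above.

```javascript
// Prefixes and floor come from the recipe text; the names are hypothetical.
const REASONING_PREFIXES = ["gpt-5", "o4", "grok-reasoning"];
const REASONING_MIN_OUTPUT = 2048;

// Returns the max_tokens value to send for a given model id.
// Reasoning models are raised to the floor; other models keep the request as-is.
function reserveOutput(modelId, requestedMaxTokens) {
  const isReasoning = REASONING_PREFIXES.some((prefix) =>
    modelId.startsWith(prefix)
  );
  return isReasoning
    ? Math.max(requestedMaxTokens, REASONING_MIN_OUTPUT)
    : requestedMaxTokens;
}
```

Calling reserveOutput with a reasoning model id and a request of 512 would yield 2048, while a larger request passes through unchanged.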

2. Tier the input

Split the remainder into three tiers: system instructions (fixed), retrieved context (elastic), and conversation history (compressible). When you run out of room, summarize history before you truncate retrieval — users notice missing facts faster than they notice forgotten turns.
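
One way to express that ordering in code is shown below. It is a sketch under assumptions: planInput and historyFloor are hypothetical names, all values are token counts you have already measured, and the actual summarization step is left to the caller.

```javascript
// Decides how many tokens each tier may keep.
// History is compressed first (down to historyFloor); retrieval is trimmed only
// if that is not enough. The system prompt is treated as fixed.
function planInput({ modelWindow, reservedOutput, system, retrieval, history, historyFloor = 0 }) {
  const available = modelWindow - reservedOutput - system;
  if (available < 0) {
    throw new RangeError("system prompt plus reserved output exceed the window");
  }

  let overflow = Math.max(0, retrieval + history - available);

  // 1. Summarize history down toward the floor.
  const historyCut = Math.min(overflow, Math.max(0, history - historyFloor));
  overflow -= historyCut;

  // 2. Only then truncate retrieved context.
  const retrievalCut = Math.min(overflow, retrieval);
  overflow -= retrievalCut;

  // 3. Last resort: retrieval is gone, so the remainder comes out of history.
  const keptHistory = history - historyCut - overflow;

  return {
    system,
    retrieval: retrieval - retrievalCut,
    history: keptHistory,
    summarizeHistory: keptHistory < history,
  };
}
```

The returned history value is a target size for the summary, not a summary itself; how the turns get compressed is outside this sketch.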

3. Measure with the router

Meridian's azure/model-router returns prompt and completion counts on every response. Log both, then alert when the p95 prompt size crosses 80% of the model's window. That is the point where truncation starts shaping answers without warning.
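
The alert condition could be computed roughly like this. It is a minimal sketch that assumes you have already collected the logged prompt counts into an array; percentile and shouldAlert are hypothetical helper names, and the nearest-rank method is one of several reasonable ways to define p95.

```javascript
// Nearest-rank percentile over a list of numeric samples.
function percentile(values, p) {
  if (values.length === 0) {
    throw new RangeError("no samples");
  }
  const sorted = [...values].sort((a, b) => a - b);
  // 1-based rank; multiply before dividing to avoid floating-point drift.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// True when the p95 prompt size reaches the threshold share of the window.
function shouldAlert(promptTokenSamples, modelWindow, threshold = 0.8) {
  return percentile(promptTokenSamples, 95) >= threshold * modelWindow;
}
```

In practice the samples would come from a rolling time window of logged responses, so that a single oversized request does not dominate the signal.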

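// All values are token counts for a single request.
const budget = {
  model_window: 128000,   // total context window of the target model
  reserved_output: 4096,  // max_tokens, pinned first (step 1)
  system_prompt: 800,     // fixed tier
  retrieval_max: 64000,   // elastic tier
  history_max: 128000 - 4096 - 800 - 64000, // compressible tier: the remainder (59104)
};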

// Always: sum(parts) + reserved_output <= model_window
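
The invariant in that comment can be enforced at startup with a small check. The function below is a hypothetical sketch that assumes a budget object with the same field names as the one above, and it treats every field other than model_window and reserved_output as an input part.

```javascript
// Throws if the input parts plus the reserved output exceed the window;
// otherwise returns the unallocated headroom in tokens.
function checkBudget({ model_window, reserved_output, ...parts }) {
  const inputTotal = Object.values(parts).reduce((sum, n) => sum + n, 0);
  const headroom = model_window - reserved_output - inputTotal;
  if (headroom < 0) {
    throw new RangeError(`budget exceeds model window by ${-headroom} tokens`);
  }
  return headroom;
}
```

For the budget shown above the headroom would be 0, since history_max is defined as exactly the remainder.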