Recipe: LLM token budget controller
Keep reasoning models from burning your entire context window before they emit a single line.
The problem
Reasoning models (gpt-5, Kimi-K2.6, o4-mini) can spend 80% of your token budget on internal chain-of-thought. Your visible answer arrives truncated or empty. You pay for tokens you never see.
The fix
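Set max_tokens to cap total spend, meaning thinking plus visible output. On some APIs the equivalent parameter has a different name, for example max_completion_tokens in OpenAI's Chat Completions API. For reasoning models, also set reasoning_effort to low or medium. Budget your visible output separately from thinking tokens so the answer always has room to finish.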
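A minimal sketch of how these two settings might be assembled into request parameters, assuming a chat-style API that accepts max_tokens and reasoning_effort under those names. The helper build_request_params and the REASONING_MODELS set are hypothetical, introduced only for illustration; check your provider's reference for the real parameter names.

```python
# Hypothetical helper: parameter names follow this recipe, not a specific SDK.
REASONING_MODELS = {"gpt-5", "Kimi-K2.6", "o4-mini"}  # the models named in this recipe


def build_request_params(model: str, total_budget: int, effort: str = "low") -> dict:
    """Return keyword arguments for a chat-completion style call."""
    if effort not in ("low", "medium"):
        raise ValueError("this recipe only uses 'low' or 'medium' effort")
    if total_budget <= 0:
        raise ValueError("total_budget must be positive")

    # max_tokens caps total spend for the call.
    params = {"model": model, "max_tokens": total_budget}

    # Only reasoning models get the effort setting.
    if model in REASONING_MODELS:
        params["reasoning_effort"] = effort
    return params


params = build_request_params("o4-mini", 4096)
```

The returned dictionary can be unpacked into whatever client call you use, after renaming keys if your provider expects different ones.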
Budget formula
    thinking_overhead = total_budget * 0.6             // cap thinking at 60%
    visible_budget = total_budget - thinking_overhead
    max_tokens = total_budget                          // caps thinking plus visible output
For a 4096-token budget, this reserves about 2458 tokens for thinking and leaves about 1638 for visible output.
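The arithmetic can be checked with a short sketch. The function name split_budget and its rounding choice (round the thinking share, give the remainder to output) are assumptions made for illustration, not part of the recipe.

```python
def split_budget(total_budget: int, thinking_fraction: float = 0.6) -> tuple[int, int]:
    """Split a total token budget into (thinking_overhead, visible_budget)."""
    if total_budget <= 0:
        raise ValueError("total_budget must be positive")
    if not 0.0 <= thinking_fraction <= 1.0:
        raise ValueError("thinking_fraction must be between 0 and 1")

    # Thinking is capped at a fraction of the total; the rest is visible output.
    thinking_overhead = round(total_budget * thinking_fraction)
    visible_budget = total_budget - thinking_overhead
    return thinking_overhead, visible_budget


thinking, visible = split_budget(4096)
print(thinking, visible)  # 2458 1638
```

Computing the visible budget as the remainder keeps the two parts summing exactly to the total, whatever the rounding does.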
System prompt snippet
If you are a reasoning model, keep internal reasoning short. The caller has set a token budget: if you spend it all thinking, your visible answer will be empty or truncated. Prefer a clean final answer over an exhaustive chain of thought.
Circuit breaker
Wrap every LLM call in a token counter. If the model uses more than 90% of the budget without finishing (no stop sequence, stream still open), abort the stream and return the partial output collected so far. Never ship an empty response to the user.
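One possible shape for such a breaker, sketched under several assumptions: the stream is any iterable of text chunks, tokens are approximated by a whitespace word count (swap in your provider's tokenizer), and an empty result is replaced by a fallback message. The names guarded_stream, count_tokens, and the fallback text are hypothetical.

```python
from typing import Iterable


def count_tokens(text: str) -> int:
    """Assumption: whitespace word count as a rough stand-in for a real tokenizer."""
    return len(text.split())


def guarded_stream(
    chunks: Iterable[str],
    budget: int,
    threshold: float = 0.9,
    fallback: str = "[No answer produced within the token budget.]",
) -> tuple[str, bool]:
    """Consume a stream until it ends or crosses threshold * budget.

    Returns (text, aborted). The text is never empty: if nothing was
    collected, the fallback message is returned instead.
    """
    limit = int(budget * threshold)
    used = 0
    parts: list[str] = []
    aborted = False

    for chunk in chunks:
        parts.append(chunk)
        used += count_tokens(chunk)
        if used > limit:
            # Budget nearly exhausted and the stream has not ended: trip the breaker.
            aborted = True
            close = getattr(chunks, "close", None)  # generators expose close()
            if callable(close):
                close()
            break

    text = "".join(parts).strip()
    return (text if text else fallback), aborted


answer, aborted = guarded_stream(iter(["Short ", "answer."]), budget=100)
```

If the loop finishes without tripping the breaker, the stream ended on its own and aborted stays False. Note that this only counts tokens the caller can see in the stream; hidden reasoning tokens would have to come from the provider's usage reporting.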