Recipe: LLM token budget controller

Keep reasoning models from burning your entire context window before they emit a single line.

The problem

Reasoning models (gpt-5, Kimi-K2.6, o4-mini) can spend as much as 80% of your token budget on internal chain-of-thought. The visible answer then arrives truncated or empty, and you pay for tokens you never see.

The fix

Set max_tokens to cap total spend. For reasoning models, also set reasoning_effort to low or medium. Budget your visible output separately from thinking tokens.
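As a rough illustration of the settings above, the sketch below builds the request parameters as a plain dictionary. The function name build_request_params is hypothetical, and the keys max_tokens and reasoning_effort are simply the names used in this recipe; some provider APIs spell the cap differently, so check your provider's reference before sending this anywhere.

```python
from typing import Any

# The two effort levels this recipe recommends for reasoning models.
ALLOWED_EFFORTS = ("low", "medium")


def build_request_params(model: str, total_budget: int, effort: str = "low") -> dict[str, Any]:
    """Return request parameters that cap total spend and limit reasoning effort.

    Hypothetical helper: key names follow the recipe text, not a specific SDK.
    """
    if total_budget <= 0:
        raise ValueError("total_budget must be a positive token count")
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}")
    return {
        "model": model,
        "max_tokens": total_budget,    # caps total spend for the call
        "reasoning_effort": effort,    # keeps chain-of-thought short
    }


params = build_request_params("o4-mini", total_budget=4096, effort="low")
print(params)
```

Rejecting any effort level other than low or medium is a design choice here: it makes an accidental high-effort call fail loudly instead of silently draining the budget.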

Budget formula

thinking_overhead = total_budget * 0.6  // cap thinking at 60% of the total
visible_budget = total_budget - thinking_overhead
max_tokens = total_budget  // the cap covers thinking and visible tokens together

For a 4096-token budget, reserve 2458 for thinking and 1638 for output.
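The split can be computed with a small helper. This is a minimal sketch, not a reference implementation: the names TokenBudget and split_budget are made up for illustration, and the 60% thinking share is the cap from the formula above, exposed as a parameter so it can be tuned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    total: int     # hard cap on thinking plus visible tokens
    thinking: int  # share reserved for internal reasoning
    visible: int   # share left for the answer the user sees


def split_budget(total_budget: int, thinking_pct: int = 60) -> TokenBudget:
    """Split a total token budget into thinking and visible shares."""
    if total_budget <= 0:
        raise ValueError("total_budget must be positive")
    if not 0 <= thinking_pct < 100:
        raise ValueError("thinking_pct must be in the range [0, 100)")
    # Integer arithmetic avoids float rounding. The visible share is rounded
    # down and thinking takes the remainder, so the two always sum to total.
    visible = total_budget * (100 - thinking_pct) // 100
    return TokenBudget(total=total_budget, thinking=total_budget - visible, visible=visible)


budget = split_budget(4096)
print(budget)
```

Giving the rounding remainder to the thinking share keeps the visible share a conservative lower bound, which is the number worth planning your output length around.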

System prompt snippet

If you are a reasoning model, keep internal
reasoning short. The caller has set a token
budget — if you spend it all thinking, your
visible answer will be empty or truncated.
Prefer a clean final answer over an
exhaustive chain of thought.

Circuit breaker

Wrap every LLM call in a token counter. If the model passes 90% of the budget without emitting a stop sequence, abort the stream and return the partial output. If the partial output is empty, return a fallback message instead. Never ship an empty response to the user.
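One way to sketch the circuit breaker is as a wrapper around any iterable of streamed text chunks. Everything named here is an assumption for illustration: guarded_stream, GuardedResult, and the whitespace-based count_tokens_naive are hypothetical, and a real deployment would plug in the provider's tokenizer. Note also that this only counts tokens that appear in the stream; hidden reasoning tokens need whatever usage reporting your provider offers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GuardedResult:
    text: str         # partial or complete answer, never empty
    truncated: bool   # True if the breaker aborted the stream
    tokens_used: int  # tokens counted before the stream ended or was aborted


def count_tokens_naive(text: str) -> int:
    """Stand-in counter (whitespace-separated words); swap in a real tokenizer."""
    return len(text.split())


def guarded_stream(
    chunks: Iterable[str],
    budget: int,
    count_tokens: Callable[[str], int] = count_tokens_naive,
    threshold: float = 0.9,
    fallback: str = "[No answer produced within the token budget.]",
) -> GuardedResult:
    """Consume a stream, aborting once usage passes threshold * budget."""
    limit = int(budget * threshold)
    used = 0
    parts: list[str] = []
    truncated = False
    for chunk in chunks:
        used += count_tokens(chunk)
        parts.append(chunk)
        if used > limit:
            truncated = True
            break  # abort: stop reading from the stream
    # Close generator-style streams so the producer can release resources.
    close = getattr(chunks, "close", None)
    if callable(close):
        close()
    text = "".join(parts).strip()
    # Never hand an empty response back to the caller.
    return GuardedResult(text=text or fallback, truncated=truncated, tokens_used=used)


result = guarded_stream(iter(["a b ", "c d ", "e f ", "g h "]), budget=5)
print(result.truncated, repr(result.text))
```

Returning a truncated flag alongside the text lets the caller decide whether to show the partial answer as-is, label it as incomplete, or retry with a larger budget.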