Guides

Cost optimization checklist

Eight actionable tactics to reduce your LLM spend without sacrificing quality. Apply these patterns across any provider.

1
Pick the smallest model that works
Start with a lightweight model like GPT-4o-mini or Claude Haiku for classification, extraction, and simple Q&A. Reserve GPT-4o or Claude Opus only for tasks that genuinely require deep reasoning. A 10× price gap between tiers means every prompt you move down saves real money.
2
Trim your system prompt
System prompts are charged at the same per-token rate as user messages. Remove filler words, redundant instructions, and examples that the model already understands. A 200-token system prompt vs a 2,000-token one saves 1,800 tokens on every single request.
3
Set max_tokens explicitly
Without a cap, the model can generate far more output than you need. If your UI only displays 300 words, set max_tokens to ~400. You pay for every output token, and unused tokens are pure waste.
4
Cache aggressively
Use prompt caching (Anthropic) or context caching (Google Gemini) for repeated system prompts and long reference documents. Cache hits can reduce cost by 90% on the cached portion. Structure your prompts so the static prefix is cacheable.
5
Batch independent calls
Instead of firing 10 sequential API calls, send them in parallel when there are no data dependencies. Latency stays roughly the same, but you avoid paying for idle time and reduce round-trip overhead.
6
Route by complexity
Build a lightweight classifier that inspects the user query and routes simple requests to a cheap model and complex ones to a premium model. Even a naive keyword-based router can shift 60-80% of traffic to the cheaper tier.
7
Stream and stop early
Stream responses so the frontend can display results as they arrive. If the user navigates away or the answer is clearly complete, abort the stream. You stop paying for tokens the moment the connection drops.
8
Compress conversation history
Long conversations balloon token counts. Summarize older turns into a compact digest and replace the raw history. A 200-token summary of the last 10 messages costs far less than re-sending all 10 messages on every subsequent turn.

Want to track your savings automatically? Explore the docs

Cost optimization checklist

Pick the smallest model that works

Trim your system prompt

Set max_tokens explicitly

Cache aggressively

Batch independent calls

Route by complexity

Stream and stop early

Compress conversation history