Guides

Cost optimization checklist

Eight actionable tactics to reduce your LLM spend without sacrificing quality. Apply these patterns across any provider.

  1. 1

    Pick the smallest model that works

    Start with a lightweight model like GPT-4o-mini or Claude Haiku for classification, extraction, and simple Q&A. Reserve GPT-4o or Claude Opus only for tasks that genuinely require deep reasoning. A 10× price gap between tiers means every prompt you move down saves real money.

  2. 2

    Trim your system prompt

    System prompts are charged at the same per-token rate as user messages. Remove filler words, redundant instructions, and examples that the model already understands. A 200-token system prompt vs a 2,000-token one saves 1,800 tokens on every single request.

  3. 3

    Set max_tokens explicitly

    Without a cap, the model can generate far more output than you need. If your UI only displays 300 words, set max_tokens to ~400. You pay for every output token, and unused tokens are pure waste.

  4. 4

    Cache aggressively

    Use prompt caching (Anthropic) or context caching (Google Gemini) for repeated system prompts and long reference documents. Cache hits can reduce cost by 90% on the cached portion. Structure your prompts so the static prefix is cacheable.

  5. 5

    Batch independent calls

    Instead of firing 10 sequential API calls, send them in parallel when there are no data dependencies. Latency stays roughly the same, but you avoid paying for idle time and reduce round-trip overhead.

  6. 6

    Route by complexity

    Build a lightweight classifier that inspects the user query and routes simple requests to a cheap model and complex ones to a premium model. Even a naive keyword-based router can shift 60-80% of traffic to the cheaper tier.

  7. 7

    Stream and stop early

    Stream responses so the frontend can display results as they arrive. If the user navigates away or the answer is clearly complete, abort the stream. You stop paying for tokens the moment the connection drops.

  8. 8

    Compress conversation history

    Long conversations balloon token counts. Summarize older turns into a compact digest and replace the raw history. A 200-token summary of the last 10 messages costs far less than re-sending all 10 messages on every subsequent turn.

Want to track your savings automatically? Explore the docs