Recipe: Prompt caching strategy
Reduce latency and token spend by structuring prompts so the model reuses cached prefix computations across requests.
Why it matters
When successive requests begin with exactly the same tokens, the inference engine can reuse the cached key-value tensors for that shared prefix and skip recomputing them. This cuts time-to-first-token by up to 80% on cached requests and lowers per-request cost.
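To make the prefix requirement concrete, here is a minimal sketch of how much of a request could be reused. It assumes a whitespace split as a stand-in for a real tokenizer, and `shared_prefix_len` is a hypothetical helper, not part of any inference runtime.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Assumption: str.split() stands in for a real tokenizer.
first = "You are a code reviewer . Review file A".split()
second = "You are a code reviewer . Review file B".split()
reordered = "Review file B . You are a code reviewer".split()

print(shared_prefix_len(first, second))     # 8: only the last token differs
print(shared_prefix_len(first, reordered))  # 0: the first token already differs
```

The second case illustrates why ordering matters: the same words in a different order share no prefix, so nothing would be reusable.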
The pattern
- Static prefix — system instructions, tool definitions, few-shot examples go first and never change.
- Dynamic suffix — user query, context documents, or variable data appended at the end.
- Cache breakpoints — mark boundaries explicitly so the runtime knows where reuse stops.
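The three parts of the pattern can be sketched as a small prompt builder. This is an illustration under stated assumptions: the bracketed labels and the boundary marker are plain strings, `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and a real runtime would expose its own mechanism for marking cache breakpoints.

```python
import hashlib

# Static prefix: defined once and never edited per request.
STATIC_PREFIX = "\n".join([
    "[system] You are a code reviewer.",
    "[tools] read_file, grep, lint",
    "[examples] 3 few-shot pairs",
])

# Hypothetical marker; real runtimes define their own breakpoint syntax.
CACHE_BOUNDARY = "--- cache boundary ---"


def build_prompt(user_input: str) -> str:
    """Place the static prefix first and append the dynamic suffix last."""
    return f"{STATIC_PREFIX}\n{CACHE_BOUNDARY}\n[user] {user_input}"


def prefix_fingerprint(prompt: str) -> str:
    """Hash everything before the first boundary to detect prefix drift."""
    prefix, _, _ = prompt.partition(CACHE_BOUNDARY)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


p1 = build_prompt("Review this PR diff: first placeholder diff")
p2 = build_prompt("Review this PR diff: second placeholder diff")

# Different user input, identical prefix: both requests can share a cache entry.
print(prefix_fingerprint(p1) == prefix_fingerprint(p2))  # True
```

Comparing fingerprints in a test or at startup is one way to catch accidental edits to the prefix (a timestamp, a reordered tool list) before they show up as a lower hit rate.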
Example structure
[system] You are a code reviewer.
[tools] read_file, grep, lint
[examples] 3 few-shot pairs
--- cache boundary ---
[user] Review this PR diff:
<diff>...</diff>
Key metrics
- Cache hit rate target: >90% on prefix tokens
- TTFT reduction: 60–80% on cached requests
- Cost savings: proportional to prefix/total token ratio
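The metrics above can be tracked with a few lines of arithmetic. The sketch below is a rough model rather than a billing formula: `cache_hit_rate` and `estimated_savings` are hypothetical helpers, and `cached_discount` (the fraction of the normal input price saved on a cached token) is an assumed parameter whose real value depends on the provider.

```python
def cache_hit_rate(cached_prefix_tokens: int, total_prefix_tokens: int) -> float:
    """Share of prefix tokens that were served from cache."""
    if total_prefix_tokens == 0:
        return 0.0
    return cached_prefix_tokens / total_prefix_tokens


def estimated_savings(prefix_tokens: int, total_tokens: int,
                      hit_rate: float, cached_discount: float) -> float:
    """Estimated fraction of input-token cost saved.

    Savings scale with the prefix/total token ratio, then with how often
    the prefix is actually cached and how much cheaper a cached token is.
    """
    if total_tokens == 0:
        return 0.0
    return (prefix_tokens / total_tokens) * hit_rate * cached_discount


# Illustrative numbers only; the 0.5 discount is an assumption.
rate = cache_hit_rate(cached_prefix_tokens=7200, total_prefix_tokens=8000)
saved = estimated_savings(prefix_tokens=8000, total_tokens=10000,
                          hit_rate=rate, cached_discount=0.5)

print(f"hit rate: {rate:.2f}")        # hit rate: 0.90
print(f"cost saved: {saved:.2f}")     # cost saved: 0.36
```

With these inputs, a prefix that makes up 80% of the request and hits the cache 90% of the time saves roughly a third of the input cost, which shows why a longer static prefix and a stable hit rate both matter.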