Recipe: Prompt caching strategy

Reduce latency and token spend by structuring prompts so the model reuses cached prefix computations across requests.

Why it matters

When successive requests start with the same prompt prefix, the inference engine can cache the key-value (KV) tensors computed for that prefix and skip recomputing them. This cuts time-to-first-token (TTFT) by up to 80% on cached requests and lowers per-request cost.
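
To make the prefix-matching idea concrete, here is a minimal sketch. It assumes the cache matches on exact leading tokens; whitespace splitting stands in for a real tokenizer, and `shared_prefix_len` is a hypothetical helper, not part of any particular runtime.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace splitting is only a stand-in for a real tokenizer (assumption).
req1 = "You are a code reviewer. Review this PR diff: A".split()
req2 = "You are a code reviewer. Review this PR diff: B".split()
req3 = "Today is Monday. You are a code reviewer.".split()

# Only the final token differs, so everything before it is reusable.
reusable_12 = shared_prefix_len(req1, req2)
# The very first token differs, so nothing is reusable even though
# most of the words also appear in req1.
reusable_13 = shared_prefix_len(req1, req3)
```

The third request shows why ordering matters: content that changes early in the prompt leaves no shared prefix to reuse.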

The pattern

Static prefix: system instructions, tool definitions, and few-shot examples go first and never change.
Dynamic suffix: the user query, context documents, or other variable data is appended at the end.
Cache breakpoints: mark boundaries explicitly so the runtime knows where reuse stops.
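
The three rules above can be sketched as a small ordering step. This is an illustration under assumptions: `Segment`, its `static` flag, and `assemble` are hypothetical names, and a real runtime may express breakpoints differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    role: str
    text: str
    static: bool  # True if the text is identical across requests


def assemble(segments: list[Segment]) -> tuple[list[Segment], int]:
    """Put static segments first and return the index where reuse stops."""
    # sorted() is stable, so segments keep their relative order within
    # the static group and within the dynamic group.
    ordered = sorted(segments, key=lambda s: not s.static)
    breakpoint_index = sum(1 for s in ordered if s.static)
    return ordered, breakpoint_index


segments = [
    Segment("user", "Review this PR diff: ...", static=False),
    Segment("system", "You are a code reviewer.", static=True),
    Segment("tools", "read_file, grep, lint", static=True),
]
ordered, breakpoint_index = assemble(segments)
```

Everything before `breakpoint_index` forms the cacheable prefix, and everything from that index onward is the dynamic suffix.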

Example structure

[system] You are a code reviewer.
[tools] read_file, grep, lint
[examples] 3 few-shot pairs
--- cache boundary ---
[user] Review this PR diff:
<diff>...</diff>
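
One way to check that a layout like this keeps its prefix stable is to fingerprint everything before the boundary and compare across requests. The sketch below assumes the boundary is a literal marker line; `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and the marker string is illustrative rather than a directive any runtime recognizes.

```python
import hashlib

# Static part of the prompt: must be byte-identical on every request.
STATIC_PREFIX = (
    "[system] You are a code reviewer.\n"
    "[tools] read_file, grep, lint\n"
)
# Illustrative marker only (assumption), not a real runtime directive.
CACHE_BOUNDARY = "--- cache boundary ---\n"


def build_prompt(user_query: str) -> str:
    """Append the variable user query after the fixed prefix and boundary."""
    return STATIC_PREFIX + CACHE_BOUNDARY + "[user] " + user_query


def prefix_fingerprint(prompt: str) -> str:
    """Hash everything before the boundary marker."""
    prefix, _, _ = prompt.partition(CACHE_BOUNDARY)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


p1 = build_prompt("Review this PR diff: ...")
p2 = build_prompt("Explain this lint warning: ...")

# The prompts differ, but their prefixes hash to the same value.
same_prefix = prefix_fingerprint(p1) == prefix_fingerprint(p2)
```

A fingerprint mismatch between requests is a cheap signal that something variable has leaked into the static prefix.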

Key metrics

Cache hit rate target: >90% on prefix tokens
TTFT reduction: 60–80% on cached requests
Cost savings: proportional to prefix/total token ratio
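
The hit-rate and cost-savings metrics can be estimated with a small calculation. This is a rough sketch under stated assumptions: it covers input tokens only, ignores any surcharge for writing to the cache, and treats `cached_discount` (the fraction by which a cached token is cheaper) as a hypothetical parameter. The numbers are illustrative, not measured.

```python
def prefix_hit_rate(cached_prefix_tokens: int, total_prefix_tokens: int) -> float:
    """Share of prefix tokens that were served from the cache."""
    if total_prefix_tokens == 0:
        return 0.0
    return cached_prefix_tokens / total_prefix_tokens


def estimated_cost_savings(
    prefix_tokens: int,
    total_tokens: int,
    hit_rate: float,
    cached_discount: float,
) -> float:
    """Estimated fraction of input-token cost saved.

    savings = (prefix_tokens / total_tokens) * hit_rate * cached_discount
    """
    if total_tokens == 0:
        return 0.0
    return (prefix_tokens / total_tokens) * hit_rate * cached_discount


# Illustrative numbers: an 8,000-token prefix in a 10,000-token prompt,
# 7,200 of the prefix tokens served from cache, cached tokens half price.
hit_rate = prefix_hit_rate(7_200, 8_000)
savings = estimated_cost_savings(8_000, 10_000, hit_rate, 0.5)
```

With these inputs the hit rate sits at the 90% target and the estimate comes to roughly 36% of input-token cost; a longer prefix relative to the total prompt raises the estimate proportionally.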