Recipe

Semantic cache design

Cut LLM spend and latency by serving near-duplicate prompts from an embedding-indexed cache. This recipe walks through the three load-bearing decisions: similarity threshold, eviction, and invalidation under prompt drift.

1. Embed and key the request

Normalize the prompt (trim and collapse whitespace, lowercase ASCII letters, drop volatile tokens such as timestamps), then embed it with a small embedding model. The embedding is the cache key. Store the normalized prompt alongside it so you can verify a hit before returning.
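A minimal sketch of the normalization step, assuming ISO-8601 timestamps are the only volatile tokens worth stripping. The helper name normalizePrompt, the regex, and the CacheEntry shape are illustrative assumptions, not part of any particular cache library.

```typescript
// Assumption: only ISO-8601 style timestamps count as volatile tokens here.
const ISO_TIMESTAMP =
  /\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})?/g;

// Hypothetical helper: returns the text that gets embedded and stored.
function normalizePrompt(raw: string): string {
  return raw
    .replace(ISO_TIMESTAMP, " ") // drop timestamps before lowercasing ("T", "Z")
    .replace(/[A-Z]/g, (c) => c.toLowerCase()) // lowercase ASCII letters only
    .replace(/\s+/g, " ") // collapse whitespace runs
    .trim();
}

// Hypothetical record shape: the embedding is the key, and the normalized
// prompt is kept next to the completion so a hit can be checked.
interface CacheEntry {
  embedding: number[];
  normalizedPrompt: string;
  completion: string;
}
```

Lowercasing only ASCII letters leaves text in other scripts untouched, which avoids locale-specific case rules changing the key.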

2. Threshold the similarity

Cosine similarity above 0.97 is a safe hit for factual lookups. Drop to 0.93 on chat-style routes that should tolerate paraphrase. Below 0.90 you risk serving the answer to a different question. These cutoffs depend on the embedding model, so treat them as starting points and tune per route.

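// `prompt` should already be normalized as in step 1.
const hit = await cache.query({
  embedding: await embed(prompt),
  threshold: 0.97, // factual-lookup cutoff; use a lower value on chat routes
  ttl: 3600,
});
// Serve the cached completion on a hit; otherwise fall through to the model call.
if (hit) return hit.completion;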
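The threshold check behind a query like this can be sketched as follows. This is an illustration only: the route names, ROUTE_THRESHOLDS, and isHit are hypothetical, and falling back to the strictest cutoff for unknown routes is an assumption.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical route names; the values are the starting points from the text.
const ROUTE_THRESHOLDS: Record<string, number> = { factual: 0.97, chat: 0.93 };
// Assumption: routes without an explicit entry use the strictest cutoff.
const DEFAULT_THRESHOLD = 0.97;

// True when the cached embedding is close enough to the query for this route.
function isHit(query: number[], cached: number[], route: string): boolean {
  const threshold = ROUTE_THRESHOLDS[route] ?? DEFAULT_THRESHOLD;
  return cosineSimilarity(query, cached) >= threshold;
}
```

Keeping the cutoffs in one table makes per-route tuning a data change instead of a code change.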

3. Evict and invalidate

Combine a TTL with LRU eviction: the TTL bounds staleness and LRU bounds memory. For RAG routes, tag each entry with the hash of its source document and bulk-evict the tagged entries when the corpus updates. Never cache personalized completions without a user-scoped key prefix.
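The eviction rules above can be sketched with an in-memory store. This is a simplified illustration, not a production design: it is keyed by string rather than by embedding, and the class name TtlLruStore, its methods, and the "shared:" / "user:" prefixes are all hypothetical.

```typescript
interface StoredEntry {
  completion: string;
  expiresAt: number; // epoch milliseconds
  sourceHash: string | undefined; // hash of the source document on RAG routes
}

class TtlLruStore {
  // A Map iterates in insertion order, so the first key is the least recently used.
  private entries = new Map<string, StoredEntry>();
  private maxEntries: number;
  private ttlMs: number;

  constructor(maxEntries: number, ttlMs: number) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
  }

  // Personalized entries get a user-scoped prefix so they cannot be served to another user.
  static scopedKey(key: string, userId?: string): string {
    return userId === undefined ? `shared:${key}` : `user:${userId}:${key}`;
  }

  get(key: string, now: number = Date.now()): string | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // TTL expired
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.completion;
  }

  set(key: string, completion: string, sourceHash?: string, now: number = Date.now()): void {
    this.entries.delete(key);
    this.entries.set(key, { completion, expiresAt: now + this.ttlMs, sourceHash });
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest); // LRU eviction
    }
  }

  // Bulk-evict every entry tagged with a source document hash; returns the count removed.
  evictBySourceHash(sourceHash: string): number {
    const stale: string[] = [];
    this.entries.forEach((entry, key) => {
      if (entry.sourceHash === sourceHash) stale.push(key);
    });
    stale.forEach((key) => this.entries.delete(key));
    return stale.length;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

When a document in the corpus changes, calling evictBySourceHash with its previous hash removes every completion derived from the old version in one pass.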