Semantic cache design
Cut LLM spend and latency by serving near-duplicate prompts from an embedding-indexed cache. This recipe walks through the three load-bearing decisions: similarity threshold, eviction, and invalidation under prompt drift.
1. Embed and key the request
Normalize the prompt (strip whitespace, lowercase ASCII letters, drop volatile tokens such as timestamps), then embed it with a small model. The embedding serves as the cache key: lookups are nearest-neighbor searches against it rather than exact matches. Store the normalized prompt alongside the embedding so you can verify a hit before returning it.
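The normalization step can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `normalizePrompt`, `buildCacheKey`, the `Embedder` type, and the timestamp pattern are hypothetical names and choices made for this example, and "strip whitespace" is interpreted here as trimming and collapsing runs of whitespace.

```typescript
// Assumption: only ISO-8601-style timestamps count as volatile tokens here.
// Extend this list with whatever is volatile in your own prompts.
const VOLATILE_PATTERNS: RegExp[] = [
  /\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?(\.\d+)?(Z|[+-]\d{2}:?\d{2})?/g,
];

function normalizePrompt(raw: string): string {
  let text = raw;
  // Drop volatile tokens first so they never reach the embedder.
  for (const pattern of VOLATILE_PATTERNS) {
    text = text.replace(pattern, " ");
  }
  // Lowercase ASCII letters only; non-ASCII characters are left untouched.
  text = text.replace(/[A-Z]/g, (c) => c.toLowerCase());
  // Collapse whitespace runs to one space and trim both ends.
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical embedder signature; plug in whichever embedding client you use.
type Embedder = (text: string) => Promise<number[]>;

interface CacheKey {
  embedding: number[];      // used for the similarity lookup
  normalizedPrompt: string; // kept so a hit can be verified before returning
}

async function buildCacheKey(prompt: string, embed: Embedder): Promise<CacheKey> {
  const normalizedPrompt = normalizePrompt(prompt);
  return { embedding: await embed(normalizedPrompt), normalizedPrompt };
}
```

Embedding the normalized text rather than the raw prompt keeps two requests that differ only in casing or a timestamp on the same key.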
2. Threshold the similarity
Treat cosine similarity above 0.97 as a safe hit for factual lookups. Drop the threshold to 0.93 on chat-style routes to tolerate paraphrase. Below 0.90 you risk serving the answer to a different question. The right values depend on the embedding model, so tune them per route.
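For reference, the per-route threshold check can be sketched as below. The `Route` names, `ROUTE_THRESHOLDS`, and `isHit` are illustrative assumptions. The 0.97 and 0.93 values come from the text above and should be treated as starting points, not constants.

```typescript
// Cosine similarity between two embeddings of equal dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as matching nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical route names; thresholds mirror the values discussed above.
type Route = "factual" | "chat";

const ROUTE_THRESHOLDS: Record<Route, number> = {
  factual: 0.97,
  chat: 0.93,
};

// A candidate counts as a hit only when it is above the route's threshold.
function isHit(route: Route, similarity: number): boolean {
  return similarity > ROUTE_THRESHOLDS[route];
}
```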
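// `prompt` is assumed to be already normalized (step 1).
const hit = await cache.query({
  embedding: await embed(prompt),
  threshold: 0.97, // factual-lookup threshold; tune per route
  ttl: 3600,       // seconds
});
// Serve the cached completion on a hit; otherwise fall through to the model.
if (hit) return hit.completion;
3. Evict and invalidate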
Use TTL plus LRU: the TTL bounds staleness, and LRU bounds memory. For RAG routes, tag each entry with the hash of its source document and bulk-evict matching entries when the corpus updates. Never cache personalized completions without a user-scoped key prefix.
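One way to combine these rules is sketched below, as an in-memory illustration only. `TtlLruCache`, `scopedKey`, and `evictBySourceHash` are hypothetical names, and the string keys stand in for whatever identifier your vector index returns for a matched entry. The sketch relies on `Map` preserving insertion order, so the first key is always the least recently used.

```typescript
interface Entry {
  completion: string;
  expiresAt: number;            // epoch milliseconds
  sourceDocHash: string | null; // set on RAG routes, null elsewhere
}

class TtlLruCache {
  private entries = new Map<string, Entry>();
  private maxEntries: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(maxEntries: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.now = now;
  }

  // Prefix keys so personalized completions never leak across users.
  static scopedKey(userId: string | null, key: string): string {
    return userId === null ? `shared:${key}` : `user:${userId}:${key}`;
  }

  set(key: string, completion: string, ttlSeconds: number, sourceDocHash: string | null): void {
    this.entries.delete(key); // re-inserting moves the key to the newest position
    this.entries.set(key, {
      completion,
      expiresAt: this.now() + ttlSeconds * 1000,
      sourceDocHash,
    });
    // LRU: drop the oldest keys until the cache is back under its size cap.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    // TTL: expired entries are removed lazily on read.
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    // Mark as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.completion;
  }

  // Bulk-evict every entry derived from a source document that has changed.
  evictBySourceHash(hash: string): number {
    let removed = 0;
    this.entries.forEach((entry, key) => {
      if (entry.sourceDocHash === hash) {
        this.entries.delete(key);
        removed++;
      }
    });
    return removed;
  }
}
```

When a document is re-indexed, calling `evictBySourceHash` with its previous hash clears every completion that was generated from the old content. A production cache would more likely delegate this to the backing store's own tagging or expiry features.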