Prompt Caching
Eliminate redundant LLM round-trips by caching responses for identical prompts. A deterministic cache key built from the model name and full message payload makes false hits practically impossible (one would require a SHA-256 collision) while cutting latency and token spend on repeated requests.
Cache-Key Strategy
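The cache key is a SHA-256 hex digest of JSON.stringify({ model, messages }). JSON.stringify emits properties in insertion order, so the output is deterministic only when every caller builds its message objects with the same property order: { role, content } and { content, role } serialize differently and therefore hash to different keys. Given a consistent shape, two identical prompts always produce the same key. The model name is included so that switching from, say, gpt-4o to claude-3.5-sonnet does not return a cached reply generated by a different model.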
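If callers cannot guarantee a consistent property order, one option is to canonicalize the payload before hashing. The following is a minimal sketch, not part of the helper described here: canonicalize and buildStableCacheKey are hypothetical names, and the approach (recursively sorting object keys) is an assumption about what "same prompt" should mean for your application.

```typescript
import { createHash } from "node:crypto";

type Message = { role: string; content: string };

// Recursively sort object keys so property order cannot change the serialized form.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const k of Object.keys(source).sort()) {
      sorted[k] = canonicalize(source[k]);
    }
    return sorted;
  }
  return value;
}

// Hypothetical variant of the key builder that hashes the canonical form.
function buildStableCacheKey(model: string, messages: Message[]): string {
  const payload = JSON.stringify(canonicalize({ model, messages }));
  return `prompt:${createHash("sha256").update(payload).digest("hex")}`;
}

// The same message, built with two different property orders.
const a: Message[] = [{ role: "user", content: "Hello" }];
const b: Message[] = [{ content: "Hello", role: "user" }];

const plainA = JSON.stringify({ model: "gpt-4o", messages: a });
const plainB = JSON.stringify({ model: "gpt-4o", messages: b });

console.log(plainA === plainB); // false: plain JSON.stringify follows insertion order
console.log(buildStableCacheKey("gpt-4o", a) === buildStableCacheKey("gpt-4o", b)); // true
```

Array order is left untouched on purpose, since the order of messages in a conversation is part of the prompt.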
Client-side helper
// lib/cache-key.ts — deterministic key from model + messages
import { createHash } from "crypto";
export function buildCacheKey(
  model: string,
  messages: { role: string; content: string }[]
): string {
  const payload = JSON.stringify({ model, messages });
  return `prompt:${createHash("sha256").update(payload).digest("hex")}`;
}
Redis cache wrapper
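A thin wrapper around Upstash Redis stores the LLM response keyed by the digest. Each entry is written with a Redis expiry (ex) and also stores its own absolute expiry timestamp, which is re-checked on read, so an entry is not served past its TTL. The wrapper lives in @/lib/redis.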
// lib/redis.ts — thin wrapper over Upstash Redis
import { Redis } from "@upstash/redis";
const redis = Redis.fromEnv();
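export async function getCachedResponse(key: string) {
  // `ttl` holds the absolute expiry time in epoch milliseconds, not a duration.
  const cached = await redis.get<{ response: string; ttl: number }>(key);
  if (!cached) return null;
  // Read-side check that backs up the Redis-side expiry set in setCachedResponse.
  if (Date.now() > cached.ttl) {
    await redis.del(key);
    return null;
  }
  return cached.response;
}
export async function setCachedResponse(
  key: string,
  response: string,
  ttlSeconds: number
) {
  // `ex` makes Redis evict the key after ttlSeconds; the stored timestamp feeds the read-side check.
  await redis.set(key, { response, ttl: Date.now() + ttlSeconds * 1000 }, { ex: ttlSeconds });
}
API route integration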
Wire the cache into your chat endpoint. On a cache hit the response returns without an upstream call and is tagged source: cache; on a miss the upstream call is made, the result is stored for subsequent requests, and the response is tagged source: fresh.
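The hit/miss flow described above is the cache-aside pattern, and it can be sketched independently of Redis. In the sketch below, ResponseStore, MemoryStore, withCache, and fakeLLM are all hypothetical names introduced for illustration; the in-memory store is only a stand-in so the flow can be run and tested locally without a network connection.

```typescript
// Minimal store interface; a Redis-backed get/set pair could be adapted to it.
interface ResponseStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in for Redis (hypothetical, for local runs and tests only).
class MemoryStore implements ResponseStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

type CachedResult = { reply: string; source: "cache" | "fresh" };

// Cache-aside: read first, call upstream only on a miss, then store the result.
async function withCache(
  store: ResponseStore,
  key: string,
  ttlSeconds: number,
  fetchFresh: () => Promise<string>
): Promise<CachedResult> {
  const cached = await store.get(key);
  if (cached !== null) return { reply: cached, source: "cache" };
  const reply = await fetchFresh();
  await store.set(key, reply, ttlSeconds);
  return { reply, source: "fresh" };
}

// Two identical requests: the second should be served from the store.
async function demo() {
  const store = new MemoryStore();
  let upstreamCalls = 0;
  const fakeLLM = async () => {
    upstreamCalls += 1;
    return "Hi there";
  };
  const first = await withCache(store, "prompt:abc", 3600, fakeLLM);
  const second = await withCache(store, "prompt:abc", 3600, fakeLLM);
  console.log(first.source, second.source, upstreamCalls); // fresh cache 1
  return { first, second, upstreamCalls };
}

demo().catch(console.error);
```

Passing the store and the upstream call as parameters keeps the flow testable; the route handler then only has to supply the key, the TTL, and the real upstream function.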
// app/api/chat/route.ts — cache-aware chat endpoint
import { buildCacheKey } from "@/lib/cache-key";
import { getCachedResponse, setCachedResponse } from "@/lib/redis";
export async function POST(req: Request) {
  const { model, messages } = await req.json();
  const key = buildCacheKey(model, messages);
  const cached = await getCachedResponse(key);
  // Compare against null so an empty-string reply still counts as a hit.
  if (cached !== null) return Response.json({ reply: cached, source: "cache" });
  const reply = await callLLM(model, messages); // your upstream call (not defined here)
  await setCachedResponse(key, reply, 3600); // 1 h TTL
  return Response.json({ reply, source: "fresh" });
}
When to cache
- Deterministic prompts: system messages, few-shot examples, or structured extraction schemas that repeat verbatim across requests.
- High-traffic endpoints: public-facing chat widgets where many users submit identical queries. Near-identical queries hash to different keys and miss the cache unless they are normalized first.
- Expensive models: caching a single GPT-4o or Claude Opus response can save cents per hit, compounding quickly at scale.
Cache invalidation is handled entirely by TTL: entries are never deleted when a prompt changes, they simply expire. For prompts that depend on real-time data, set a short TTL (e.g. 60 s) or skip caching entirely. Because the key is a hash of the exact payload, any change to the prompt, even a single whitespace character, produces a different key and therefore a fresh cache entry.
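The whitespace sensitivity can be seen directly, along with one way to soften it. This is a sketch under assumptions: keyFor mirrors the key construction described earlier, while normalizeMessages is a hypothetical helper that is not part of the code above. Collapsing whitespace is only safe when whitespace carries no meaning, so it would be a poor fit for prompts that contain code or preformatted text.

```typescript
import { createHash } from "node:crypto";

type Message = { role: string; content: string };

// Same construction as the key helper: SHA-256 over JSON.stringify({ model, messages }).
function keyFor(model: string, messages: Message[]): string {
  const payload = JSON.stringify({ model, messages });
  return `prompt:${createHash("sha256").update(payload).digest("hex")}`;
}

// Hypothetical normalizer: trim each message and collapse runs of whitespace.
function normalizeMessages(messages: Message[]): Message[] {
  return messages.map((m) => ({
    role: m.role,
    content: m.content.trim().replace(/\s+/g, " "),
  }));
}

const tidy: Message[] = [{ role: "user", content: "What is prompt caching?" }];
const messy: Message[] = [{ role: "user", content: "  What is  prompt caching?\n" }];

// Raw payloads differ only in whitespace, yet the keys differ.
console.log(keyFor("gpt-4o", tidy) === keyFor("gpt-4o", messy)); // false

// After normalization both payloads are identical, so the keys match.
console.log(
  keyFor("gpt-4o", normalizeMessages(tidy)) === keyFor("gpt-4o", normalizeMessages(messy))
); // true
```

If normalization is used, the normalized messages should feed the cache key only; whether the upstream call receives the original or the normalized text is a separate decision.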