Performance

Prompt Caching

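Eliminate redundant LLM round-trips by caching responses for identical prompts. A deterministic cache key built from the model name and full message payload makes false hits practically impossible for those two inputs, while cutting latency and token spend on repeated requests. Anything outside the key, such as sampling parameters, is not distinguished by the cache.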

Cache-Key Strategy

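The cache key is a SHA-256 hex digest of JSON.stringify({ model, messages }). JSON.stringify emits properties in insertion order, so two requests whose messages are built the same way always produce the same key. A message carrying the same fields in a different order ({ content, role } instead of { role, content }) serializes differently and misses the cache. The model name is included so that switching from, say, gpt-4o to claude-3.5-sonnet does not return a cached reply that was generated by a different model.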

Cache-key helper

// lib/cache-key.ts — deterministic key from model + messages
import { createHash } from "crypto";

export function buildCacheKey(
  model: string,
  messages: { role: string; content: string }[]
): string {
  const payload = JSON.stringify({ model, messages });
  return `prompt:${createHash("sha256").update(payload).digest("hex")}`;
}
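
If callers cannot guarantee a consistent property order, one option is to sort object keys before hashing. The following is a minimal sketch under that assumption, not part of the helper above; canonicalize and buildStableCacheKey are hypothetical names introduced here for illustration.

```typescript
// Sketch only: canonicalize and buildStableCacheKey are hypothetical names.
import { createHash } from "node:crypto";

type Message = { role: string; content: string };

// Recursively rebuild a value with object keys in sorted order, so that
// property insertion order no longer affects the serialized string.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const k of Object.keys(source).sort()) {
      sorted[k] = canonicalize(source[k]);
    }
    return sorted;
  }
  return value;
}

export function buildStableCacheKey(model: string, messages: Message[]): string {
  // Array order is preserved on purpose: message order is part of the prompt.
  const payload = JSON.stringify(canonicalize({ model, messages }));
  return `prompt:${createHash("sha256").update(payload).digest("hex")}`;
}
```

With this variant, { role, content } and { content, role } hash to the same key, while a different model or a reordered conversation still produces a different one.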

Redis cache wrapper

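A thin wrapper around Upstash Redis stores the LLM response keyed by the digest. Each entry is written with a Redis expiry (ex) and also records its absolute expiry time in milliseconds, which the read path checks, so an expired entry is not served even if Redis has not yet evicted it. The wrapper lives in @/lib/redis.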

// lib/redis.ts — thin wrapper over Upstash Redis
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();

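// Stored shape: the reply plus its absolute expiry time (ms since epoch).
type CacheEntry = { response: string; expiresAt: number };

// Returns the cached reply, or null on a miss or an expired entry.
export async function getCachedResponse(key: string): Promise<string | null> {
  const cached = await redis.get<CacheEntry>(key);
  if (!cached) return null;
  // Redis already expires the key via `ex`; this check is a second guard.
  if (Date.now() > cached.expiresAt) {
    await redis.del(key);
    return null;
  }
  return cached.response;
}

// Stores the reply with a Redis-level expiry of ttlSeconds.
export async function setCachedResponse(
  key: string,
  response: string,
  ttlSeconds: number
): Promise<void> {
  const expiresAt = Date.now() + ttlSeconds * 1000;
  await redis.set(key, { response, expiresAt }, { ex: ttlSeconds });
}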
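
For unit tests or local development without a Redis instance, an in-memory stand-in with the same read and write semantics can be useful. The sketch below is an assumption-laden illustration: MemoryCache is a hypothetical class, not part of @upstash/redis, and it keeps entries only for the lifetime of the process.

```typescript
// Sketch only: MemoryCache is a hypothetical test double.
type Entry = { response: string; expiresAt: number };

export class MemoryCache {
  private entries = new Map<string, Entry>();
  private now: () => number;

  // `now` is injectable so tests can move time forward without sleeping.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Returns the stored reply, or null on a miss or an expired entry.
  async get(key: string): Promise<string | null> {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop expired entries lazily on read
      return null;
    }
    return entry.response;
  }

  // Stores the reply together with its absolute expiry time.
  async set(key: string, response: string, ttlSeconds: number): Promise<void> {
    const expiresAt = this.now() + ttlSeconds * 1000;
    this.entries.set(key, { response, expiresAt });
  }
}
```

Unlike Redis, nothing evicts entries in the background here, so expired keys stay in the map until they are read again.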

API route integration

Wire the cache into your chat endpoint. On a cache hit the endpoint skips the upstream call and returns the stored reply with source: "cache"; on a miss it calls the model, stores the result for subsequent requests, and returns it with source: "fresh".

// app/api/chat/route.ts — cache-aware chat endpoint
import { buildCacheKey } from "@/lib/cache-key";
import { getCachedResponse, setCachedResponse } from "@/lib/redis";

export async function POST(req: Request) {
  const { model, messages } = await req.json();
  const key = buildCacheKey(model, messages);

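  const cached = await getCachedResponse(key);
  // Compare against null so an empty-string reply still counts as a hit.
  if (cached !== null) return Response.json({ reply: cached, source: "cache" });

  // callLLM is a placeholder for your upstream call; define or import it yourself.
  const reply = await callLLM(model, messages);
  await setCachedResponse(key, reply, 3600); // 1 h TTL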

  return Response.json({ reply, source: "fresh" });
}
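
The hit-or-miss flow in the route above can also be factored into a small helper that takes the cache and the upstream call as arguments, which makes it testable without Redis or a real model. This is a sketch under stated assumptions: CacheStore, CachedReply and withCache are hypothetical names, and the store interface simply mirrors the get and set shape of the wrapper functions.

```typescript
// Sketch only: CacheStore, CachedReply and withCache are hypothetical names.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

type CachedReply = { reply: string; source: "cache" | "fresh" };

export async function withCache(
  store: CacheStore,
  key: string,
  ttlSeconds: number,
  compute: () => Promise<string>
): Promise<CachedReply> {
  const cached = await store.get(key);
  // Compare against null so an empty-string reply still counts as a hit.
  if (cached !== null) return { reply: cached, source: "cache" };

  const reply = await compute(); // the upstream call happens only on a miss
  await store.set(key, reply, ttlSeconds);
  return { reply, source: "fresh" };
}
```

Note that two concurrent requests for the same uncached key will both call compute; this sketch makes no attempt to coalesce them.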

When to cache

Deterministic prompts: system messages, few-shot examples, or structured extraction schemas that repeat verbatim across requests.
High-traffic endpoints: public-facing chat widgets where many users submit identical queries. Near-identical queries hash to different keys and miss the cache unless they are normalized first.
Expensive models: caching a single GPT-4o or Claude Opus response can save cents per hit, which compounds quickly at scale.

Cache invalidation is handled entirely by TTL. For prompts that depend on real-time data, set a short TTL (e.g. 60 s) or skip caching entirely. The hash-based key guarantees that any change to the prompt — even a single whitespace difference — produces a fresh cache entry.
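
Because whitespace differences alone produce distinct keys, one way to raise the hit rate is to normalize message content before calling buildCacheKey. The following is a minimal sketch, assuming that collapsing whitespace does not change the meaning of your prompts (it would for code or preformatted text); normalizeMessages is a hypothetical helper, not part of the code above.

```typescript
// Sketch only: normalizeMessages is a hypothetical helper.
type Message = { role: string; content: string };

export function normalizeMessages(messages: Message[]): Message[] {
  return messages.map(({ role, content }) => ({
    role,
    // Collapse each run of whitespace to one space and trim both ends.
    content: content.replace(/\s+/g, " ").trim(),
  }));
}
```

If you hash the normalized messages, send the same normalized messages upstream as well, so that the cached reply corresponds exactly to the prompt that was keyed.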