LLM response caching
LLM calls are slow and expensive. Serving repeated prompts from a cache cuts latency on a hit from seconds to milliseconds and turns a per-call cost into a one-time write. This recipe shows how to wrap the Meridian gateway with a deterministic cache keyed on the normalized request body, falling back to live inference on a miss.
1. Normalize the request
Hash the model id, messages array, temperature, and tool list into a single SHA-256 digest. Strip request ids, timestamps, and trace headers so that semantically identical calls produce the same key. Treat any request with temperature greater than zero as uncacheable unless the caller opts in with a seed.
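A minimal sketch of the normalization step, assuming a request body with model, messages, temperature, tools, and an optional seed field. The helper names canonicalize, isCacheable, and normalizeRequest are hypothetical and not part of the Meridian gateway. Including the seed in the key and treating a missing temperature as uncacheable are also assumptions. Object keys are sorted recursively so that two bodies differing only in property order serialize identically. The returned string is what would be fed to the SHA-256 hash.

```javascript
// Recursively sort object keys so property order cannot change the output.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const sorted = {};
    for (const k of Object.keys(value).sort()) sorted[k] = canonicalize(value[k]);
    return sorted;
  }
  return value;
}

// Cache only explicit temperature 0, or any request that opts in with a seed.
// Assumption: a missing temperature is uncacheable, since the upstream
// default is unknown here.
function isCacheable(req) {
  return req.temperature === 0 || req.seed !== undefined;
}

// Keep only the fields that affect the completion. Request ids, timestamps,
// and trace headers are dropped simply by never being copied.
function normalizeRequest(req) {
  const { model, messages, temperature, tools, seed } = req;
  return JSON.stringify(canonicalize({ model, messages, temperature, tools, seed }));
}
```

Selecting fields by allowlist rather than deleting known noise fields means a new tracing field added upstream cannot silently fragment the cache.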
2. Read-through with a short TTL
Use Redis or Upstash KV with a 24-hour TTL. On hit, return the cached completion with a header marking the cache layer. On miss, fall through to the model, await the response, and write it back before returning. Streaming responses are buffered in memory and flushed atomically on completion to avoid poisoning the cache with partial output.
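The streaming rule can be sketched as follows. This is one possible reading of it, not the gateway's actual code: chunks are relayed to the caller as they arrive, buffered alongside, and written to the cache in a single step only after the upstream stream ends cleanly. The Map-backed store is an in-memory stand-in for Redis or Upstash KV, and the names cacheGet, cacheSet, and streamThroughCache are hypothetical.

```javascript
// In-memory stand-in for the KV store, with per-entry expiry (illustration only).
const store = new Map();
const TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

function cacheGet(key) {
  const entry = store.get(key);
  if (!entry) return undefined;
  if (entry.expiresAt <= Date.now()) {
    store.delete(key); // expired: evict and report a miss
    return undefined;
  }
  return entry.value;
}

function cacheSet(key, value) {
  store.set(key, { value, expiresAt: Date.now() + TTL_MS });
}

// openUpstreamStream is an assumed callback returning an async iterable of
// string chunks from the model.
async function* streamThroughCache(key, openUpstreamStream) {
  const hit = cacheGet(key);
  if (hit !== undefined) {
    yield hit; // replay the stored completion as a single chunk
    return;
  }
  const chunks = [];
  for await (const chunk of openUpstreamStream()) {
    chunks.push(chunk);
    yield chunk;
  }
  // Reached only on clean completion. An upstream error, or a consumer that
  // stops reading early, leaves the loop before this line, so partial output
  // is never stored.
  cacheSet(key, chunks.join(''));
}
```

Buffering holds the full completion in memory for the life of the request, so a real deployment would likely cap the buffer size and skip the cache write for oversized responses.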
3. Wire the gateway middleware
The cache layer plugs in as middleware ahead of the upstream router. The example below shows the minimal handler used in production at llm.getnimbus.net.
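import { createHash } from 'crypto';
import { kv } from '@vercel/kv';

// callUpstream(req) is assumed to be defined elsewhere; it forwards the
// request to the model and resolves with the completion object.
export async function cachedCompletion(req) {
  // Hash only the fields that affect the output, in a fixed order, so that
  // request ids, timestamps, and trace headers do not change the key.
  const { model, messages, temperature, tools } = req;
  const key = createHash('sha256')
    .update(JSON.stringify([model, messages, temperature, tools]))
    .digest('hex');

  const hit = await kv.get(key);
  if (hit) return { ...hit, cache: 'HIT' };

  // Miss: call the model, then write back before returning.
  const fresh = await callUpstream(req);
  await kv.set(key, fresh, { ex: 86400 }); // TTL in seconds (24 hours)
  return { ...fresh, cache: 'MISS' };
}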
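To try the read-through flow without a KV account or a live model, the same logic can be written with its dependencies injected. This is a sketch under stated assumptions: makeCachedCompletion, memoryStore, and the keyOf parameter are hypothetical names, and the Map-backed store only mimics the get/set shape used above while ignoring the TTL option.

```javascript
// Build a handler from an injected store, upstream function, and key function.
function makeCachedCompletion({ store, upstream, keyOf, ttlSeconds = 86400 }) {
  return async function cachedCompletion(req) {
    const key = keyOf(req);
    const hit = await store.get(key);
    if (hit) return { ...hit, cache: 'HIT' };

    const fresh = await upstream(req);
    await store.set(key, fresh, { ex: ttlSeconds });
    return { ...fresh, cache: 'MISS' };
  };
}

// Map-backed stand-in for the KV client. It returns null on a miss and
// ignores the expiry option, which is enough for a local check.
function memoryStore() {
  const m = new Map();
  return {
    get: async (key) => m.get(key) ?? null,
    set: async (key, value) => { m.set(key, value); },
  };
}
```

With a stub upstream that counts its invocations, two identical requests should yield MISS and then HIT, with the upstream called exactly once.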