Model Router Design Patterns
A model router sits in front of N upstream LLMs and decides which one to call for each request. Done well, it can cut cost by 40-70%, improve median (p50) latency, and hide most vendor outages from callers. This recipe covers the three patterns Meridian uses in production.
1. Static tier routing
The simplest pattern. Classify each incoming prompt by token budget and required capability, then route to a fixed tier: cheap, balanced, or frontier. Works well when traffic is predictable and you have clear SLAs per use case.
Best for: bulk classification, summarization, low-stakes generation.
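The classification step can be sketched as follows. This is a minimal illustration, not Meridian's classifier: the tier limits, the capability names, the four-characters-per-token estimate, and the helper names classifyTier and estimateTokens are all assumptions.

```javascript
// Hypothetical tier table, ordered cheapest first. Limits and capability
// names are placeholders, not values from a real deployment.
const TIERS = [
  { name: 'cheap', maxTokens: 2000, capabilities: new Set(['classify', 'summarize']) },
  { name: 'balanced', maxTokens: 16000, capabilities: new Set(['classify', 'summarize', 'generate']) },
  { name: 'frontier', maxTokens: Infinity, capabilities: new Set(['classify', 'summarize', 'generate', 'reason']) },
];

// Rough heuristic (assumption): about four characters per token.
function estimateTokens(prompt) {
  return Math.ceil(prompt.length / 4);
}

// Return the name of the cheapest tier that fits the token budget
// and offers the required capability.
function classifyTier(prompt, capability) {
  const tokens = estimateTokens(prompt);
  const tier = TIERS.find(
    (t) => tokens <= t.maxTokens && t.capabilities.has(capability)
  );
  if (!tier) throw new Error(`no tier supports capability: ${capability}`);
  return tier.name;
}
```

Because the table is ordered cheapest first, a request lands on a more expensive tier only when the budget or the capability requirement forces it there.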
2. Adaptive scoring
Score every model on a rolling window of latency, error rate, and per-token cost. Pick the highest-scoring model that meets the request's capability floor. Re-score every 60 seconds. Because scores track live latency and error rate, a degraded vendor drops in rank and traffic shifts away from it without operator intervention.
Best for: mixed-workload APIs with strict p99 latency requirements.
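One way to sketch the scoring loop is shown below. The weights, the penalty formula, the numeric capability level, and the names ModelStats and pickModel are assumptions chosen for illustration; the source does not specify how the three signals are combined.

```javascript
// Hypothetical weights: penalty per second of mean latency, per unit of
// error rate (0..1), and per dollar of cost per 1k tokens.
const WEIGHTS = { latencySec: 1, errorRate: 10, costPer1k: 5 };
const WINDOW_MS = 60_000; // rolling window, matching the 60-second re-score

class ModelStats {
  constructor(name, capability, costPer1kTokens) {
    this.name = name;
    this.capability = capability; // ordinal level, higher is stronger (assumed)
    this.costPer1kTokens = costPer1kTokens;
    this.samples = []; // entries of { at, latencyMs, ok }
  }

  // Record the outcome of one upstream call.
  record(latencyMs, ok, now = Date.now()) {
    this.samples.push({ at: now, latencyMs, ok });
  }

  // Drop samples older than the window, then score; higher is better.
  score(now = Date.now()) {
    this.samples = this.samples.filter((s) => now - s.at <= WINDOW_MS);
    const n = this.samples.length;
    const meanLatencySec = n
      ? this.samples.reduce((sum, s) => sum + s.latencyMs, 0) / n / 1000
      : 0;
    const errorRate = n ? this.samples.filter((s) => !s.ok).length / n : 0;
    const penalty =
      WEIGHTS.latencySec * meanLatencySec +
      WEIGHTS.errorRate * errorRate +
      WEIGHTS.costPer1k * this.costPer1kTokens;
    return 1 / (1 + penalty);
  }
}

// Highest-scoring model at or above the capability floor, or null if none.
function pickModel(models, capabilityFloor, now = Date.now()) {
  const eligible = models.filter((m) => m.capability >= capabilityFloor);
  if (eligible.length === 0) return null;
  return eligible.reduce((best, m) => (m.score(now) > best.score(now) ? m : best));
}
```

This sketch computes scores lazily on every pick. A router under real load would more likely cache the scores and refresh them on a 60-second timer, as the pattern describes.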
3. Speculative cascade
Send the prompt to a cheap model first. If the response passes a quality gate (length, format, confidence), return it. Otherwise, escalate to a frontier model. Escalated requests pay for both calls and wait on both, so the pattern reduces average cost dramatically only when 60% or more of prompts are answerable by the cheap tier.
Best for: agentic loops where most steps are trivial.
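A minimal sketch of the cascade, under stated assumptions: callModel(tier, req) is a hypothetical async client that resolves to an object with text and confidence fields, and the gate thresholds are placeholders. How a real upstream reports confidence, if it does at all, varies by vendor.

```javascript
// Quality gate covering the three checks named above: length, format
// (optional JSON parse), and confidence. Thresholds are assumed defaults.
function passesGate(res, { minLength = 1, minConfidence = 0.7, requireJson = false } = {}) {
  if (typeof res.text !== 'string' || res.text.trim().length < minLength) {
    return false;
  }
  if (requireJson) {
    try {
      JSON.parse(res.text);
    } catch {
      return false;
    }
  }
  // A missing confidence value is treated as zero, which forces escalation.
  return (res.confidence ?? 0) >= minConfidence;
}

// Try the cheap tier first; escalate to the frontier tier if the gate fails.
// callModel is injected so the cascade can be exercised without real upstreams.
async function cascade(req, callModel, gate = {}) {
  const cheap = await callModel('cheap', req);
  if (passesGate(cheap, gate)) return { ...cheap, tier: 'cheap' };
  const strong = await callModel('frontier', req);
  return { ...strong, tier: 'frontier' };
}
```

Returning the tier alongside the response makes it easy to measure the escalation rate, which is the number that determines whether the cascade is saving money.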
Example: tier router
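// classify, MODELS, rankByScore, call, and isRetryable are assumed to be
// defined elsewhere in the router.
async function route(req) {
  const tier = classify(req.prompt);
  const candidates = MODELS[tier];
  let lastError;
  // Try candidates best score first; move to the next one on a retryable error.
  for (const m of rankByScore(candidates)) {
    try {
      return await call(m, req);
    } catch (e) {
      // A non-retryable error is rethrown at once instead of trying other upstreams.
      if (!isRetryable(e)) throw e;
      lastError = e;
    }
  }
  // Attach the last upstream error as the cause so it is not lost.
  throw new Error('all upstreams failed', { cause: lastError });
}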