Router pattern
Route each request to the smallest model that can handle it, and cut inference cost by roughly 50% without a noticeable drop in output quality.
How it works
1. Classify: A lightweight classifier inspects the prompt and tags it as simple, moderate, or complex.
2. Route: Simple prompts go to a fast, cheap model (gpt-4o-mini), moderate prompts go to a mid-tier model (gpt-4o), and complex prompts escalate to a frontier model.
3. Respond: The user gets output of comparable quality. You pay roughly 50% less because most prompts never touch the expensive model.
Example classifier prompt
You are a prompt classifier. Given a user message, respond
with exactly one word: SIMPLE, MODERATE, or COMPLEX.
SIMPLE — factual lookup, translation, summarization,
grammar fix, short email.
MODERATE — code review, debugging, data analysis,
multi-step reasoning.
COMPLEX — architecture design, research synthesis,
creative writing, long-form content.
User message: "Fix the typo in this sentence."
Response: SIMPLE
Routing table
| Label | Model | Cost / 1M input tokens | Typical latency |
|---|---|---|---|
| SIMPLE | gpt-4o-mini | $0.15 | ~400ms |
| MODERATE | gpt-4o | $2.50 | ~1.2s |
| COMPLEX | claude-3-opus | $15.00 | ~3.5s |
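The table above can be encoded as a small lookup, paired with a defensive parser for the classifier's one-word reply. This is a minimal sketch, not the author's implementation: the names `Route`, `ROUTES`, and `parseLabel` are hypothetical, and treating any unrecognized reply as COMPLEX is an assumption made for illustration. Prices and latencies are copied from the table.

```typescript
type Label = "SIMPLE" | "MODERATE" | "COMPLEX";

interface Route {
  model: string;
  costPerMTokUsd: number; // USD per 1M tokens, from the routing table
  typicalLatencyMs: number; // approximate latency, from the routing table
}

const ROUTES: Record<Label, Route> = {
  SIMPLE: { model: "gpt-4o-mini", costPerMTokUsd: 0.15, typicalLatencyMs: 400 },
  MODERATE: { model: "gpt-4o", costPerMTokUsd: 2.5, typicalLatencyMs: 1200 },
  COMPLEX: { model: "claude-3-opus", costPerMTokUsd: 15.0, typicalLatencyMs: 3500 },
};

// Hypothetical helper: normalize the classifier's raw reply to a Label.
// Assumption: anything unrecognized is treated as COMPLEX, so a malformed
// reply never sends a hard prompt to the smallest model.
function parseLabel(raw: string): Label {
  // Tolerate stray whitespace, lowercase, and trailing punctuation.
  const word = raw.trim().toUpperCase().replace(/[^A-Z]/g, "");
  if (word === "SIMPLE" || word === "MODERATE" || word === "COMPLEX") {
    return word;
  }
  return "COMPLEX";
}

const route = ROUTES[parseLabel(" simple.\n")];
console.log(route.model); // prints "gpt-4o-mini"
```

Failing toward the most capable model trades some cost for safety. A deployment that cares more about spend could fall back to MODERATE instead.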
Cost breakdown
In production workloads, roughly 70% of prompts classify as SIMPLE, 20% as MODERATE, and 10% as COMPLEX. Routing the SIMPLE majority to gpt-4o-mini instead of a frontier model cuts the average per-request cost by about half and keeps latency low for most users. The exact saving depends on the baseline model being replaced and on how many tokens each class consumes.
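The arithmetic behind a blended price can be checked with a few lines of code. The sketch below uses the traffic shares from this section and the per-1M-token prices from the routing table. It assumes every class consumes the same average number of tokens per request and ignores the classifier's own cost, and the names `tiers`, `blendedCost`, and `savingVs` are hypothetical. Under those assumptions the blended price is about $2.1 per 1M tokens, so the percentage saved varies widely with the baseline: it is much larger against an all-claude-3-opus baseline than against an all-gpt-4o one.

```typescript
// Traffic shares from the text, prices (USD per 1M tokens) from the routing table.
const tiers = [
  { label: "SIMPLE", share: 0.7, costPerMTokUsd: 0.15 },
  { label: "MODERATE", share: 0.2, costPerMTokUsd: 2.5 },
  { label: "COMPLEX", share: 0.1, costPerMTokUsd: 15.0 },
];

// Blended price = sum over tiers of (traffic share * price).
// Assumption: each tier sees the same average token count per request.
function blendedCost(mix: { share: number; costPerMTokUsd: number }[]): number {
  return mix.reduce((sum, t) => sum + t.share * t.costPerMTokUsd, 0);
}

// Fractional saving relative to sending all traffic to one baseline model.
function savingVs(baselineCostUsd: number, blendedUsd: number): number {
  return 1 - blendedUsd / baselineCostUsd;
}

const blended = blendedCost(tiers);
console.log(`blended: $${blended.toFixed(3)} per 1M tokens`);
console.log(`saving vs. all claude-3-opus: ${(savingVs(15.0, blended) * 100).toFixed(0)}%`);
console.log(`saving vs. all gpt-4o: ${(savingVs(2.5, blended) * 100).toFixed(0)}%`);
```

Swapping in measured traffic shares and per-class token counts is what turns this estimate into a number specific to a given workload.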
Implementation sketch
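// classify() and the three model clients (gpt4oMini, gpt4o, claudeOpus)
// are assumed to be defined elsewhere.
async function routePrompt(userMessage: string) {
  // Ask the lightweight classifier for SIMPLE, MODERATE, or COMPLEX.
  const label = await classify(userMessage);
  switch (label) {
    case "SIMPLE":
      return gpt4oMini.chat(userMessage);
    case "MODERATE":
      return gpt4o.chat(userMessage);
    case "COMPLEX":
    default:
      // An unrecognized label escalates to the frontier model
      // instead of silently returning undefined.
      return claudeOpus.chat(userMessage);
  }
}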
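To run the sketch end to end, the version below supplies stand-ins for the pieces the snippet above leaves undefined. Everything in it is hypothetical: `ChatModel`, `makeStub`, the `models` map, and the keyword-based `classify` are placeholders so the example runs offline, not real SDK calls. In practice `classify` would send the classifier prompt to a small model, and each `ChatModel` would wrap a provider client.

```typescript
type Label = "SIMPLE" | "MODERATE" | "COMPLEX";

// Hypothetical minimal interface; real provider SDKs expose different methods.
interface ChatModel {
  chat(message: string): Promise<string>;
}

// Stub that echoes which model handled the message, standing in for a real client.
function makeStub(name: string): ChatModel {
  return { chat: async (message) => `[${name}] ${message}` };
}

const models: Record<Label, ChatModel> = {
  SIMPLE: makeStub("gpt-4o-mini"),
  MODERATE: makeStub("gpt-4o"),
  COMPLEX: makeStub("claude-3-opus"),
};

// Placeholder classifier: a keyword heuristic so the example runs offline.
async function classify(message: string): Promise<Label> {
  const text = message.toLowerCase();
  if (/\b(architecture|research|essay)\b/.test(text)) return "COMPLEX";
  if (/\b(debug|review|analy[sz]e)\b/.test(text)) return "MODERATE";
  return "SIMPLE";
}

// Classify first, then dispatch to the model registered for that label.
async function routePrompt(message: string): Promise<string> {
  const label = await classify(message);
  return models[label].chat(message);
}

async function main(): Promise<void> {
  console.log(await routePrompt("Fix the typo in this sentence."));
  console.log(await routePrompt("Debug this stack trace."));
  console.log(await routePrompt("Propose an architecture for a chat service."));
}

main();
```

Keeping the models behind one interface means the routing logic does not change when a provider or model name is swapped. Only the `models` map needs updating.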