Recipe: Question → best model router
Classify every incoming prompt and route it to the cheapest model that can still answer it correctly.
Step 1 — Classify the intent
Run a fast classifier (e.g. a fine-tuned BERT variant, or a small LLM used zero-shot with a structured prompt) that tags each question with exactly one of: factual, reasoning, creative, or code, and returns a confidence score for that tag.
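
A minimal sketch of the small-LLM variant, assuming the model is asked to reply with JSON. The `complete` callable is a hypothetical stand-in for whatever client sends a prompt and returns the reply text; the prompt wording and the `"unknown"` label are also assumptions, not part of the recipe.

```python
import json
from typing import Callable, Tuple

INTENTS = ("factual", "reasoning", "creative", "code")

# Hypothetical prompt wording; tune it for the model you actually use.
PROMPT_TEMPLATE = (
    "Classify the user question into exactly one intent.\n"
    "Allowed intents: {intents}.\n"
    'Reply with JSON only, e.g. {{"intent": "factual", "confidence": 0.9}}.\n\n'
    "Question: {question}"
)


def classify(question: str, complete: Callable[[str], str]) -> Tuple[str, float]:
    """Return (intent, confidence); malformed replies yield ("unknown", 0.0)."""
    prompt = PROMPT_TEMPLATE.format(intents=", ".join(INTENTS), question=question)
    raw = complete(prompt)
    try:
        data = json.loads(raw)
        intent = data["intent"]
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        # Not JSON, missing keys, or a non-numeric confidence.
        return "unknown", 0.0
    if intent not in INTENTS:
        return "unknown", 0.0
    # Clamp so downstream threshold checks see a value in [0, 1].
    return intent, min(max(confidence, 0.0), 1.0)


def fake_complete(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return '{"intent": "code", "confidence": 0.92}'


print(classify("How do I reverse a list in Python?", fake_complete))
```

Returning `("unknown", 0.0)` instead of raising keeps a malformed reply from failing the request; the caller can treat it like any other low-confidence result.
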
Step 2 — Map intent to model tier
| Intent | Model | Why |
|---|---|---|
| factual | Haiku / 4o-mini | Lowest latency, high accuracy on retrieval |
| reasoning | Sonnet / 4o | Multi-step logic needs depth |
| creative | Opus / 4o | Nuance and tone matter |
| code | Sonnet / 4o | Strong structured output |
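
The table can be expressed as a plain lookup. This is a sketch only: the `Tier` names are hypothetical labels, and mapping each tier to a concrete provider model ID is left to your own configuration.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # e.g. Haiku / 4o-mini in the table
    MID = "mid"      # e.g. Sonnet / 4o
    LARGE = "large"  # e.g. Opus / 4o


# One entry per row of the table.
TIER_FOR_INTENT = {
    "factual": Tier.SMALL,
    "reasoning": Tier.MID,
    "creative": Tier.LARGE,
    "code": Tier.MID,
}


def tier_for(intent: str) -> Tier:
    """Look up the tier for an intent; unknown intents raise ValueError."""
    try:
        return TIER_FOR_INTENT[intent]
    except KeyError:
        raise ValueError(f"unknown intent: {intent!r}") from None


print(tier_for("factual").value)  # small
```

Keeping the mapping as data rather than branching logic means a tier change is a one-line edit that can be reviewed and rolled back on its own.
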
Step 3 — Add a fallback
If the classifier's confidence is below your threshold (e.g. 0.7), or the classifier fails or returns a label you do not recognize, route to your mid-tier model. Never drop a request; degrade gracefully.
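
A minimal sketch of the fallback rule, assuming tiers are plain strings and the mid tier is labeled `"mid"`. The function name and parameters are illustrative.

```python
from typing import Mapping


def route(
    intent: str,
    confidence: float,
    tiers: Mapping[str, str],
    threshold: float = 0.7,
    fallback: str = "mid",
) -> str:
    """Pick a tier; low confidence or an unknown intent goes to the fallback."""
    # Written as `not >=` so a NaN confidence also falls back.
    if not (confidence >= threshold):
        return fallback
    return tiers.get(intent, fallback)


tiers = {"factual": "small", "reasoning": "mid", "creative": "large", "code": "mid"}

print(route("factual", 0.93, tiers))   # small
print(route("creative", 0.55, tiers))  # mid (below threshold)
print(route("poetry", 0.99, tiers))    # mid (unknown intent)
```

Every input produces some tier, so no path in the function can drop a request.
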
Step 4 — Measure and tighten
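Log every route decision alongside a human-eval or auto-eval score. Review the confidence threshold and the intent-to-tier mapping on a fixed cadence (e.g. weekly) and adjust them based on those scores. The goal: push as much volume as possible to the cheapest model without regressing quality.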
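
One way to turn the logs into a review signal is to average eval scores per (intent, tier) route and flag routes that fall below a baseline. This is a sketch under assumptions: the `RouteLog` fields, the 0-to-1 score scale, and the `tolerance` value are all illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Route = Tuple[str, str]  # (intent, tier)


@dataclass
class RouteLog:
    intent: str
    tier: str
    confidence: float
    eval_score: float  # assumed to be in [0, 1], from human eval or auto-eval


def quality_by_route(logs: Iterable[RouteLog]) -> Dict[Route, float]:
    """Mean eval score for each (intent, tier) pair seen in the logs."""
    buckets: Dict[Route, List[float]] = defaultdict(list)
    for log in logs:
        buckets[(log.intent, log.tier)].append(log.eval_score)
    return {key: mean(scores) for key, scores in buckets.items()}


def regressions(
    logs: Iterable[RouteLog], baseline: Dict[Route, float], tolerance: float = 0.02
) -> List[Route]:
    """Routes whose mean score is more than `tolerance` below their baseline."""
    current = quality_by_route(logs)
    return [
        key
        for key, score in current.items()
        if key in baseline and score < baseline[key] - tolerance
    ]


logs = [
    RouteLog("factual", "small", 0.90, 1.0),
    RouteLog("factual", "small", 0.80, 0.5),
    RouteLog("code", "mid", 0.95, 1.0),
]
baseline = {("factual", "small"): 0.9, ("code", "mid"): 0.9}
print(regressions(logs, baseline))  # [('factual', 'small')]
```

A flagged route is a candidate for moving up a tier or raising the confidence threshold; a route that holds steady is a candidate for trying a cheaper tier.
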
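Pro tip: Cache classification results for identical prompts, keyed by a hash of the prompt, so you do not pay for classification twice. A Bloom filter on its own only answers "have I seen this?" (with false positives) and cannot return the stored label, so use it at most as a pre-check in front of the cache. At scale, the classifier itself becomes a cost center; treat it like one.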
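
A minimal sketch of caching classification results, assuming an in-memory dictionary keyed by a SHA-256 digest of the prompt. A dictionary is used here, rather than a membership-only structure, because the cache has to return the stored label. The class name is hypothetical, and the cache is unbounded; a real deployment would likely want eviction (LRU or TTL) or a shared store.

```python
import hashlib
from typing import Callable, Dict, Tuple

Label = Tuple[str, float]  # (intent, confidence)


class CachedClassifier:
    """Wraps a classifier so identical prompts are classified only once."""

    def __init__(self, classify: Callable[[str], Label]) -> None:
        self._classify = classify
        self._cache: Dict[str, Label] = {}
        self.misses = 0  # number of real classifier calls made

    @staticmethod
    def _key(prompt: str) -> str:
        # Assumption: prompts differing only in surrounding whitespace are identical.
        return hashlib.sha256(prompt.strip().encode("utf-8")).hexdigest()

    def __call__(self, prompt: str) -> Label:
        key = self._key(prompt)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._classify(prompt)
        return self._cache[key]


def stub_classify(prompt: str) -> Label:
    """Stand-in for the real (paid) classifier call."""
    return ("factual", 0.9)


cached = CachedClassifier(stub_classify)
cached("What is the capital of France?")
cached("What is the capital of France?  ")  # same key after strip()
print(cached.misses)  # 1
```

Tracking `misses` gives a cheap way to see how many classifier calls the cache actually saves.
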