Recipe
Cost-aware model router
Route prompts to the cheapest capable model using a lightweight classifier.
Overview
Not every prompt needs a frontier model. This recipe deploys a tiny classifier that inspects each prompt and selects the cheapest model that can handle it, which can cut inference costs by up to 70% without degrading quality.
Tiers
- Simple: Haiku / 4o-mini, ~$0.25 per 1M tokens
- Moderate: Sonnet / 4o, ~$3 per 1M tokens
- Complex: Opus / o1, ~$15 per 1M tokens
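To see how the tier prices above turn into savings, the sketch below computes a blended price per 1M tokens for a given traffic mix and compares it with sending everything to the Complex tier. The traffic mix is hypothetical, the function names are illustrative, and the calculation assumes each tier's share is measured in tokens rather than in requests.

```python
# Approximate tier prices from the list above, in USD per 1M tokens.
PRICE_PER_MTOK = {"simple": 0.25, "moderate": 3.00, "complex": 15.00}


def blended_price(mix: dict[str, float]) -> float:
    """Average USD per 1M tokens, given each tier's share of token volume."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(PRICE_PER_MTOK[tier] * share for tier, share in mix.items())


def savings_vs_top_tier(mix: dict[str, float]) -> float:
    """Fractional saving compared with routing all traffic to the Complex tier."""
    return 1.0 - blended_price(mix) / PRICE_PER_MTOK["complex"]


# Hypothetical mix for illustration only; measure your own in production.
example_mix = {"simple": 0.5, "moderate": 0.3, "complex": 0.2}

if __name__ == "__main__":
    print(f"blended: ${blended_price(example_mix):.3f} per 1M tokens")
    print(f"savings vs. all-Complex: {savings_vs_top_tier(example_mix):.1%}")
```

The actual saving depends entirely on how much traffic the classifier can safely send to the cheaper tiers, so it is worth rerunning this arithmetic with logged production shares.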
Flow
- User prompt arrives at the router endpoint.
- Classifier scores complexity (0–1) via a distilled BERT variant.
- Thresholds map the score to a tier and model.
- Prompt is forwarded; response streams back to the client.
- Cost and latency metrics are logged for tuning.
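The flow above can be sketched as a small routing function. This is a minimal sketch under stated assumptions, not the recipe's actual implementation: the model identifiers are placeholders, `classify` and `call_model` are hypothetical callables you would supply (the classifier and the provider client), the boundary handling at exactly 0.3 and 0.7 is a choice made here, and streaming is left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tier:
    name: str
    model: str           # placeholder identifier, not a real API model name
    usd_per_mtok: float  # approximate price from the Tiers section


TIERS = (
    Tier("simple", "haiku", 0.25),
    Tier("moderate", "sonnet", 3.00),
    Tier("complex", "opus", 15.00),
)


def pick_tier(score: float, low: float = 0.3, high: float = 0.7) -> Tier:
    """Map a complexity score in [0, 1] to a tier using two thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < low:
        return TIERS[0]
    if score < high:
        return TIERS[1]
    return TIERS[2]


def route(
    prompt: str,
    classify: Callable[[str], float],       # hypothetical: prompt -> score
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> text
) -> dict:
    """Score the prompt, forward it to the chosen model, and return metrics."""
    score = classify(prompt)
    tier = pick_tier(score)
    started = time.perf_counter()
    response = call_model(tier.model, prompt)
    latency_s = time.perf_counter() - started
    # In a real deployment this record would go to a metrics store for tuning.
    return {
        "tier": tier.name,
        "model": tier.model,
        "score": score,
        "latency_s": latency_s,
        "response": response,
    }
```

Passing the classifier and model client in as callables keeps the routing logic testable with stubs, for example `route("hi", lambda p: 0.1, lambda m, p: "ok")`.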
Threshold tuning
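Start with conservative thresholds (0.3 / 0.7): scores below 0.3 go to Simple, scores above 0.7 go to Complex, and everything in between goes to Moderate. Adjust the thresholds based on production evals. Monitor the percentage of prompts hitting each tier, and spot-check a sample daily to confirm the classifier is not underestimating complexity and sending hard prompts to a tier that is too cheap.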