
Cost-aware model router

Route prompts to the cheapest capable model using a lightweight classifier.

Overview

Not every prompt needs a frontier model. This recipe deploys a small classifier that inspects each prompt and selects the cheapest model that can handle it, which can cut inference costs by up to 70% without degrading quality.

Tiers

Simple: Haiku / 4o-mini, ~$0.25 per 1M tokens
Moderate: Sonnet / 4o, ~$3 per 1M tokens
Complex: Opus / o1, ~$15 per 1M tokens
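To make the tier prices concrete, here is a minimal Python sketch that computes the blended cost of a traffic mix. The per-tier prices come from the tier list above. The `Tier` class, the `blended_cost` helper, and the 40/35/25 split in the usage lines are hypothetical and chosen only for illustration, not measured production data. The sketch also simplifies by using one rate per tier, whereas providers typically price input and output tokens differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    models: tuple[str, ...]
    usd_per_1m_tokens: float  # approximate price from the tier list


TIERS = (
    Tier("simple", ("Haiku", "4o-mini"), 0.25),
    Tier("moderate", ("Sonnet", "4o"), 3.0),
    Tier("complex", ("Opus", "o1"), 15.0),
)
PRICE = {t.name: t.usd_per_1m_tokens for t in TIERS}


def blended_cost(mix: dict[str, float], tokens: int) -> float:
    """USD cost of `tokens` tokens split across tiers by `mix` (fractions summing to 1)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(frac * PRICE[name] * tokens / 1_000_000 for name, frac in mix.items())


# Hypothetical traffic split, not measured data.
mix = {"simple": 0.40, "moderate": 0.35, "complex": 0.25}
routed = blended_cost(mix, 100_000_000)
all_complex = blended_cost({"complex": 1.0}, 100_000_000)
print(f"routed ${routed:.2f} vs all-complex ${all_complex:.2f} "
      f"({1 - routed / all_complex:.0%} saved)")
```

The actual savings depend entirely on the mix your classifier produces in production; plug in the tier shares from your own logs.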

Flow

1. The user's prompt arrives at the router endpoint.
2. The classifier, a distilled BERT variant, scores the prompt's complexity from 0 to 1.
3. Thresholds map the score to a tier and its model.
4. The prompt is forwarded to that model, and the response streams back to the client.
5. Cost and latency metrics are logged for threshold tuning.
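The flow above can be sketched as a single Python generator. This is a minimal illustration under stated assumptions, not the recipe's implementation: `toy_score` is a crude word-count stand-in for the distilled BERT classifier, `TIER_MODELS` uses placeholder model names, and `call_model` is a hypothetical callback that would wrap your provider's streaming API. The thresholds are the starting values from the tuning section, and a score exactly at a threshold goes to the higher tier.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterator

# Starting thresholds: below LOW -> simple, LOW up to HIGH -> moderate,
# HIGH and above -> complex.
LOW, HIGH = 0.3, 0.7

# Placeholder model names per tier; substitute your provider's real model IDs.
TIER_MODELS = {"simple": "haiku", "moderate": "sonnet", "complex": "opus"}


def pick_tier(score: float, low: float = LOW, high: float = HIGH) -> str:
    """Map a complexity score in [0, 1] to a tier name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score < low:
        return "simple"
    if score < high:
        return "moderate"
    return "complex"


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for the distilled BERT classifier: longer prompts
    score higher. Not a real complexity model."""
    return min(len(prompt.split()) / 200, 1.0)


@dataclass
class RouteLog:
    """In-memory metrics sink; a real router would ship these to a metrics store."""
    records: list[dict] = field(default_factory=list)


def route(prompt: str,
          score_fn: Callable[[str], float],
          call_model: Callable[[str, str], Iterator[str]],
          log: RouteLog) -> Iterator[str]:
    """Score the prompt, pick a tier, stream the chosen model's reply, then log."""
    start = time.perf_counter()
    score = score_fn(prompt)
    tier = pick_tier(score)
    model = TIER_MODELS[tier]
    for chunk in call_model(model, prompt):
        yield chunk  # stream each chunk to the client as it arrives
    # Only latency is measured here; cost would need the provider's token counts.
    log.records.append({"tier": tier, "model": model, "score": score,
                        "latency_s": time.perf_counter() - start})


# Usage with a fake streaming model in place of a real API call.
def fake_model(model: str, prompt: str) -> Iterator[str]:
    yield f"[{model}] "
    yield "ok"


log = RouteLog()
print("".join(route("What is 2 + 2?", toy_score, fake_model, log)))  # [haiku] ok
```

Because `route` is a generator, the metrics record is appended only after the client has consumed the whole stream. A production router would also record the provider's reported token counts so cost can be computed per request.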

Threshold tuning

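Start with conservative thresholds of 0.3 and 0.7 (scores below 0.3 route to Simple, 0.3 to 0.7 to Moderate, and above 0.7 to Complex), then adjust them based on production evals. Monitor the share of prompts hitting each tier, and spot-check a sample daily to make sure the classifier is not underestimating complexity, that is, sending hard prompts to a tier too weak to handle them.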
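For the monitoring described above, a minimal Python sketch might compute tier shares from logged requests and draw a daily review sample. It assumes each log record is a dict with a `tier` key; the helper names are hypothetical. Drawing the sample only from the Simple and Moderate tiers is a design choice, not part of the original recipe: those are the tiers where an underestimated prompt would land.

```python
import random
from collections import Counter
from typing import Optional


def tier_shares(records: list[dict]) -> dict[str, float]:
    """Fraction of routed prompts that landed in each tier."""
    counts = Counter(r["tier"] for r in records)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()} if total else {}


def spot_check_sample(records: list[dict], k: int,
                      tiers: tuple[str, ...] = ("simple", "moderate"),
                      seed: Optional[int] = None) -> list[dict]:
    """Random sample of up to k records from the cheaper tiers, where an
    underestimated prompt would end up, for a human to review."""
    pool = [r for r in records if r["tier"] in tiers]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


# Usage with made-up records: 6 simple, 3 moderate, 1 complex.
records = ([{"tier": "simple"}] * 6 + [{"tier": "moderate"}] * 3
           + [{"tier": "complex"}])
print(tier_shares(records))  # {'simple': 0.6, 'moderate': 0.3, 'complex': 0.1}
```

Tracking these shares day over day makes drift visible: a sudden jump in the Simple share is a cue to pull a larger spot-check sample before trusting the savings.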
