Recipe

Cost-aware model router

Route prompts to the cheapest capable model using a lightweight classifier.

Overview

Not every prompt needs a frontier model. This recipe deploys a small classifier that inspects each prompt and routes it to the cheapest model able to handle it, which can cut inference costs by up to 70% without degrading quality.

Tiers

Tier       Models             Approx. price
Simple     Haiku / 4o-mini    ~$0.25 per 1M tokens
Moderate   Sonnet / 4o        ~$3 per 1M tokens
Complex    Opus / o1          ~$15 per 1M tokens
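
To see how the tier prices listed above turn into savings, a rough blended-cost calculation can help. The sketch below is illustrative only: the traffic mix is a hypothetical assumption, not measured data, and it treats each tier as having a single price per 1M tokens (real pricing usually differs for input and output tokens). The names `PRICE_PER_MTOK` and `blended_cost` are made up for this example.

```python
# Approximate per-tier prices from the tier list, in USD per 1M tokens.
PRICE_PER_MTOK = {"simple": 0.25, "moderate": 3.00, "complex": 15.00}


def blended_cost(mix):
    """Return the average USD cost per 1M tokens for a traffic mix.

    `mix` maps a tier name to its share of tokens; shares must sum to 1.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"traffic shares must sum to 1, got {total}")
    return sum(PRICE_PER_MTOK[tier] * share for tier, share in mix.items())


# Hypothetical mix, for illustration only (not measured data).
example_mix = {"simple": 0.6, "moderate": 0.3, "complex": 0.1}

routed = blended_cost(example_mix)
baseline = PRICE_PER_MTOK["complex"]  # everything sent to the top tier
savings = 1 - routed / baseline
print(f"routed: ${routed:.2f} per 1M tokens, savings vs. all-Complex: {savings:.0%}")
```

With this assumed mix the blended price works out to about $2.55 per 1M tokens. The savings figure depends heavily on the baseline: compared with sending everything to the Moderate tier, the same mix saves far less, so it is worth computing the number against whatever model you use today.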

Flow

  1. User prompt arrives at the router endpoint.
  2. Classifier scores complexity (0–1) via a distilled BERT variant.
  3. Thresholds map the score to a tier and model.
  4. Prompt is forwarded; response streams back to the client.
  5. Cost and latency metrics are logged for tuning.
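
The steps above can be sketched roughly as follows. This is a minimal outline under several assumptions, not a reference implementation: the classifier and the model call are passed in as plain functions (the distilled BERT model and the provider client are not shown), the model names are hypothetical placeholders, a score exactly on a threshold is escalated to the higher tier, and streaming and per-request cost logging (which needs token counts from the provider response) are left out.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("router")

# Starting thresholds from this recipe; tune them against production evals.
SIMPLE_MAX = 0.3
MODERATE_MAX = 0.7

# Hypothetical placeholder names; substitute your real model identifiers.
MODEL_FOR_TIER = {
    "simple": "small-model",
    "moderate": "mid-model",
    "complex": "large-model",
}


@dataclass
class RouteResult:
    tier: str
    model: str
    score: float
    latency_s: float
    response: str


def tier_for_score(score: float) -> str:
    """Map a complexity score in [0, 1] to a tier name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < SIMPLE_MAX:
        return "simple"
    if score < MODERATE_MAX:
        return "moderate"
    return "complex"


def route(
    prompt: str,
    classify: Callable[[str], float],
    call_model: Callable[[str, str], str],
) -> RouteResult:
    """Score the prompt, pick a model, forward the prompt, log metrics."""
    score = classify(prompt)
    tier = tier_for_score(score)
    model = MODEL_FOR_TIER[tier]

    start = time.perf_counter()
    response = call_model(model, prompt)
    latency_s = time.perf_counter() - start

    logger.info(
        "tier=%s model=%s score=%.2f latency_s=%.3f", tier, model, score, latency_s
    )
    return RouteResult(tier, model, score, latency_s, response)


# Usage with stand-ins: a word-count heuristic instead of the real
# classifier, and an echo function instead of a real model call.
result = route(
    "Summarize this sentence.",
    classify=lambda p: min(len(p.split()) / 100, 1.0),
    call_model=lambda model, p: f"[{model}] {p}",
)
print(result.tier, result.model)
```

Passing `classify` and `call_model` in as arguments keeps the routing logic testable without loading a model or calling a provider.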

Threshold tuning

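Start with conservative thresholds of 0.3 and 0.7 (scores below 0.3 map to Simple, scores between 0.3 and 0.7 to Moderate, and scores above 0.7 to Complex), then adjust them based on production evals. Monitor the share of prompts hitting each tier, and spot-check a sample daily to confirm the classifier is not underestimating complexity and sending hard prompts to a model that is too weak for them.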
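
One possible way to support this monitoring is sketched below. It assumes each logged request is a dict with at least a `"tier"` key; that record shape, and the function names `tier_shares` and `spot_check_sample`, are hypothetical. Because underestimated complexity shows up among prompts routed to the cheaper tiers, the sampler draws only from non-Complex traffic by default.

```python
import random
from collections import Counter
from typing import Optional


def tier_shares(records: list) -> dict:
    """Return the fraction of requests that landed in each tier."""
    counts = Counter(rec["tier"] for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: n / total for tier, n in counts.items()}


def spot_check_sample(
    records: list,
    k: int,
    exclude_tier: str = "complex",
    seed: Optional[int] = None,
) -> list:
    """Pick up to k random records from the cheaper tiers for manual review."""
    candidates = [rec for rec in records if rec["tier"] != exclude_tier]
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical log records, for illustration only.
log = [
    {"tier": "simple", "prompt": "What is 2 + 2?"},
    {"tier": "simple", "prompt": "Translate 'hello' to French."},
    {"tier": "moderate", "prompt": "Refactor this function."},
    {"tier": "complex", "prompt": "Prove this theorem."},
]
print(tier_shares(log))
for rec in spot_check_sample(log, k=2, seed=0):
    print(rec["tier"], "|", rec["prompt"])
```

If reviewers keep finding hard prompts in the sample, lowering the thresholds pushes more traffic to the stronger tiers at a higher cost.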