Router pattern

Route each request to the smallest model that can handle it. Save ~50% on inference cost without degrading quality.

How it works

1. Classify

A lightweight classifier inspects the prompt and tags it: simple, moderate, or complex.

2. Route

Simple prompts hit a fast model (gpt-4o-mini). Moderate prompts go to gpt-4o. Complex ones escalate to a frontier model (claude-3-opus).

3. Respond

The user gets output of the same quality. You pay about 50% less because most prompts never touch the expensive model.

Example classifier prompt

You are a prompt classifier. Given a user message, respond
with exactly one word: SIMPLE, MODERATE, or COMPLEX.

SIMPLE — factual lookup, translation, summarization,
  grammar fix, short email.
MODERATE — code review, debugging, data analysis,
  multi-step reasoning.
COMPLEX — architecture design, research synthesis,
  creative writing, long-form content.

User message: "Fix the typo in this sentence."
Response: SIMPLE
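
Models do not always reply with exactly one word, so the classifier output usually needs normalizing before it can drive routing. The following is a minimal sketch, not part of the original pattern: `parseLabel` and `LABELS` are hypothetical names, and falling back to COMPLEX on ambiguous output is an assumed policy that favors quality over cost.

```typescript
// Hypothetical helper: turns raw classifier output into a routing label.
type Label = "SIMPLE" | "MODERATE" | "COMPLEX";

const LABELS: readonly Label[] = ["SIMPLE", "MODERATE", "COMPLEX"];

function parseLabel(raw: string, fallback: Label = "COMPLEX"): Label {
  const upper = raw.toUpperCase();
  // Collect every label that appears as a whole word in the reply.
  const found = LABELS.filter((l) => new RegExp(`\\b${l}\\b`).test(upper));
  // Accept only an unambiguous reply; otherwise use the fallback (assumed policy).
  return found.length === 1 ? found[0] : fallback;
}
```

A reply such as " simple.\n" or "Response: SIMPLE" resolves to SIMPLE, while an empty reply or one that names two labels resolves to the fallback.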

Routing table

Label      Model           Cost / 1M tokens   Latency
SIMPLE     gpt-4o-mini     $0.15              ~400ms
MODERATE   gpt-4o          $2.50              ~1.2s
COMPLEX    claude-3-opus   $15.00             ~3.5s
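
One way to keep the routing table in code is a typed lookup keyed by label, so that adding or repricing a tier is a one-line change. This is a sketch under assumptions: the `Route` interface, its field names, and `routeFor` are illustrative, and the values are copied from the table above.

```typescript
type Label = "SIMPLE" | "MODERATE" | "COMPLEX";

interface Route {
  model: string;            // model identifier from the routing table
  costPerMTokUsd: number;   // cost per 1M tokens, in USD
  typicalLatencyMs: number; // approximate latency, in milliseconds
}

// Values copied from the routing table; field names are illustrative.
const ROUTES: Record<Label, Route> = {
  SIMPLE:   { model: "gpt-4o-mini",   costPerMTokUsd: 0.15, typicalLatencyMs: 400 },
  MODERATE: { model: "gpt-4o",        costPerMTokUsd: 2.5,  typicalLatencyMs: 1200 },
  COMPLEX:  { model: "claude-3-opus", costPerMTokUsd: 15,   typicalLatencyMs: 3500 },
};

// Look up the route for a label; Record<Label, Route> guarantees every label has one.
function routeFor(label: Label): Route {
  return ROUTES[label];
}
```

Because the record is keyed by the `Label` union, the compiler flags a missing tier if a new label is added to the type.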

Cost breakdown

In production workloads, roughly 70% of prompts classify as SIMPLE, 20% as MODERATE, and 10% as COMPLEX. Routing the SIMPLE majority to gpt-4o-mini instead of a frontier model cuts the average per-request cost roughly in half and keeps latency low for most users.
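
To check the cost figure against your own traffic, the blended price can be computed as a traffic-weighted average of the per-tier prices. The sketch below uses the 70/20/10 mix and the routing-table prices; `Tier`, `blendedCost`, and `savingVs` are hypothetical names, and the calculation assumes equal token counts per tier and ignores the classifier's own cost.

```typescript
// Hypothetical helper types and functions for a traffic-weighted price.
interface Tier {
  share: number;          // fraction of prompts routed to this tier (0..1)
  costPerMTokUsd: number; // price per 1M tokens, in USD
}

function blendedCost(tiers: Tier[]): number {
  const totalShare = tiers.reduce((sum, t) => sum + t.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`shares must sum to 1, got ${totalShare}`);
  }
  // Weighted average: each tier's price times its share of traffic.
  return tiers.reduce((sum, t) => sum + t.share * t.costPerMTokUsd, 0);
}

// Fraction saved relative to sending every prompt to one baseline model.
function savingVs(baselineCost: number, blended: number): number {
  return 1 - blended / baselineCost;
}

const mix: Tier[] = [
  { share: 0.7, costPerMTokUsd: 0.15 }, // SIMPLE   -> gpt-4o-mini
  { share: 0.2, costPerMTokUsd: 2.5 },  // MODERATE -> gpt-4o
  { share: 0.1, costPerMTokUsd: 15 },   // COMPLEX  -> claude-3-opus
];

const blended = blendedCost(mix);
```

With these inputs `blended` comes to about 2.105 USD per 1M tokens. That is roughly 16% below an all-gpt-4o baseline and roughly 86% below an all-claude-3-opus baseline, so the saving you observe depends on the baseline you compare against and on your actual traffic mix.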

Implementation sketch

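// Minimal client shape assumed for this sketch.
interface ChatModel { chat(message: string): Promise<string>; }

// Assumed to be provided elsewhere: the classifier and one client per model.
declare function classify(message: string): Promise<string>;
declare const gpt4oMini: ChatModel, gpt4o: ChatModel, claudeOpus: ChatModel;

async function routePrompt(userMessage: string): Promise<string> {
  const label = await classify(userMessage);
  switch (label) {
    case "SIMPLE":
      return gpt4oMini.chat(userMessage);
    case "MODERATE":
      return gpt4o.chat(userMessage);
    case "COMPLEX":
    default:
      // Unrecognized labels fall back to the most capable model
      // instead of returning undefined.
      return claudeOpus.chat(userMessage);
  }
}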