Model evaluation rubric
A repeatable scoring framework for picking the right Meridian-routed model per task. This rubric trades vibes-based selection for measurable axes: accuracy, latency, cost, and refusal posture. Run it once per workload, then let the router pin the winner.
Define the workload contract
Before scoring, lock the contract: inputs, expected outputs, hard constraints (token budget, p95 latency, $/1K calls), and the gold-standard reference set. Without a contract, the rubric collapses into preference.
- Pin 30 to 100 representative prompts with reviewed gold answers.
- Tag each prompt with difficulty, domain, and refusal-risk.
- Freeze the system prompt across all candidate models.
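The checklist above can be captured as a small, frozen data structure so the contract cannot shift between candidate runs. This is a minimal sketch, not a Meridian API: the class names, field names, and the validate() helper are all hypothetical, and the 30 to 100 bound simply mirrors the first bullet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldPrompt:
    """One reviewed prompt in the gold set (hypothetical shape)."""
    prompt: str
    gold_answer: str
    difficulty: str      # e.g. "easy", "medium", "hard"
    domain: str          # e.g. "billing", "legal"
    refusal_risk: str    # e.g. "low", "high"


@dataclass(frozen=True)
class WorkloadContract:
    """Frozen so every candidate model is scored against the same contract."""
    system_prompt: str
    token_budget: int
    p95_latency_ms: float
    max_cost_per_1k_usd: float
    prompts: tuple

    def validate(self):
        """Return a list of human-readable problems; empty means the contract is usable."""
        problems = []
        if not 30 <= len(self.prompts) <= 100:
            problems.append(f"expected 30 to 100 prompts, got {len(self.prompts)}")
        for i, p in enumerate(self.prompts):
            if not p.gold_answer.strip():
                problems.append(f"prompt {i} has no gold answer")
            if not (p.difficulty and p.domain and p.refusal_risk):
                problems.append(f"prompt {i} is missing a tag")
        return problems


def make_prompts(n):
    """Build n placeholder prompts for demonstration only."""
    return tuple(
        GoldPrompt(f"question {i}", f"answer {i}", "medium", "billing", "low")
        for i in range(n)
    )


contract = WorkloadContract(
    system_prompt="You are a support assistant.",
    token_budget=2048,
    p95_latency_ms=2000,
    max_cost_per_1k_usd=1.0,
    prompts=make_prompts(30),
)
print(contract.validate())
```

Returning a list of problems rather than raising on the first one lets a reviewer fix the whole gold set in one pass.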
Score across four axes
Each candidate gets a vector, not a star rating. The router consumes the vector and applies a per-tenant weighting. Depending on that weighting, the cheapest model can still lose: if it refuses 8% of prompts, a slower model that never refuses may score higher.
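How the router turns a vector into a single number is not specified here, so the following is only one plausible sketch. It assumes a weighted sum in which latency and cost are normalized as headroom against the contract's hard limits; the function name, the normalization, and the sample numbers are all illustrative assumptions and are not meant to reproduce any score shown in this document.

```python
def weighted_score(vector, weights, latency_budget_ms, cost_budget_usd):
    """Collapse a four-axis vector into one score in [0, 1] (assumed formula)."""

    def headroom(value, budget):
        # 1.0 means far under budget, 0.0 means at or over budget.
        return max(0.0, 1.0 - value / budget)

    parts = {
        "acc": vector["accuracy"],
        "lat": headroom(vector["p95_latency_ms"], latency_budget_ms),
        "cost": headroom(vector["cost_per_1k_usd"], cost_budget_usd),
        "ref": 1.0 - vector["refusal_rate"],
    }
    # Divide by the weight total so weights need not sum to exactly 1.
    return sum(weights[k] * parts[k] for k in parts) / sum(weights.values())


# Two hypothetical candidates: cheap but refusal-prone vs. slower but never refusing.
cheap = {"accuracy": 0.9, "p95_latency_ms": 800, "cost_per_1k_usd": 0.10, "refusal_rate": 0.08}
slow = {"accuracy": 0.9, "p95_latency_ms": 1600, "cost_per_1k_usd": 0.40, "refusal_rate": 0.0}

# Two hypothetical per-tenant weightings.
cost_sensitive = {"acc": 0.5, "lat": 0.2, "cost": 0.2, "ref": 0.1}
refusal_sensitive = {"acc": 0.2, "lat": 0.05, "cost": 0.05, "ref": 0.7}

for name, weights in [("cost_sensitive", cost_sensitive), ("refusal_sensitive", refusal_sensitive)]:
    scores = {
        "cheap": weighted_score(cheap, weights, 2000, 1.0),
        "slow": weighted_score(slow, weights, 2000, 1.0),
    }
    print(name, "->", max(scores, key=scores.get))
```

Under these made-up numbers the winner flips between the two weightings, which is the reason to keep the vector and the weights separate instead of storing only a final score.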
{
  "model": "azure/model-router",
  "accuracy": 0.94,
  "p95_latency_ms": 1820,
  "cost_per_1k_usd": 0.42,
  "refusal_rate": 0.01,
  "weight": { "acc": 0.5, "lat": 0.2, "cost": 0.2, "ref": 0.1 },
  "score": 0.871
}
Promote, pin, and re-test monthly
Promote the top score to production via a Meridian alias. Pin it for 30 days, then re-run the rubric on the same gold set. Drift is real: a model that wins in March can degrade by May after a provider checkpoint swap. Treat the rubric as a recurring CI job, not a one-time bake-off.
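A recurring job needs a concrete definition of "degraded". The sketch below compares a new run against the archived baseline axis by axis and reports which axes regressed beyond a relative tolerance. The function name, the 5% default tolerance, and the May numbers are assumptions for illustration; only the axis names come from the vector format used in this document.

```python
# True where a larger value is better, False where a smaller value is better.
HIGHER_IS_BETTER = {
    "accuracy": True,
    "p95_latency_ms": False,
    "cost_per_1k_usd": False,
    "refusal_rate": False,
}


def regressed_axes(baseline, current, rel_tolerance=0.05):
    """Return the axes where `current` is worse than `baseline` by more than the tolerance."""
    regressions = []
    for axis, higher_is_better in HIGHER_IS_BETTER.items():
        old, new = baseline[axis], current[axis]
        # Positive improvement means the axis got better, whatever its direction.
        improvement = (new - old) if higher_is_better else (old - new)
        if improvement < -rel_tolerance * abs(old):
            regressions.append(axis)
    return regressions


march = {"accuracy": 0.94, "p95_latency_ms": 1820, "cost_per_1k_usd": 0.42, "refusal_rate": 0.01}
may = {"accuracy": 0.88, "p95_latency_ms": 1850, "cost_per_1k_usd": 0.42, "refusal_rate": 0.01}

drift = regressed_axes(march, may)
if drift:
    print("re-run the bake-off; regressed axes:", drift)
```

In a CI job, a non-empty result would fail the check and prompt a fresh comparison across candidates rather than silently keeping the pinned model.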
Tip: archive every rubric run as a dated artifact in your repo. When a stakeholder asks why you switched models, the answer is a commit hash, not a memory.
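Archiving can be as small as writing each run to a date-stamped JSON file that gets committed alongside the gold set. This is a minimal sketch: the archive_run name, the rubric-YYYY-MM-DD.json naming scheme, and the rubric_runs directory are hypothetical choices, not a convention defined by this document.

```python
import json
from datetime import date
from pathlib import Path


def archive_run(vectors, directory, run_date=None):
    """Write one rubric run to <directory>/rubric-YYYY-MM-DD.json and return the path."""
    run_date = run_date or date.today()
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"rubric-{run_date.isoformat()}.json"
    # Sorted keys and a trailing newline keep diffs between runs small and stable.
    path.write_text(json.dumps(vectors, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    run = [{"model": "azure/model-router", "accuracy": 0.94, "refusal_rate": 0.01}]
    print(archive_run(run, "rubric_runs"))
```

Committing the resulting file is what ties a model switch to a commit hash.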