Recipe: LLM-as-judge eval pattern
Use a stronger model to score outputs from your fine-tuned model. Run side-by-side comparisons, track win rates, and gate deploys on scores that show no regression against the baseline.
1. Define your rubric
Create a JSON schema with 3–5 axes, for example correctness, tone, conciseness, and safety. Each axis gets a 1–5 scale with anchor descriptions that say what a given score means.
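As a minimal sketch of such a rubric, the snippet below builds a JSON Schema from a dictionary of axes and anchor texts. The anchor wording, the names AXES, build_rubric_schema, and validate_scores, and the choice to anchor only scores 1, 3, and 5 are illustrative assumptions, not part of the recipe.

```python
import json

# Hypothetical anchor descriptions; replace with wording that fits your task.
AXES = {
    "correctness": {1: "factually wrong", 3: "mostly right, minor errors", 5: "fully correct"},
    "tone": {1: "rude or off-brand", 3: "neutral", 5: "matches the requested voice"},
    "conciseness": {1: "heavily padded", 3: "some filler", 5: "no wasted words"},
    "safety": {1: "harmful content", 3: "borderline", 5: "no safety concerns"},
}

def build_rubric_schema(axes):
    """Return a JSON Schema that requires an integer 1-5 score for every axis."""
    return {
        "type": "object",
        "properties": {
            name: {
                "type": "integer",
                "minimum": 1,
                "maximum": 5,
                # Anchors are carried in the description so the judge can read them.
                "description": "; ".join(f"{k}: {v}" for k, v in sorted(anchors.items())),
            }
            for name, anchors in axes.items()
        },
        "required": list(axes),
        "additionalProperties": False,
    }

def validate_scores(scores, axes=AXES):
    """Hand-rolled check that mirrors the schema, so no validator library is needed."""
    if set(scores) != set(axes):
        return False
    return all(
        isinstance(v, int) and not isinstance(v, bool) and 1 <= v <= 5
        for v in scores.values()
    )

RUBRIC_SCHEMA = build_rubric_schema(AXES)
print(json.dumps(RUBRIC_SCHEMA, indent=2))
```

Keeping the anchors next to the numeric bounds means the same object can be pasted into the judge prompt and used to check the judge's reply.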
2. Build the judge prompt
Feed the judge model the system prompt, user input, and both candidate outputs. Ask it to return structured JSON scores plus a short justification per axis.
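One way to assemble that prompt and check the reply is sketched below. The section labels, the reply shape (per-response, per-axis objects with "score" and "justification"), and the function names build_judge_prompt and parse_judge_reply are assumptions made for illustration; the call to the judge model itself is left out because it depends on your provider.

```python
import json

AXES = ["correctness", "tone", "conciseness", "safety"]

def build_judge_prompt(system_prompt, user_input, output_a, output_b, axes=AXES):
    """Assemble the text sent to the judge model."""
    axis_list = ", ".join(axes)
    return (
        "You are grading two candidate responses to the same request.\n\n"
        f"[System prompt]\n{system_prompt}\n\n"
        f"[User input]\n{user_input}\n\n"
        f"[Response A]\n{output_a}\n\n"
        f"[Response B]\n{output_b}\n\n"
        f"Score each response from 1 to 5 on these axes: {axis_list}.\n"
        "Reply with JSON only, shaped as "
        '{"A": {"<axis>": {"score": <int>, "justification": "<one sentence>"}}, '
        '"B": {...}}.'
    )

def parse_judge_reply(reply, axes=AXES):
    """Parse the judge's JSON; raise KeyError or ValueError on a malformed reply."""
    data = json.loads(reply)
    for side in ("A", "B"):
        for axis in axes:
            entry = data[side][axis]  # KeyError if the judge skipped an axis
            score = entry["score"]
            if not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(f"bad score for {side}/{axis}: {score!r}")
            if not isinstance(entry["justification"], str):
                raise ValueError(f"missing justification for {side}/{axis}")
    return data
```

Rejecting malformed replies up front keeps a single bad judge response from silently skewing the win counts.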
3. Run pairwise evals
Compare your new model against the production baseline on a held-out set of 200+ prompts. Record wins, losses, and ties.
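The bookkeeping for this step can be sketched as follows. The rule used here to decide a winner (compare the total score across axes, equal totals count as a tie) is an assumption; the recipe does not say how per-axis scores roll up into a win or loss, and pick_winner and tally are hypothetical names.

```python
from collections import Counter

def pick_winner(candidate_scores, baseline_scores):
    """Return 'win', 'loss', or 'tie' from the candidate's point of view.

    Assumption: both dicts map the same axes to 1-5 scores, and the
    response with the higher total across axes wins.
    """
    cand_total = sum(candidate_scores[axis] for axis in baseline_scores)
    base_total = sum(baseline_scores.values())
    if cand_total > base_total:
        return "win"
    if cand_total < base_total:
        return "loss"
    return "tie"

def tally(outcomes):
    """Count outcomes over the whole held-out set."""
    counts = Counter(outcomes)
    return {"win": counts["win"], "loss": counts["loss"], "tie": counts["tie"]}

# Usage with three made-up score pairs:
pairs = [
    ({"correctness": 5, "tone": 4}, {"correctness": 3, "tone": 4}),
    ({"correctness": 2, "tone": 4}, {"correctness": 4, "tone": 4}),
    ({"correctness": 4, "tone": 3}, {"correctness": 3, "tone": 4}),
]
summary = tally(pick_winner(cand, base) for cand, base in pairs)
```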
4. Gate on thresholds
Only promote if the win rate exceeds 52% and no axis's mean score drops below the baseline mean for that axis. Automate this in CI with a script that calls your judge endpoint.
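A gate along these lines might look like the sketch below. It assumes the judge results have already been collected, that ties count in the denominator of the win rate, and that per-axis scores are stored as lists keyed by axis name; the function names are hypothetical.

```python
from statistics import mean

WIN_RATE_THRESHOLD = 0.52  # from the recipe: promote only above 52%

def win_rate(wins, losses, ties):
    """Wins over all comparisons. Assumption: ties count in the denominator."""
    total = wins + losses + ties
    return wins / total if total else 0.0

def regressed_axes(candidate_scores, baseline_scores):
    """Return axes whose candidate mean is below the baseline mean.

    Both arguments map axis name -> list of 1-5 scores over the eval set.
    """
    return sorted(
        axis
        for axis in baseline_scores
        if mean(candidate_scores[axis]) < mean(baseline_scores[axis])
    )

def should_promote(wins, losses, ties, candidate_scores, baseline_scores):
    """Apply both conditions of the gate."""
    above_threshold = win_rate(wins, losses, ties) > WIN_RATE_THRESHOLD
    return above_threshold and not regressed_axes(candidate_scores, baseline_scores)

def gate_exit_code(wins, losses, ties, candidate_scores, baseline_scores):
    """0 if the candidate may be promoted, 1 otherwise (for use with sys.exit)."""
    return 0 if should_promote(wins, losses, ties, candidate_scores, baseline_scores) else 1
```

In a CI job, passing the result of gate_exit_code to sys.exit makes the pipeline step fail whenever either condition is not met.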
Tip: Use GPT-4o or Claude 3.5 Sonnet as the judge. Position bias is real: randomize output order and run each pair twice with swapped positions.
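The swapped-position check can be sketched as below. Here judge is a hypothetical callable that takes the two outputs in display order and returns "first", "second", or "tie"; treating a disagreement between the two orderings as a tie is an assumption, chosen because a verdict that flips with position carries little signal.

```python
import random

def judge_both_orders(judge, candidate, baseline, rng=random):
    """Judge a pair twice with positions swapped; return 'win', 'loss', or 'tie'.

    judge(first, second) -> 'first' | 'second' | 'tie' (hypothetical callable).
    The candidate wins or loses only if both orderings agree.
    """
    orders = [("candidate", "baseline"), ("baseline", "candidate")]
    rng.shuffle(orders)  # randomize which ordering is run first
    outputs = {"candidate": candidate, "baseline": baseline}

    verdicts = []
    for first, second in orders:
        pick = judge(outputs[first], outputs[second])
        # Map the positional verdict back to the model it refers to.
        verdicts.append({"first": first, "second": second}.get(pick, "tie"))

    if verdicts[0] == verdicts[1] and verdicts[0] != "tie":
        return "win" if verdicts[0] == "candidate" else "loss"
    return "tie"

# A judge that always prefers whatever is shown first is fully position-biased;
# the swap turns its verdicts into ties instead of spurious wins.
always_first = lambda first, second: "first"
```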