Recipe: LLM-as-judge eval pattern

Use a stronger model to score outputs from your fine-tuned model. Run side-by-side comparisons, track win rates, and gate deploys on regression-free scores.

1. Define your rubric

Create a JSON schema with 3–5 axes, such as correctness, tone, conciseness, and safety. Each axis gets a 1–5 scale with anchor descriptions (a short statement of what a given score means).
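A rubric along these lines could be sketched as a JSON Schema built from a Python dict. The anchor wording and the names `AXES`, `build_rubric_schema`, and `RUBRIC_SCHEMA` are illustrative assumptions, not part of the recipe; adapt the anchors to your own product.

```python
import json

# Hypothetical anchor descriptions for the lowest and highest score on each axis.
AXES = {
    "correctness": {"1": "factually wrong or does not answer", "5": "fully correct and complete"},
    "tone": {"1": "rude or off-brand", "5": "matches the requested voice"},
    "conciseness": {"1": "padded or repetitive", "5": "no wasted words"},
    "safety": {"1": "harmful or policy-violating", "5": "no safety concerns"},
}


def build_rubric_schema(axes):
    """Return a JSON Schema dict that requires an integer 1-5 score for every axis."""
    properties = {}
    for name, anchors in axes.items():
        properties[name] = {
            "type": "integer",
            "minimum": 1,
            "maximum": 5,
            "description": f"1 = {anchors['1']}; 5 = {anchors['5']}",
        }
    return {
        "type": "object",
        "properties": properties,
        "required": list(axes),
        "additionalProperties": False,
    }


RUBRIC_SCHEMA = build_rubric_schema(AXES)
print(json.dumps(RUBRIC_SCHEMA, indent=2))
```

Keeping the anchors inside the schema means the same object can be pasted into the judge prompt and used to validate the judge's reply.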

2. Build the judge prompt

Feed the judge model the system prompt, user input, and both candidate outputs. Ask it to return structured JSON scores plus a short justification per axis.
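One way to assemble the judge prompt and read back its reply is sketched below. The prompt wording, the reply shape (`{"A": {axis: {"score", "justification"}}, "B": ...}`), and the helper names `build_judge_prompt` and `parse_judge_reply` are assumptions made for illustration. The call to the judge model itself is left out because it depends on your provider.

```python
import json

AXES = ["correctness", "tone", "conciseness", "safety"]


def build_judge_prompt(system_prompt, user_input, output_a, output_b, axes=AXES):
    """Build a single prompt string containing the context and both candidates."""
    axis_list = ", ".join(axes)
    return (
        "You are grading two candidate responses to the same request.\n\n"
        f"[System prompt]\n{system_prompt}\n\n"
        f"[User input]\n{user_input}\n\n"
        f"[Response A]\n{output_a}\n\n"
        f"[Response B]\n{output_b}\n\n"
        f"Score each response from 1 to 5 on these axes: {axis_list}.\n"
        "Reply with JSON only, shaped like "
        '{"A": {"<axis>": {"score": <int>, "justification": "<one sentence>"}}, '
        '"B": {...}}.'
    )


def parse_judge_reply(raw, axes=AXES):
    """Parse the judge's JSON reply and return {"A": {axis: score}, "B": {axis: score}}."""
    data = json.loads(raw)
    scores = {}
    for label in ("A", "B"):
        scores[label] = {}
        for axis in axes:
            score = data[label][axis]["score"]
            if not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(f"{label}/{axis}: score {score!r} is not an integer in 1-5")
            scores[label][axis] = score
    return scores
```

Rejecting malformed replies early keeps bad judge output from being silently counted as a score.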

3. Run pairwise evals

Compare your new model against the production baseline on a held-out set of 200+ prompts. Record wins, losses, and ties.
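Recording wins, losses, and ties might look like the sketch below. The recipe does not say how a per-prompt winner is chosen, so this assumes the model with the higher total score across axes wins the prompt and equal totals are a tie. The function names are hypothetical.

```python
from collections import Counter


def outcome(new_scores, base_scores):
    """Classify one prompt from the new model's point of view.

    Assumption: the higher total across all axes wins; equal totals tie.
    """
    new_total = sum(new_scores.values())
    base_total = sum(base_scores.values())
    if new_total > base_total:
        return "win"
    if new_total < base_total:
        return "loss"
    return "tie"


def tally(results):
    """Count outcomes over an iterable of (new_scores, base_scores) pairs."""
    counts = Counter(outcome(new, base) for new, base in results)
    return {key: counts[key] for key in ("win", "loss", "tie")}


# Three made-up prompts, scored on two axes for brevity.
example = [
    ({"correctness": 5, "tone": 4}, {"correctness": 4, "tone": 4}),
    ({"correctness": 3, "tone": 3}, {"correctness": 4, "tone": 4}),
    ({"correctness": 4, "tone": 4}, {"correctness": 4, "tone": 4}),
]
print(tally(example))
```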

4. Gate on thresholds

Promote only if the win rate exceeds 52% and no axis's mean score drops below the baseline's mean for that axis. Automate this in CI with a script that calls your judge endpoint.
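The gate itself can be a small pure function that a CI script wraps. This is a sketch under two assumptions the recipe leaves open: win rate is computed as wins divided by all compared prompts (ties count against the candidate), and per-axis means are computed beforehand. The name `gate` and its parameters are illustrative.

```python
def gate(wins, losses, ties, new_means, base_means, min_win_rate=0.52):
    """Return (ok, reasons). ok is True only when every promotion check passes.

    Assumption: win rate = wins / (wins + losses + ties).
    """
    total = wins + losses + ties
    if total == 0:
        return False, ["no eval results"]

    reasons = []
    win_rate = wins / total
    if win_rate <= min_win_rate:
        reasons.append(f"win rate {win_rate:.1%} does not exceed {min_win_rate:.0%}")

    # Every axis mean must be at least the baseline's mean on that axis.
    for axis, base in base_means.items():
        if new_means[axis] < base:
            reasons.append(f"{axis} mean {new_means[axis]:.2f} is below baseline {base:.2f}")

    return not reasons, reasons


ok, reasons = gate(
    wins=110, losses=80, ties=10,
    new_means={"correctness": 4.2, "safety": 4.8},
    base_means={"correctness": 4.1, "safety": 4.8},
)
print("promote" if ok else "block", reasons)
```

In a CI job, the wrapper script would exit non-zero when `ok` is False, for example with `sys.exit(0 if ok else 1)`, so the pipeline blocks the deploy.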

Tip: Use a strong model such as GPT-4o or Claude 3.5 Sonnet as the judge. Position bias is real: randomize output order and run each pair twice with swapped positions.
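The swapped-position check could be sketched as follows. Here `judge` is a hypothetical callable that wraps your judge model and returns "A", "B", or "tie" for the two outputs it is shown. Treating disagreement between the two orderings as a tie is one reasonable reconciliation rule, assumed here rather than taken from the recipe.

```python
import random


def judge_pair_debiased(judge, new_output, base_output, rng=random):
    """Judge one pair in both orders and return "win", "loss", or "tie" for the new model."""
    new_first = rng.random() < 0.5  # randomize which ordering runs first
    verdicts = []
    for new_is_a in (new_first, not new_first):
        if new_is_a:
            a, b = new_output, base_output
        else:
            a, b = base_output, new_output
        pick = judge(a, b)
        if pick == "tie":
            verdicts.append("tie")
        elif (pick == "A") == new_is_a:
            verdicts.append("win")
        else:
            verdicts.append("loss")
    # Count a win or loss only when both orderings agree; otherwise call it a tie.
    return verdicts[0] if verdicts[0] == verdicts[1] else "tie"


def always_first(a, b):
    """Toy judge with pure position bias: it always prefers slot A."""
    return "A"


def prefers_longer(a, b):
    """Toy judge that ignores position and picks the longer output."""
    return "A" if len(a) > len(b) else "B"


print(judge_pair_debiased(always_first, "new answer", "old answer"))
print(judge_pair_debiased(prefers_longer, "a longer new answer", "short"))
```

A judge that only ever favors one slot contributes nothing but ties under this rule, which is the point of running both orders.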