Recipe
LLM Evaluation Framework
Ship prompts with confidence. A systematic approach to measuring, comparing, and improving LLM outputs across your product surface area.
Why evals matter
Prompt tweaks feel productive until you realize you broke three other use cases. A lightweight eval harness catches regressions before they reach users and gives you a repeatable benchmark for every model or prompt change.
The stack
- Test cases: JSONL files, one case per row, each with an input, an expected output, and a grading rubric
- Runner: a thin Python or Node script that sends each case through your prompt and collects the responses
- Grader: LLM-as-judge with a structured rubric, or deterministic checks for exact-match fields
- Dashboard: pass/fail summary, per-category scores, and a diff view between runs
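As a minimal sketch of how the test cases, runner, and a deterministic grader might fit together, the Python below parses JSONL cases, sends each one through a placeholder model call, and reports a pass rate per category. The field names (`id`, `category`, `input`, `expected`), the `call_model` stub, and every function name are assumptions for illustration, not the format of any particular harness. An LLM-as-judge grader would slot in where `exact_match` is called.

```python
import json
from collections import defaultdict
from typing import Callable

# Hypothetical test cases in JSONL form: one JSON object per line.
# The field names are assumptions made for this sketch.
SAMPLE_JSONL = """\
{"id": "greet-1", "category": "greeting", "input": "hello", "expected": "HELLO"}
{"id": "greet-2", "category": "greeting", "input": "hi", "expected": "HI"}
{"id": "empty-1", "category": "edge", "input": "", "expected": "EMPTY"}
"""


def load_cases(text: str) -> list[dict]:
    """Parse JSONL text into a list of test-case dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def call_model(prompt_input: str) -> str:
    """Stand-in for a real model call; replace with your own client code."""
    return prompt_input.upper()


def exact_match(response: str, expected: str) -> bool:
    """Deterministic grader: pass only if the trimmed strings are identical."""
    return response.strip() == expected.strip()


def run_suite(cases: list[dict], model: Callable[[str], str] = call_model) -> list[dict]:
    """Send every case through the model and record a pass/fail result."""
    results = []
    for case in cases:
        response = model(case["input"])
        results.append({
            "id": case["id"],
            "category": case["category"],
            "response": response,
            "passed": exact_match(response, case["expected"]),
        })
    return results


def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Summarize results as the fraction of passing cases per category."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {cat: passes[cat] / totals[cat] for cat in totals}


if __name__ == "__main__":
    results = run_suite(load_cases(SAMPLE_JSONL))
    print(pass_rate_by_category(results))
```

Passing the model in as a plain callable keeps the runner independent of any one provider, so the same suite can be pointed at a different prompt or model by swapping a single argument.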
Workflow
- Define 20–50 representative inputs covering happy path, edge cases, and failure modes.
- Write a grading rubric: what makes a 5/5 response vs a 2/5? Be specific.
- Run baseline against your current prompt. Record scores.
- Make one prompt change. Re-run. Compare deltas.
- Gate deploys: if any category's score drops below its threshold, block the release.
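The compare-and-gate steps above can be sketched as follows. This is one possible shape, assuming each run has already been summarized as a mapping from category to mean rubric score on a 1-5 scale. The category names, the scores, and the 4.0 threshold are made up for illustration.

```python
# Hypothetical per-category mean rubric scores (1-5 scale) for two runs.
BASELINE = {"happy_path": 4.6, "edge_cases": 4.1, "failure_modes": 3.9}
CANDIDATE = {"happy_path": 4.8, "edge_cases": 3.7, "failure_modes": 4.2}

THRESHOLD = 4.0  # assumed minimum acceptable score per category


def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every category present in both runs."""
    return {
        cat: round(candidate[cat] - baseline[cat], 2)
        for cat in baseline
        if cat in candidate
    }


def failing_categories(scores: dict[str, float], threshold: float) -> list[str]:
    """List the categories whose score is below the threshold, sorted by name."""
    return sorted(cat for cat, score in scores.items() if score < threshold)


def gate(baseline: dict[str, float], candidate: dict[str, float], threshold: float) -> int:
    """Print a delta report and return a process exit code (0 = ship, 1 = block)."""
    for cat, delta in score_deltas(baseline, candidate).items():
        print(f"{cat}: {candidate[cat]:.1f} ({delta:+.2f} vs baseline)")
    failures = failing_categories(candidate, threshold)
    if failures:
        print("Blocking release; below threshold:", ", ".join(failures))
        return 1
    print("All categories at or above threshold.")
    return 0


exit_code = gate(BASELINE, CANDIDATE, THRESHOLD)
print("exit code:", exit_code)
```

In a CI job, the return value of `gate` could be passed to `sys.exit` so that a nonzero code fails the check. Whether to also block on a large negative delta that stays above the threshold is a policy choice this sketch leaves open.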
Get the template
Clone the Meridian eval harness repo. It includes a sample test suite, grading prompts, and a GitHub Actions workflow that runs evals on every PR.
github.com/meridian/llm-evals