Recipe

LLM Evaluation Framework

Ship prompts with confidence. A systematic approach to measuring, comparing, and improving LLM outputs across your product surface area.

Why evals matter

Prompt tweaks feel productive until you realize you broke three other use cases. A lightweight eval harness catches regressions before they reach users and gives you a repeatable benchmark for every model or prompt change.

The stack

Test cases — JSONL files with input, expected output, and grading rubric per row
Runner — thin Python or Node script that sends each case through your prompt, collects responses
Grader — LLM-as-judge with a structured rubric, or deterministic checks for exact-match fields
Dashboard — pass/fail summary, per-category scores, diff view between runs
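
A minimal sketch of how the first three pieces could fit together, assuming a deterministic exact-match grader. The JSONL field names (id, category, input, expected), the helper names, and the fake_model stand-in are all illustrative assumptions, not the layout of any particular harness; swap fake_model for a call to your real prompt and model.

```python
import json
from typing import Callable

# Hypothetical test cases, one JSON object per line (JSONL).
# The field names here are assumptions made for illustration.
SAMPLE_JSONL = """\
{"id": "refund-1", "category": "billing", "input": "How do I get a refund?", "expected": "refund_policy"}
{"id": "greet-1", "category": "smalltalk", "input": "hi", "expected": "greeting"}
"""


def load_cases(jsonl_text: str) -> list[dict]:
    """Parse JSONL text into a list of test-case dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def exact_match(output: str, expected: str) -> bool:
    """Deterministic grader: case-insensitive match after trimming whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_cases(cases: list[dict], call_model: Callable[[str], str]) -> list[dict]:
    """Runner: send each input through the model and attach output and verdict."""
    results = []
    for case in cases:
        output = call_model(case["input"])
        passed = exact_match(output, case["expected"])
        results.append({**case, "output": output, "passed": passed})
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "refund_policy" if "refund" in prompt.lower() else "unknown"


results = run_cases(load_cases(SAMPLE_JSONL), fake_model)
for r in results:
    print(r["id"], "PASS" if r["passed"] else "FAIL")
```

With the stand-in model, the first case passes and the second fails, which is the kind of pass/fail row a dashboard would summarize. An LLM-as-judge grader would replace exact_match with a call that scores the output against the rubric.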

Workflow

1. Define 20–50 representative inputs covering happy path, edge cases, and failure modes.
2. Write a grading rubric: what makes a 5/5 response vs a 2/5? Be specific.
3. Run baseline against your current prompt. Record scores.
4. Make one prompt change. Re-run. Compare deltas.
5. Gate deploys: if any category drops below threshold, block the release.
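
The compare and gate steps can be sketched as follows, assuming each graded result carries a category and a 1 to 5 rubric score. The function names, the sample scores, and the 3.5 threshold are hypothetical values chosen for illustration; pick thresholds that fit your own rubric.

```python
from collections import defaultdict


def category_scores(results: list[dict]) -> dict[str, float]:
    """Mean rubric score per category."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in by_category.items()}


def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-category delta (candidate minus baseline)."""
    return {cat: candidate.get(cat, 0.0) - base for cat, base in baseline.items()}


def gate(candidate: dict[str, float], threshold: float) -> list[str]:
    """Return the categories whose mean score falls below the threshold."""
    return sorted(cat for cat, score in candidate.items() if score < threshold)


# Made-up scores for a baseline prompt and a candidate prompt change.
baseline_results = [
    {"category": "happy_path", "score": 5},
    {"category": "happy_path", "score": 4},
    {"category": "edge_case", "score": 4},
    {"category": "edge_case", "score": 4},
]
candidate_results = [
    {"category": "happy_path", "score": 5},
    {"category": "happy_path", "score": 5},
    {"category": "edge_case", "score": 3},
    {"category": "edge_case", "score": 2},
]

baseline = category_scores(baseline_results)
candidate = category_scores(candidate_results)
deltas = compare_runs(baseline, candidate)
failing = gate(candidate, threshold=3.5)

# A CI script could pass this to sys.exit() so a non-zero code blocks the release.
exit_code = 1 if failing else 0

for cat, delta in deltas.items():
    print(f"{cat}: {baseline[cat]:.2f} -> {candidate[cat]:.2f} ({delta:+.2f})")
print("blocked by:", failing)
```

In this made-up run the change helps the happy path but drags edge cases below the threshold, so the gate blocks the release even though one category improved.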

Get the template

Clone the Meridian eval harness repo. It includes a sample test suite, grading prompts, and a GitHub Actions workflow that runs evals on every PR.

github.com/meridian/llm-evals