Recipe

LLM Evaluation Framework

Ship prompts with confidence. A systematic approach to measuring, comparing, and improving LLM outputs across your product surface area.

Why evals matter

Prompt tweaks feel productive until you realize you broke three other use cases. A lightweight eval harness catches regressions before they reach users and gives you a repeatable benchmark for every model or prompt change.

The stack

  • Test cases — JSONL files with input, expected output, and grading rubric per row
  • Runner — thin Python or Node script that sends each case through your prompt and collects the responses
  • Grader — LLM-as-judge with a structured rubric, or deterministic checks for exact-match fields
  • Dashboard — pass/fail summary, per-category scores, diff view between runs
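As a rough illustration of how the first three pieces fit together, here is a minimal Python sketch. The JSONL field names (`id`, `category`, `input`, `expected`, `rubric`), the sample rows, and the `fake_model` stand-in are all assumptions made for this example, not the layout of any particular harness. The grader shown is the deterministic exact-match kind; an LLM-as-judge grader would replace `exact_match` with a model call that scores the response against the rubric.

```python
import json
from typing import Callable

# Hypothetical test suite: one JSON object per line (JSONL).
# Field names and sample rows are assumptions for illustration only.
SAMPLE_JSONL = """\
{"id": "refund-1", "category": "billing", "input": "How do I get a refund?", "expected": "refund_policy", "rubric": "Routes to the refund policy."}
{"id": "greet-1", "category": "smalltalk", "input": "hi", "expected": "greeting", "rubric": "Recognizes a greeting."}
"""


def load_cases(lines):
    """Parse JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def exact_match(expected: str, actual: str) -> bool:
    """Deterministic grader: case- and whitespace-insensitive equality."""
    return expected.strip().lower() == actual.strip().lower()


def run_suite(cases, call_model: Callable[[str], str]):
    """Send each case through the model and grade the response."""
    results = []
    for case in cases:
        response = call_model(case["input"])
        results.append({
            "id": case["id"],
            "category": case["category"],
            "response": response,
            "passed": exact_match(case["expected"], response),
        })
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; swap in your own client here."""
    return "refund_policy" if "refund" in prompt.lower() else "unknown"


if __name__ == "__main__":
    cases = load_cases(SAMPLE_JSONL.splitlines())
    for row in run_suite(cases, fake_model):
        print(row["id"], row["category"], "PASS" if row["passed"] else "FAIL")
```

Passing the model call in as a plain function keeps the runner independent of any provider SDK, so the same suite can be re-run against a different model or prompt by swapping one argument.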

Workflow

  1. Define 20–50 representative inputs covering happy path, edge cases, and failure modes.
  2. Write a grading rubric: what makes a 5/5 response vs a 2/5? Be specific.
  3. Run baseline against your current prompt. Record scores.
  4. Make one prompt change. Re-run. Compare deltas.
  5. Gate deploys: if any category drops below threshold, block the release.
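Steps 3 to 5 can be sketched as a small comparison script. This is one possible shape, under stated assumptions: each graded row is a dict with a `category` and a numeric `score` on the 1 to 5 rubric scale, and the threshold of 4.0 is an arbitrary example value. The function names are hypothetical.

```python
from collections import defaultdict


def category_scores(rows):
    """Mean score per category from rows like {"category": str, "score": float}."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}


def gate(baseline_rows, candidate_rows, threshold=4.0):
    """Print per-category deltas and return the categories below threshold.

    An empty list means the candidate run passes the gate.
    """
    base = category_scores(baseline_rows)
    cand = category_scores(candidate_rows)
    for cat in sorted(cand):
        # A category with no baseline is reported with a delta of zero.
        delta = cand[cat] - base.get(cat, cand[cat])
        print(f"{cat}: {cand[cat]:.2f} ({delta:+.2f} vs baseline)")
    return sorted(cat for cat, score in cand.items() if score < threshold)


# Hypothetical graded results from two runs (illustrative data only).
BASELINE = [
    {"category": "billing", "score": 5},
    {"category": "billing", "score": 4},
    {"category": "smalltalk", "score": 4},
]
CANDIDATE = [
    {"category": "billing", "score": 3},
    {"category": "billing", "score": 4},
    {"category": "smalltalk", "score": 5},
]

if __name__ == "__main__":
    failures = gate(BASELINE, CANDIDATE, threshold=4.0)
    # In CI, a non-empty list would translate to a non-zero exit code.
    print("BLOCK release:" if failures else "OK to release", failures)
```

Gating on the absolute per-category score, rather than on the delta, matches step 5 as written; a stricter variant could also block on any category whose delta falls below a chosen tolerance.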

Get the template

Clone the Meridian eval harness repo — includes a sample test suite, grading prompts, and a GitHub Actions workflow that runs evals on every PR.

github.com/meridian/llm-evals