LLM eval set curation
Build a high-signal evaluation dataset that catches regressions before your users do.
Why this matters
A curated eval set is the difference between shipping with confidence and praying nothing broke. Generic benchmarks measure generic performance. Your eval set measures your use case.
Step 1 — Source real prompts
Export the last 500–1000 user prompts from production logs. Strip PII. Keep the distribution intact — do not cherry-pick only hard cases or only easy ones. The eval set must mirror real traffic shape.
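As a rough illustration of the export step, the sketch below takes the most recent prompts without filtering and redacts two common PII shapes. The regular expressions and the names `strip_pii` and `take_recent` are hypothetical. Real PII scrubbing needs much broader coverage (names, addresses, account IDs), so treat this as a starting point rather than a complete scrubber.

```python
import re

# Assumed, deliberately simple patterns; they will miss many PII forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def strip_pii(prompt: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    prompt = EMAIL_RE.sub("[EMAIL]", prompt)
    prompt = PHONE_RE.sub("[PHONE]", prompt)
    return prompt


def take_recent(prompts: list[str], limit: int = 1000) -> list[str]:
    """Return the last `limit` prompts, redacted.

    Assumes `prompts` is ordered oldest-first. Nothing is filtered by
    difficulty, so the traffic distribution is left intact.
    """
    return [strip_pii(p) for p in prompts[-limit:]]
```

Taking a contiguous recent window, rather than hand-picking, is what keeps the sample shaped like real traffic.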
Step 2 — Stratify by category
Tag each prompt with a category: factual QA, creative generation, code synthesis, summarization, classification, tool use. Aim for proportional representation. If 40% of production traffic is code synthesis, 40% of your eval set should be too.
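One way to get proportional representation is a stratified sample with largest-remainder rounding, sketched below. The function name `stratified_sample` and the `(prompt, category)` tuple layout are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(tagged, size, seed=0):
    """Sample `size` (prompt, category) pairs, keeping category shares
    as close as possible to their shares in `tagged`."""
    if size > len(tagged):
        raise ValueError("size exceeds the number of tagged prompts")

    by_cat = defaultdict(list)
    for prompt, category in tagged:
        by_cat[category].append(prompt)

    # Ideal (fractional) quota per category, then round down.
    exact = {c: size * len(ps) / len(tagged) for c, ps in by_cat.items()}
    quota = {c: int(x) for c, x in exact.items()}

    # Hand leftover slots to the categories with the largest remainders.
    leftover = size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = []
    for c, ps in by_cat.items():
        sample.extend((p, c) for p in rng.sample(ps, quota[c]))
    return sample
```

With 40 code synthesis prompts and 60 factual QA prompts in the pool, a sample of 10 contains 4 and 6 respectively.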
Step 3 — Write reference answers
For each prompt, write a concise reference answer that captures the minimum acceptable quality bar. These are not gold-standard essays — they are pass/fail gates. A model either meets the bar or it does not.
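Reference answers are easier to review and diff when each case is stored as one record per line. The sketch below assumes a JSONL layout with four fields (`id`, `category`, `prompt`, `reference`). Both the field names and the sample case are hypothetical.

```python
import json

# Assumed schema: every case carries these four fields.
REQUIRED = {"id", "category", "prompt", "reference"}

cases = [
    {
        "id": "code-001",
        "category": "code synthesis",
        "prompt": "Write total(xs) that sums a list and returns 0 for an empty list.",
        "reference": "def total(xs):\n    return sum(xs)",
    },
]


def dump_jsonl(cases):
    """Serialize cases as one JSON object per line."""
    return "\n".join(json.dumps(c, sort_keys=True) for c in cases)


def load_jsonl(text):
    """Parse JSONL text and reject records with missing fields."""
    loaded = [json.loads(line) for line in text.splitlines() if line.strip()]
    for case in loaded:
        missing = REQUIRED - set(case)
        if missing:
            raise ValueError(f"case {case.get('id')!r} is missing {sorted(missing)}")
    return loaded
```

The reference here is deliberately short: it encodes the minimum bar, not an ideal answer.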
Step 4 — Define grading criteria
Per category, define 2–3 binary criteria. Example for code synthesis: (1) compiles without errors, (2) handles the edge case mentioned in the prompt. Binary grading leaves little room for grader disagreement and makes eval results easier to reproduce.
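The code synthesis criteria above could be expressed as small boolean functions, as in the sketch below. It assumes the model output is Python source, uses the built-in `compile()` as a syntax-only stand-in for "compiles without errors", and hard-codes one hypothetical edge case (a `total` function that must return 0 for an empty list). Running `exec` on model output is unsafe outside a sandbox, so treat this strictly as an illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    category: str
    reference: str  # minimum acceptable answer


def compiles(output: str, case: EvalCase) -> bool:
    """Criterion 1: the output parses as Python (syntax check only)."""
    try:
        compile(output, "<model_output>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def handles_empty_list(output: str, case: EvalCase) -> bool:
    """Criterion 2 (hypothetical): total([]) returns 0.

    WARNING: exec runs untrusted code; use a sandbox in practice.
    """
    namespace = {}
    try:
        exec(output, namespace)
        return namespace["total"]([]) == 0
    except Exception:
        return False


# Assumed registry: each category maps to its 2-3 binary criteria.
CRITERIA = {"code synthesis": [compiles, handles_empty_list]}


def grade(output: str, case: EvalCase):
    """Return (passed, per-criterion results); passing requires every criterion."""
    results = {fn.__name__: fn(output, case) for fn in CRITERIA[case.category]}
    return all(results.values()), results
```

Returning the per-criterion results alongside the overall verdict makes it easier to see which check a failing case tripped.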
Step 5 — Automate and gate
Run the eval set on every model change. Fail the pipeline if any category drops below its baseline pass rate, that is, the rate recorded for the last accepted change. Store results per commit so you can bisect regressions to the exact change that caused them.
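A minimal sketch of the gate, assuming graded results arrive as `(category, passed)` pairs and the baseline is a dict of per-category pass rates. The function names and the record layout are hypothetical, and wiring the exit code into a CI system is left out.

```python
import json


def pass_rates(results):
    """Compute the pass rate per category from (category, passed) pairs."""
    totals, passes = {}, {}
    for category, passed in results:
        totals[category] = totals.get(category, 0) + 1
        passes[category] = passes.get(category, 0) + int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def find_regressions(current, baseline):
    """Return {category: (baseline_rate, current_rate)} for every drop.

    A category missing from `current` counts as a pass rate of 0.0.
    """
    return {
        c: (baseline[c], current.get(c, 0.0))
        for c in baseline
        if current.get(c, 0.0) < baseline[c]
    }


def gate(results, baseline, commit):
    """Return (exit_code, json_record) for one eval run."""
    rates = pass_rates(results)
    failed = find_regressions(rates, baseline)
    record = json.dumps(
        {"commit": commit, "pass_rates": rates, "failed": sorted(failed)},
        sort_keys=True,
    )
    return (1 if failed else 0), record
```

In a pipeline, the exit code would be passed to `sys.exit` and the record appended to a per-commit results log, which is what makes later bisection possible.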
Maintenance cadence
Refresh 10–20% of prompts monthly from recent production logs. Deprecate prompts that no longer reflect user behavior. An eval set rots if it is not fed fresh data.
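The monthly refresh could look like the sketch below. It assumes the eval set is ordered oldest-first and uses age as a stand-in for "no longer reflects user behavior", which in practice is a judgment call. The name `refresh` and the default fraction are hypothetical, and the sketch does not rebalance categories.

```python
import random


def refresh(eval_set, fresh_pool, fraction=0.15, seed=0):
    """Retire the oldest `fraction` of prompts and add unseen fresh ones.

    Assumes `eval_set` is ordered oldest-first. If the fresh pool has too
    few unseen prompts, fewer prompts are swapped.
    """
    existing = set(eval_set)
    candidates = [p for p in fresh_pool if p not in existing]
    n = min(round(len(eval_set) * fraction), len(candidates))

    rng = random.Random(seed)  # fixed seed so the refresh is repeatable
    kept = eval_set[n:]        # drop the n oldest prompts
    return kept + rng.sample(candidates, n)
```

The set size stays constant across refreshes, so pass rates remain comparable from month to month.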
Pro tip: Keep the eval set itself under 200 prompts for fast iteration, sampled proportionally from the larger export in Step 1. Expand only when you have evidence that the smaller set misses regressions. Speed of feedback matters more than coverage breadth.