AI Evaluation Framework
A systematic approach to measuring LLM output quality across accuracy, safety, and alignment dimensions.
Overview
Evaluating AI outputs requires more than spot-checking. This framework defines repeatable metrics, human-in-the-loop scoring rubrics, and automated regression suites that catch drift before it reaches production.
Core Metrics
- Factual Accuracy: groundedness against a verified knowledge base, measured via entailment scoring.
- Instruction Adherence: constraint satisfaction rate across explicit and implicit directives.
- Safety Refusal: refusing adversarial prompts that cross a policy boundary while still answering benign ones (no over-refusal).
- Tone & Style: brand voice consistency, scored via embedding similarity to golden examples.
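Two of the metrics above reduce to simple arithmetic once their inputs exist. The following is a minimal sketch, not a reference implementation: it assumes constraints can be expressed as boolean checks on the output text and that embeddings are already available as plain lists of floats. The function names (`constraint_satisfaction_rate`, `tone_score`) are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Sequence

# Hypothetical type: a constraint is any function that inspects the output text.
Constraint = Callable[[str], bool]


def constraint_satisfaction_rate(output: str, constraints: Sequence[Constraint]) -> float:
    """Fraction of constraints the output satisfies (1.0 when there are none)."""
    if not constraints:
        return 1.0
    passed = sum(1 for check in constraints if check(output))
    return passed / len(constraints)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tone_score(output_embedding: Sequence[float],
               golden_embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between the output and each golden example."""
    if not golden_embeddings:
        raise ValueError("at least one golden example is required")
    total = sum(cosine_similarity(output_embedding, g) for g in golden_embeddings)
    return total / len(golden_embeddings)
```

Averaging over the golden examples is one possible aggregation; taking the maximum similarity instead would reward matching any single approved voice rather than the set as a whole.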
Pipeline Design
Build a three-stage pipeline: curation (seed prompts with expected outputs), inference (batch generation from each candidate model), and judgment (LLM-as-judge with calibrated rubrics). Store results in a versioned dataset for trend analysis.
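The three stages can be sketched as a single loop. This is an illustrative outline under stated assumptions: models and the judge are passed in as plain callables, `exact_match_judge` is a trivial stand-in for a real LLM-as-judge call, and all names (`SeedCase`, `EvalRecord`, `run_pipeline`) are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass
class SeedCase:
    """Curation stage: one seed prompt with its expected output."""
    prompt: str
    expected: str


@dataclass
class EvalRecord:
    """One judged output, ready to be stored in the results dataset."""
    model: str
    prompt: str
    expected: str
    output: str
    score: float


# Assumed interfaces: a model maps a prompt to text; a judge maps
# (prompt, expected, output) to a score in [0, 1].
Model = Callable[[str], str]
Judge = Callable[[str, str, str], float]


def exact_match_judge(prompt: str, expected: str, output: str) -> float:
    """Placeholder judge; a real setup would call an LLM with a rubric here."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_pipeline(cases: List[SeedCase], models: Dict[str, Model],
                 judge: Judge, dataset_version: str) -> dict:
    """Run inference and judgment for every (model, case) pair."""
    records: List[EvalRecord] = []
    for name, model in models.items():
        for case in cases:
            output = model(case.prompt)                         # inference stage
            score = judge(case.prompt, case.expected, output)   # judgment stage
            records.append(EvalRecord(name, case.prompt, case.expected, output, score))
    # Tag the result set with a version so runs can be compared over time.
    return {"version": dataset_version, "records": [asdict(r) for r in records]}
```

Returning plain dictionaries keeps the result easy to serialize, for example with `json.dump`, into whatever versioned store is in use.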
Quick Start
- Define 50–100 seed prompts covering happy-path, edge-case, and adversarial scenarios.
- Annotate ground-truth references and acceptable tolerance bands.
- Run candidate models through the eval harness and collect raw scores.
- Review disagreements manually; refine rubrics iteratively.
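The manual review step is easier when disagreements are surfaced automatically. The sketch below assumes each case carries a human reference score, an automated judge score, and a per-case tolerance band, all on the same scale. The `ScoredCase` structure and `find_disagreements` helper are hypothetical names, not part of any particular harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredCase:
    case_id: str
    human_score: float   # ground-truth annotation
    judge_score: float   # automated score from the eval harness
    tolerance: float     # acceptable absolute gap between the two


def find_disagreements(cases: List[ScoredCase]) -> List[ScoredCase]:
    """Return cases whose score gap exceeds their tolerance, largest gap first."""
    flagged = [c for c in cases if abs(c.human_score - c.judge_score) > c.tolerance]
    return sorted(flagged, key=lambda c: abs(c.human_score - c.judge_score), reverse=True)
```

Sorting by gap size puts the worst mismatches at the top of the review queue, which is usually where rubric wording needs the most attention.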