Recipe

AI Evaluation Framework

A systematic approach to measuring LLM output quality across accuracy, safety, and alignment dimensions.

Overview

Evaluating AI outputs requires more than spot-checking. This framework defines repeatable metrics, human-in-the-loop scoring rubrics, and automated regression suites that catch drift before it reaches production.

Core Metrics

Factual Accuracy — groundedness against a verified knowledge base, measured via entailment scoring.
Instruction Adherence — constraint satisfaction rate across explicit and implicit directives.
Safety Refusal — correct boundary enforcement on adversarial prompts without over-refusal.
Tone & Style — brand voice consistency scored via embedding similarity to golden examples.
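As a rough illustration of how two of these metrics could be computed, the sketch below scores Instruction Adherence as the fraction of passed constraint checks and Tone & Style as the mean cosine similarity between a candidate embedding and a set of golden-example embeddings. The function names are hypothetical, and the embeddings are assumed to come from whatever embedding model you already use; nothing here is prescribed by the framework.

```python
import math
from typing import Callable, Sequence


def constraint_satisfaction_rate(
    output: str, checks: Sequence[Callable[[str], bool]]
) -> float:
    """Fraction of constraint checks that the output passes (0.0 to 1.0)."""
    if not checks:
        raise ValueError("at least one check is required")
    return sum(1 for check in checks if check(output)) / len(checks)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_score(
    candidate: Sequence[float], golden: Sequence[Sequence[float]]
) -> float:
    """Mean cosine similarity between a candidate and the golden embeddings."""
    if not golden:
        raise ValueError("at least one golden example is required")
    return sum(cosine_similarity(candidate, g) for g in golden) / len(golden)
```

Each constraint check is just a predicate on the output string, for example `lambda s: len(s.split()) <= 100` for a word limit, so explicit directives can be added to the check list without changing the scoring code.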

Pipeline Design

Build a three-stage pipeline: curation (seed prompts paired with expected outputs), inference (batch runs of each candidate model over the curated prompts), and judgment (LLM-as-judge scoring against calibrated rubrics). Store results in a versioned dataset for trend analysis.
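The three stages can be sketched as plain functions wired together. This is a minimal sketch under stated assumptions: `EvalCase`, `EvalResult`, `run_pipeline`, and `to_jsonl` are hypothetical names, the model and judge are stand-in callables (a real judge would call an LLM with a rubric), and JSON Lines with a `dataset_version` field is only one possible way to keep results versioned.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """Curation stage: one seed prompt with its expected output."""
    prompt: str
    expected: str


@dataclass
class EvalResult:
    """One judged output from one candidate model."""
    model: str
    prompt: str
    expected: str
    output: str
    score: float


def run_pipeline(
    cases: List[EvalCase],
    models: Dict[str, Callable[[str], str]],
    judge: Callable[[str, str, str], float],
) -> List[EvalResult]:
    """Run every candidate model over every case, then score each output."""
    results = []
    for name, generate in models.items():
        for case in cases:
            output = generate(case.prompt)  # inference stage
            score = judge(case.prompt, case.expected, output)  # judgment stage
            results.append(
                EvalResult(name, case.prompt, case.expected, output, score)
            )
    return results


def to_jsonl(results: List[EvalResult], dataset_version: str) -> str:
    """Serialize results as JSON Lines, tagging each row with a version."""
    return "\n".join(
        json.dumps({"dataset_version": dataset_version, **asdict(r)})
        for r in results
    )


# Stand-ins for illustration only; replace with real model and judge calls.
def echo_model(prompt: str) -> str:
    return prompt.upper()


def exact_match_judge(prompt: str, expected: str, output: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0
```

Keeping the model and judge as injected callables means the same harness can compare several candidate models, or several rubric versions, without changes to the loop itself.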

Quick Start

Define 50–100 seed prompts covering happy-path, edge-case, and adversarial scenarios.
Annotate ground-truth references and acceptable tolerance bands.
Run candidate models through the eval harness and collect raw scores.
Review disagreements manually; refine rubrics iteratively.
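The last two steps above can be sketched as a tolerance check plus a filter that queues judge/human disagreements for manual review. The names `ScoredItem`, `within_tolerance`, and `find_disagreements` are hypothetical, and the sketch assumes scores are normalized to a shared 0 to 1 scale, which the recipe does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredItem:
    """One prompt scored by both the automated judge and a human annotator."""
    prompt_id: str
    judge_score: float
    human_score: float


def within_tolerance(score: float, reference: float, tolerance: float) -> bool:
    """True if the score falls inside the band reference +/- tolerance."""
    return abs(score - reference) <= tolerance


def find_disagreements(
    items: List[ScoredItem], tolerance: float
) -> List[ScoredItem]:
    """Items whose judge score falls outside the band around the human score."""
    return [
        item
        for item in items
        if not within_tolerance(item.judge_score, item.human_score, tolerance)
    ]
```

A high share of flagged items is a hint that the rubric, rather than the model, needs another iteration.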
