Meridian
Recipe

AI Evaluation Framework

A systematic approach to measuring LLM output quality across accuracy, safety, and alignment dimensions.

Overview

Evaluating AI outputs requires more than spot-checking. This framework defines repeatable metrics, human-in-the-loop scoring rubrics, and automated regression suites that catch drift before it reaches production.
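As a rough illustration of the regression idea, the sketch below compares per-metric scores from a current run against a stored baseline and lists the metrics that dropped by more than an allowed margin. The function name, the `max_drop` threshold, the metric keys, and all numbers are hypothetical placeholders, not values from this framework.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.02,  # assumed margin; tune per metric in practice
) -> List[str]:
    """Return metrics whose score fell by more than max_drop versus baseline."""
    regressions = []
    for metric, base_score in baseline.items():
        if metric not in current:
            # A metric missing from the current run is treated as a regression.
            regressions.append(metric)
        elif base_score - current[metric] > max_drop:
            regressions.append(metric)
    return regressions


# Made-up scores for illustration only.
baseline_scores = {"factual_accuracy": 0.91, "instruction_adherence": 0.88, "safety_refusal": 0.95}
current_scores = {"factual_accuracy": 0.92, "instruction_adherence": 0.80, "safety_refusal": 0.94}

regressed = find_regressions(baseline_scores, current_scores)
print(regressed)
```

A check like this could run on every candidate model so that a drop is flagged before release rather than after.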

Core Metrics

  • Factual Accuracy — groundedness against a verified knowledge base, measured via entailment scoring.
  • Instruction Adherence — constraint satisfaction rate across explicit and implicit directives.
  • Safety Refusal — correct boundary enforcement on adversarial prompts without over-refusal.
  • Tone & Style — brand voice consistency scored via embedding similarity to golden examples.
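Two of the metrics above can be sketched in a few lines. The example below is a minimal illustration, assuming that constraints are expressed as boolean check functions and that embeddings are already available as plain lists of floats; the function names and the toy vectors are hypothetical, and a real setup would obtain embeddings from an embedding model.

```python
import math
from typing import Callable, Sequence


def adherence_rate(output: str, constraints: Sequence[Callable[[str], bool]]) -> float:
    """Fraction of constraint checks the output passes (1.0 if there are none)."""
    if not constraints:
        return 1.0
    passed = sum(1 for check in constraints if check(output))
    return passed / len(constraints)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_score(candidate: Sequence[float], golden: Sequence[Sequence[float]]) -> float:
    """Highest similarity between the candidate and any golden example."""
    if not golden:
        raise ValueError("at least one golden example is required")
    return max(cosine_similarity(candidate, g) for g in golden)


# Hypothetical constraints: length limit, closing period, no apology.
constraints = [
    lambda s: len(s.split()) <= 10,
    lambda s: s.endswith("."),
    lambda s: "sorry" not in s.lower(),
]

full_pass = adherence_rate("The invoice is attached.", constraints)
partial_pass = adherence_rate("Sorry, I cannot help", constraints)
toy_style = style_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Taking the maximum over golden examples is one possible aggregation; averaging is an equally reasonable choice when the golden set is meant to be matched as a whole.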

Pipeline Design

Build a three-stage pipeline: curation (seed prompts + expected outputs), inference (batch evaluation against candidate models), and judgment (LLM-as-judge with calibrated rubrics). Store results in a versioned dataset for trend analysis.
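The three stages can be sketched as follows. This is a minimal outline under stated assumptions: the model and the judge are passed in as plain callables, the exact-match judge is only a stand-in for an LLM-as-judge call, and every name here (`SeedPrompt`, `EvalRecord`, `run_pipeline`, the version string) is hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class SeedPrompt:
    """Curation stage: a prompt paired with its expected output."""
    prompt: str
    expected: str


@dataclass
class EvalRecord:
    """One scored result, tagged with a dataset version for trend analysis."""
    dataset_version: str
    model_name: str
    prompt: str
    expected: str
    output: str
    score: float


def run_pipeline(
    seeds: List[SeedPrompt],
    model_name: str,
    model_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], float],
    dataset_version: str,
) -> List[EvalRecord]:
    records = []
    for seed in seeds:
        output = model_fn(seed.prompt)                        # inference stage
        score = judge_fn(seed.prompt, seed.expected, output)  # judgment stage
        records.append(
            EvalRecord(dataset_version, model_name, seed.prompt, seed.expected, output, score)
        )
    return records


def to_jsonl(records: List[EvalRecord]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


# Stand-ins for a real candidate model and a real LLM judge.
def echo_model(prompt: str) -> str:
    return prompt.upper()


def exact_match_judge(prompt: str, expected: str, output: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


seeds = [SeedPrompt("hello", "HELLO"), SeedPrompt("bye", "goodbye")]
records = run_pipeline(seeds, "echo-v0", echo_model, exact_match_judge, "v1")
jsonl = to_jsonl(records)
```

Writing one JSON line per record keeps the results append-friendly, and the `dataset_version` field lets later runs be compared against earlier ones.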

Quick Start

  1. Define 50–100 seed prompts covering happy-path, edge-case, and adversarial scenarios.
  2. Annotate each prompt with a ground-truth reference and the tolerance band of scores considered acceptable.
  3. Run candidate models through the eval harness and collect raw scores.
  4. Manually review cases where scores disagree (for example, automated judge versus human rater); refine the rubrics iteratively.
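Steps 2 through 4 can be sketched as a small triage pass. The example below assumes a tolerance band is a lower and upper bound on the acceptable score and that a disagreement means a large gap between an automated judge score and a human score; both interpretations, the `max_gap` threshold, the case IDs, and the scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredCase:
    case_id: str
    band_low: float     # lower bound of the acceptable score band
    band_high: float    # upper bound of the acceptable score band
    judge_score: float  # automated score from the eval harness
    human_score: float  # score from a human reviewer


def within_band(case: ScoredCase) -> bool:
    """True if the automated score falls inside the annotated tolerance band."""
    return case.band_low <= case.judge_score <= case.band_high


def needs_review(case: ScoredCase, max_gap: float = 0.2) -> bool:
    """True if judge and human scores differ by more than max_gap."""
    return abs(case.judge_score - case.human_score) > max_gap


# Made-up cases for illustration only.
cases: List[ScoredCase] = [
    ScoredCase("happy-01", 0.8, 1.0, judge_score=0.9, human_score=0.85),
    ScoredCase("edge-07", 0.6, 1.0, judge_score=0.9, human_score=0.4),
    ScoredCase("adv-03", 0.9, 1.0, judge_score=0.5, human_score=0.55),
]

failures = [c.case_id for c in cases if not within_band(c)]
review_queue = [c.case_id for c in cases if needs_review(c)]
print("out of band:", failures)
print("manual review:", review_queue)
```

Keeping the two checks separate is a deliberate choice in this sketch: an out-of-band score points at the model, while a judge-versus-human gap points at the rubric.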