Eval dataset design

How Meridian structures evaluation datasets for recipe generation benchmarks.

Dataset anatomy

Each eval record pairs a natural-language prompt with a ground-truth recipe and a set of scoring rubrics. Prompts span ingredient-only, cuisine-targeted, dietary-restricted, and multi-constraint categories to stress-test generation fidelity.
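As an illustration of the record shape described above, here is a minimal sketch in Python. The class and field names (EvalRecord, PromptCategory, ground_truth_recipe, rubrics) are hypothetical and chosen only for this example; the actual Meridian schema may differ.

```python
from dataclasses import dataclass
from enum import Enum


class PromptCategory(Enum):
    """The four prompt categories named in the text (identifiers are hypothetical)."""
    INGREDIENT_ONLY = "ingredient-only"
    CUISINE_TARGETED = "cuisine-targeted"
    DIETARY_RESTRICTED = "dietary-restricted"
    MULTI_CONSTRAINT = "multi-constraint"


@dataclass(frozen=True)
class EvalRecord:
    """One eval record: a prompt, its ground-truth recipe, and its rubrics."""
    prompt: str                     # natural-language prompt
    category: PromptCategory        # which prompt category the record belongs to
    ground_truth_recipe: str        # reference recipe (assumed here to be plain text)
    rubrics: tuple[str, ...] = ()   # names of the rubric dimensions applied to this record
```

Freezing the dataclass is a design choice for this sketch: records cannot be mutated by accident once loaded for an eval run.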

Rubric dimensions

Ingredient coverage: fraction of ground-truth ingredients present in the output.
Step ordering: whether the preparation steps in the output appear in a valid topological order of the ground-truth step dependencies.
Constraint satisfaction: adherence to dietary, allergy, or cuisine constraints.
Quantitative accuracy: correctness of the unit and magnitude given for each ingredient.
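Two of these dimensions can be sketched in Python as follows. This is an illustrative reading, not Meridian's scorer: the function names are hypothetical, ingredient matching is assumed to be a case-insensitive exact match on names, and step dependencies are assumed to be given as (before, after) pairs.

```python
def ingredient_coverage(truth: list[str], output: list[str]) -> float:
    """Fraction of ground-truth ingredients that also appear in the output."""
    # Assumption: names match after trimming whitespace and lowercasing.
    truth_set = {name.strip().lower() for name in truth}
    if not truth_set:
        # Convention for this sketch: nothing to cover counts as full coverage.
        return 1.0
    output_set = {name.strip().lower() for name in output}
    return len(truth_set & output_set) / len(truth_set)


def respects_step_order(
    output_steps: list[str],
    must_precede: list[tuple[str, str]],
) -> bool:
    """True if every (before, after) dependency holds in the output order."""
    # Map each step to its position; a repeated step keeps its last position.
    position = {step: index for index, step in enumerate(output_steps)}
    for before, after in must_precede:
        if before not in position or after not in position:
            return False  # a step required by a dependency is missing
        if position[before] >= position[after]:
            return False  # dependency violated
    return True
```

For example, an output containing flour and milk against a ground truth of flour, eggs, milk, and butter covers two of four ingredients, so its coverage is 0.5.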

Stratification

Datasets are stratified by cuisine region, difficulty tier, and prompt complexity. Each stratum contains at least 50 records so that per-stratum scores are not driven by a handful of examples. Held-out strata are reserved for gating the final model release.
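A check for the 50-record minimum could look like the sketch below. It assumes records are dictionaries with the hypothetical keys cuisine_region, difficulty_tier, and prompt_complexity; only the three stratification axes and the threshold come from the text.

```python
from collections import Counter

MIN_RECORDS_PER_STRATUM = 50  # minimum stated in the text


def undersized_strata(
    records: list[dict],
    min_records: int = MIN_RECORDS_PER_STRATUM,
) -> dict[tuple[str, str, str], int]:
    """Return each stratum that has fewer than min_records, with its count."""
    # A stratum is the combination of the three stratification axes.
    counts = Counter(
        (r["cuisine_region"], r["difficulty_tier"], r["prompt_complexity"])
        for r in records
    )
    # Strata with zero records never appear in the data, so they are not
    # reported here; detecting them needs an explicit list of expected strata.
    return {stratum: n for stratum, n in counts.items() if n < min_records}
```

An empty result means every stratum present in the data meets the minimum.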

Versioning

Datasets follow semantic versioning (semver). A major bump indicates rubric changes or ground-truth rewrites. A minor bump adds records without altering existing ones. A patch bump fixes typos or normalizes units. Every eval run pins an exact dataset version for reproducibility.
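The versioning rules can be sketched in Python as below. The helper names parse_version and bump_kind are hypothetical. The sketch assumes plain MAJOR.MINOR.PATCH strings with no pre-release or build suffixes, and it rejects ranges so that a pin is always exact.

```python
import re

# Exactly three dot-separated non-negative integers without leading zeros.
_EXACT_VERSION = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse an exact version such as '2.3.1'; reject ranges like '^2.3.0'."""
    match = _EXACT_VERSION.fullmatch(text)
    if match is None:
        raise ValueError(f"not an exact MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from old to new as 'major', 'minor', or 'patch'."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new!r} is not newer than {old!r}")
    if new_v[0] != old_v[0]:
        return "major"  # rubric changes or ground-truth rewrites
    if new_v[1] != old_v[1]:
        return "minor"  # records added, existing ones untouched
    return "patch"      # typo fixes or unit normalization
```

Under these rules, scores from runs pinned to versions that differ only in the minor component still share every record of the older version, while a major bump signals that scores may not be comparable.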
