Eval dataset design
How Meridian structures evaluation datasets for recipe generation benchmarks.
Dataset anatomy
Each eval record pairs a natural-language prompt with a ground-truth recipe and a set of scoring rubrics. Prompts span ingredient-only, cuisine-targeted, dietary-restricted, and multi-constraint categories to stress-test generation fidelity.
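As a rough illustration of the record shape described above, the sketch below models one eval record with Python dataclasses. The class and field names (`EvalRecord`, `Recipe`, `Ingredient`, `PromptCategory`, `record_id`, `rubrics`) are hypothetical and chosen for this example only; the actual Meridian schema may differ.

```python
from dataclasses import dataclass
from enum import Enum


class PromptCategory(Enum):
    # The four prompt categories named in the text.
    INGREDIENT_ONLY = "ingredient-only"
    CUISINE_TARGETED = "cuisine-targeted"
    DIETARY_RESTRICTED = "dietary-restricted"
    MULTI_CONSTRAINT = "multi-constraint"


@dataclass(frozen=True)
class Ingredient:
    name: str
    quantity: float
    unit: str


@dataclass(frozen=True)
class Recipe:
    ingredients: tuple[Ingredient, ...]
    steps: tuple[str, ...]


@dataclass(frozen=True)
class EvalRecord:
    # Hypothetical layout: one prompt, one ground-truth recipe,
    # and the names of the rubrics used to score model output.
    record_id: str
    prompt: str
    category: PromptCategory
    ground_truth: Recipe
    rubrics: tuple[str, ...]


record = EvalRecord(
    record_id="example-0001",
    prompt="Suggest a vegan weeknight dinner using chickpeas.",
    category=PromptCategory.DIETARY_RESTRICTED,
    ground_truth=Recipe(
        ingredients=(
            Ingredient("chickpeas", 400.0, "g"),
            Ingredient("olive oil", 15.0, "ml"),
        ),
        steps=("Drain the chickpeas.", "Saute in olive oil."),
    ),
    rubrics=("ingredient_coverage", "step_ordering", "constraint_satisfaction"),
)
```

Frozen dataclasses are used here so a record cannot be mutated after loading, which fits the goal of reproducible eval runs.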
Rubric dimensions
- Ingredient coverage — fraction of ground-truth ingredients present in output.
- Step ordering — topological sort correctness of preparation steps.
- Constraint satisfaction — adherence to dietary, allergy, or cuisine constraints.
- Quantitative accuracy — unit and magnitude correctness for each ingredient.
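Two of the dimensions above can be sketched in a few lines. This is a minimal illustration, not Meridian's scorer: it assumes ingredients are compared by case-insensitive name, and that step-order constraints are supplied as hypothetical `(before, after)` pairs, with the score being the fraction of pairs the output respects.

```python
def ingredient_coverage(truth: set[str], output: set[str]) -> float:
    """Fraction of ground-truth ingredients that appear in the output."""
    if not truth:
        return 1.0  # nothing to cover
    truth_norm = {name.strip().lower() for name in truth}
    output_norm = {name.strip().lower() for name in output}
    return len(truth_norm & output_norm) / len(truth_norm)


def step_order_score(
    dependencies: list[tuple[str, str]], output_steps: list[str]
) -> float:
    """Fraction of (before, after) dependencies respected by the output order.

    A dependency counts as satisfied only if both steps are present
    and `before` appears earlier than `after`.
    """
    if not dependencies:
        return 1.0
    position = {step: index for index, step in enumerate(output_steps)}
    satisfied = sum(
        1
        for before, after in dependencies
        if before in position
        and after in position
        and position[before] < position[after]
    )
    return satisfied / len(dependencies)


coverage = ingredient_coverage(
    {"flour", "egg", "milk", "salt"}, {"Flour", "egg", "sugar"}
)
ordering = step_order_score(
    [("chop", "saute"), ("saute", "simmer")], ["chop", "simmer", "saute"]
)
```

In this example `coverage` is 0.5 (two of four ingredients matched) and `ordering` is 0.5 (one of two dependencies respected). Exact name matching is a simplification; a real scorer would likely need synonym and unit handling.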
Stratification
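Datasets are stratified by cuisine region, difficulty tier, and prompt complexity. Each stratum contains at least 50 records so that per-stratum scores are not driven by a handful of examples; this floor reduces noise but does not by itself guarantee statistically significant comparisons. Held-out strata are reserved for gating the final model release.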
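A check for the per-stratum minimum could look like the sketch below. It assumes a stratum is the combination of all three attributes and that records are dicts with hypothetical keys `cuisine_region`, `difficulty_tier`, and `prompt_complexity`; if strata are defined per attribute instead, the grouping key would change.

```python
from collections import Counter

MIN_RECORDS_PER_STRATUM = 50  # minimum stated in the text


def undersized_strata(
    records: list[dict], min_records: int = MIN_RECORDS_PER_STRATUM
) -> dict[tuple[str, str, str], int]:
    """Return each stratum that has fewer than `min_records` records.

    Strata with zero records never appear in the input, so they are not
    reported here; detecting them requires a separate list of expected strata.
    """
    counts = Counter(
        (r["cuisine_region"], r["difficulty_tier"], r["prompt_complexity"])
        for r in records
    )
    return {stratum: n for stratum, n in counts.items() if n < min_records}


# Illustrative data only: one stratum at the minimum, one below it.
sample = [
    {"cuisine_region": "south-asia", "difficulty_tier": "easy", "prompt_complexity": "single"}
] * 50 + [
    {"cuisine_region": "east-asia", "difficulty_tier": "hard", "prompt_complexity": "multi"}
] * 3

too_small = undersized_strata(sample)
```

Here `too_small` contains only the `("east-asia", "hard", "multi")` stratum with a count of 3.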
Versioning
Datasets follow semver. Major bumps indicate rubric changes or ground-truth rewrites. Minor bumps add records without altering existing ones. Patch bumps fix typos or normalize units. Every eval run pins an exact dataset version for reproducibility.
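The pinning rule and the bump categories can be sketched as follows. The helper names `parse_pinned` and `bump_kind` are hypothetical, and the parser deliberately accepts only a bare `MAJOR.MINOR.PATCH` string, rejecting ranges or aliases, as one possible reading of "pins an exact dataset version".

```python
import re

# Exact MAJOR.MINOR.PATCH only; no ranges, prefixes, or aliases.
_EXACT_VERSION = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_pinned(version: str) -> tuple[int, int, int]:
    """Parse an exact dataset version, raising ValueError for anything else."""
    match = _EXACT_VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not an exact dataset version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from `old` to `new` as major, minor, or patch."""
    old_v, new_v = parse_pinned(old), parse_pinned(new)
    if new_v <= old_v:
        raise ValueError("new version must be greater than old version")
    if new_v[0] != old_v[0]:
        return "major"  # rubric changes or ground-truth rewrites
    if new_v[1] != old_v[1]:
        return "minor"  # records added, existing ones unchanged
    return "patch"  # typo fixes or unit normalization


pinned = parse_pinned("2.1.0")
kind = bump_kind("1.4.2", "2.0.0")
```

With these inputs, `pinned` is `(2, 1, 0)` and `kind` is `"major"`. Strings such as `"^1.4"` or `"latest"` raise `ValueError`, so an eval run configured with them fails early.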