Recipe: Offline model eval pipeline
Run deterministic benchmarks against frozen checkpoints without touching production traffic.
Prerequisites
- Meridian CLI v2.4+ installed and authenticated
- At least one frozen checkpoint in the model registry
- A labeled eval dataset in JSONL or Parquet format
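Before registering a JSONL dataset, it can help to confirm that the file holds one record per line. The sketch below is a shallow sanity check in plain bash, not a full JSON validation; a JSON-aware tool would be stricter. The field names `input` and `label` in the sample records are assumptions for illustration, not a schema documented by Meridian, and `check_jsonl` is a hypothetical helper.

```shell
# Shallow check: every non-empty line must start with "{" and end with "}".
# Prints the number of records on success; returns 1 at the first bad line.
check_jsonl() {
  local file="$1" line n=0 count=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    [ -z "$line" ] && continue
    case "$line" in
      "{"*"}") count=$((count + 1)) ;;
      *) echo "line $n does not look like a JSON object" >&2; return 1 ;;
    esac
  done < "$file"
  echo "$count"
}

# Hypothetical two-record sample; field names are assumed, not documented.
sample="$(mktemp)"
printf '%s\n' \
  '{"input": "have a nice day", "label": "non_toxic"}' \
  '{"input": "placeholder text", "label": "toxic"}' > "$sample"

check_jsonl "$sample"
```

A non-zero exit status from `check_jsonl` is a signal to fix the file before running Step 1.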
Step 1 — Register the eval dataset
meridian eval register \
--name toxicity-v3 \
--path s3://datasets/toxicity-v3.jsonl \
--metric exact_match
Step 2 — Run the pipeline
meridian eval run \
--checkpoint meridian-7b-instruct@ckpt-142 \
--dataset toxicity-v3 \
--output s3://results/toxicity-ckpt142.json
The pipeline spins up an isolated inference pod, streams the dataset through it, and writes scored results.
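Since every run takes its own checkpoint and output path, evaluating several frozen checkpoints comes down to repeating the same command. The bash sketch below prints one `meridian eval run` command per checkpoint instead of executing it, so the plan can be reviewed first. The helpers `result_uri` and `plan_runs` are hypothetical, the output naming is inferred from the example in this step rather than enforced by the CLI, and `ckpt-141` is a made-up checkpoint used only for illustration.

```shell
# Derive a result URI that follows the naming pattern shown in Step 2
# (an inferred convention, not a rule of the Meridian CLI).
result_uri() {
  local checkpoint="$1" dataset="$2"
  local tag="${checkpoint##*@}"    # text after "@", e.g. "ckpt-142"
  local family="${dataset%-v*}"    # dataset name without "-vN", e.g. "toxicity"
  echo "s3://results/${family}-${tag//-/}.json"
}

# Print (do not run) one eval command per checkpoint.
plan_runs() {
  local dataset="$1"; shift
  local ckpt
  for ckpt in "$@"; do
    echo "meridian eval run --checkpoint ${ckpt} --dataset ${dataset} --output $(result_uri "$ckpt" "$dataset")"
  done
}

# ckpt-141 is a hypothetical earlier checkpoint.
plan_runs toxicity-v3 \
  meridian-7b-instruct@ckpt-141 \
  meridian-7b-instruct@ckpt-142
```

Once the printed commands look right, they can be piped to `bash` or pasted into a job definition.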
Step 3 — Inspect results
meridian eval report \
--result s3://results/toxicity-ckpt142.json
Prints per-category accuracy, latency percentiles, and a diff against the previous checkpoint.
Automation
Trigger the pipeline from CI by calling the Meridian API. A webhook fires when the run completes.
curl -X POST https://api.getnimbus.net/v1/eval/run \
-H "Authorization: Bearer $MERIDIAN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"checkpoint":"meridian-7b-instruct@ckpt-142","dataset":"toxicity-v3"}'
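In a CI job the checkpoint and dataset usually arrive as variables, and the step should fail when the API rejects the request. The sketch below wraps the call above in two hypothetical helpers, `eval_payload` and `trigger_eval`. It assumes the identifiers contain no quotes or backslashes and that the endpoint expects a JSON content type; neither assumption is confirmed by the Meridian documentation quoted here. Only the payload is printed; the network call is left commented out.

```shell
# Build the JSON request body. Assumes both values are plain identifiers
# (no quotes or backslashes), like the names used in this recipe.
eval_payload() {
  printf '{"checkpoint":"%s","dataset":"%s"}' "$1" "$2"
}

# POST the payload. --fail makes curl exit non-zero on HTTP errors, so the
# CI step fails visibly. The Content-Type header is an assumption.
trigger_eval() {
  curl --fail --silent --show-error -X POST https://api.getnimbus.net/v1/eval/run \
    -H "Authorization: Bearer $MERIDIAN_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(eval_payload "$1" "$2")"
}

# Preview the body; in CI, replace this line with:
#   trigger_eval "meridian-7b-instruct@ckpt-142" "toxicity-v3"
eval_payload "meridian-7b-instruct@ckpt-142" "toxicity-v3"
```

Keeping the payload builder separate from the request makes it possible to check the body in a dry run without a token.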