Recipe: Offline model eval pipeline

Run deterministic benchmarks against frozen checkpoints without touching production traffic.

Prerequisites

Meridian CLI v2.4+ installed and authenticated
At least one frozen checkpoint in the model registry
A labeled eval dataset in JSONL or Parquet format
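
Before registering a JSONL dataset, a quick structural check on a local copy can catch blank or truncated lines. The following is a minimal sketch, not part of the Meridian CLI: the helper name `check_jsonl` is hypothetical, and it only verifies that every line looks like a single JSON object. It does not parse the JSON or validate labels.

```shell
# Sketch: structural sanity check for a local JSONL file.
# Assumption: one JSON object per line; this does not fully parse JSON.
check_jsonl() {
  awk '
    # Blank lines are not valid JSONL records.
    /^[ \t]*$/ { printf "line %d: blank\n", NR; bad = 1; next }
    # Every record should start with "{" and end with "}".
    !/^[ \t]*[{].*[}][ \t]*$/ { printf "line %d: not a JSON object\n", NR; bad = 1 }
    # Exit non-zero if any line was flagged.
    END { exit bad + 0 }
  ' "$1"
}
```

Calling `check_jsonl toxicity-v3.jsonl` prints one message per suspicious line and returns a non-zero status if any were found, so it can gate the register step in a script.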

Step 1 — Register the eval dataset

meridian eval register \
  --name toxicity-v3 \
  --path s3://datasets/toxicity-v3.jsonl \
  --metric exact_match
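
As an illustration of what an exact-match metric measures, the sketch below computes the fraction of rows whose prediction equals the label character for character. It is an assumption-laden stand-in, not Meridian's scorer: the helper name `exact_match` and the tab-separated "prediction, label" input format are hypothetical, and Meridian may normalize case or whitespace before comparing.

```shell
# Sketch: exact-match accuracy over rows of "prediction<TAB>label".
# Assumption: strict string equality, no normalization.
exact_match() {
  awk -F '\t' '
    # Appending "" forces a string comparison, so "1.0" and "1" do not match.
    { total++; if (($1 "") == ($2 "")) hits++ }
    END { if (total) printf "%.3f\n", hits / total; else print "n/a" }
  ' "$1"
}
```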

Step 2 — Run the pipeline

meridian eval run \
  --checkpoint meridian-7b-instruct@ckpt-142 \
  --dataset toxicity-v3 \
  --output s3://results/toxicity-ckpt142.json

The pipeline starts an isolated inference pod, streams the dataset through it, and writes the scored results to the --output location.
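
To compare several frozen checkpoints on the same dataset, the run command can be wrapped in a loop. This is a minimal sketch using only the flags shown in this recipe; the helper name `eval_checkpoints`, the `MERIDIAN_BIN` override, and the result-file naming scheme are hypothetical choices for illustration.

```shell
# Sketch: run one registered dataset against several checkpoints.
# Assumptions: helper name, MERIDIAN_BIN override, and output naming
# are illustrative and not defined by the Meridian CLI.
eval_checkpoints() {
  dataset=$1      # registered dataset name, e.g. toxicity-v3
  out_prefix=$2   # result location prefix, e.g. s3://results
  shift 2
  for ckpt in "$@"; do
    # Keep the part after "@": meridian-7b-instruct@ckpt-142 -> ckpt-142
    tag=${ckpt##*@}
    "${MERIDIAN_BIN:-meridian}" eval run \
      --checkpoint "$ckpt" \
      --dataset "$dataset" \
      --output "$out_prefix/$dataset-$tag.json" || return 1
  done
}
```

For example, `eval_checkpoints toxicity-v3 s3://results meridian-7b-instruct@ckpt-141 meridian-7b-instruct@ckpt-142` would run the two checkpoints in sequence and stop at the first failed run.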

Step 3 — Inspect results

meridian eval report \
  --result s3://results/toxicity-ckpt142.json

The report prints per-category accuracy, latency percentiles, and a diff against the previous checkpoint.

Automation

Trigger the pipeline from CI by calling the Meridian API. A webhook fires when the run completes.

curl -X POST https://api.getnimbus.net/v1/eval/run \
  -H "Authorization: Bearer $MERIDIAN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"checkpoint":"meridian-7b-instruct@ckpt-142","dataset":"toxicity-v3"}'
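
In a CI job the checkpoint and dataset usually come from variables rather than a hard-coded body. The sketch below builds the same JSON payload from arguments and makes the job fail on an HTTP error. The helper names `build_eval_payload` and `trigger_eval` are hypothetical, and the sketch assumes that the endpoint accepts a JSON content type and that the values contain no characters that need JSON escaping.

```shell
# Sketch: trigger an eval run from CI.
# Assumptions: helper names are illustrative; arguments contain no
# quotes or backslashes, so no JSON escaping is performed.
build_eval_payload() {
  printf '{"checkpoint":"%s","dataset":"%s"}' "$1" "$2"
}

trigger_eval() {
  # --fail makes curl exit non-zero on HTTP 4xx/5xx so the CI step fails.
  curl --fail --silent --show-error -X POST https://api.getnimbus.net/v1/eval/run \
    -H "Authorization: Bearer $MERIDIAN_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(build_eval_payload "$1" "$2")"
}
```

A CI step could then call `trigger_eval meridian-7b-instruct@ckpt-142 toxicity-v3` and rely on the exit status to mark the step as passed or failed.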

Next: Recipe: Shadow deploy to production