Recipe

LLM Drift Detection

Production LLM outputs degrade silently. A model that scored 94% on your eval suite last month may quietly slip to 78% after a provider-side weight update, a prompt template tweak, or a distribution shift in user inputs. This recipe wires Meridian into your inference pipeline so drift is caught in hours, not in the next quarterly review.

1. Capture a baseline distribution

Before you can detect drift, you need a fixed reference. Sample 500 to 2000 production prompts from a known-good week, then score each completion with your eval rubric. Persist the prompts, completions, and scores as your baseline manifest. Meridian stores this as a versioned artifact so you can re-run the same eval against any future model snapshot.

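import json

from meridian import Baseline, score

def load_prompts(path):
    # Hypothetical helper, not part of Meridian: assumes one JSON object per line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Create a named baseline from a known-good week of production prompts.
baseline = Baseline.create(
    name="support-bot-v3",
    model="azure/model-router",
    prompts=load_prompts("week_of_2026_06_01.jsonl"),
    rubric="rubrics/helpfulness_v2.yaml",
)
# Freeze the baseline so later runs compare against a fixed reference.
baseline.freeze()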
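
The sampling and persistence step can also be sketched without Meridian. The following is a minimal illustration under stated assumptions: `sample_prompts` and `to_jsonl` are hypothetical helpers, and the one-object-per-line layout with `prompt`, `completion`, and `score` keys is an assumed manifest shape, not Meridian's storage format.

```python
import json
import random


def sample_prompts(prompts, k=1000, seed=0):
    """Draw a reproducible random sample of at most k prompts."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return rng.sample(prompts, min(k, len(prompts)))


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(rec, ensure_ascii=False) + "\n" for rec in records)


if __name__ == "__main__":
    pool = [f"prompt {i}" for i in range(5000)]
    sampled = sample_prompts(pool, k=1000, seed=42)
    # Completions and scores are placeholders here; in practice they come
    # from the model and the eval rubric.
    records = [{"prompt": p, "completion": "", "score": None} for p in sampled]
    print(to_jsonl(records[:2]))
```

Seeding the sampler keeps the baseline reproducible: anyone with the same prompt pool and seed draws the same sample.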

2. Continuously re-score against the baseline

Schedule a recurring job that replays the baseline prompts against the live model and compares distributional metrics: mean rubric score, score variance, refusal rate, and average completion length. Meridian computes a two-sample Kolmogorov-Smirnov (KS) statistic between the baseline and live score distributions and surfaces the p-value on your dashboard. An alert fires when the mean score drops by more than one baseline standard deviation, or when the KS p-value falls below 0.01.
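
The alert rule above can be sketched in plain Python. This is a minimal illustration, not Meridian's implementation. The function names `ks_statistic`, `ks_pvalue`, and `should_alert` are hypothetical. The p-value uses the standard asymptotic approximation to the Kolmogorov distribution, which is only reasonable for samples of a few dozen scores or more, and the standard deviation is assumed to be the sample standard deviation of the baseline scores.

```python
import math
from bisect import bisect_right
from statistics import mean, stdev


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed data points.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the series is numerically indistinguishable from 1 here
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return max(0.0, min(1.0, 2.0 * total))


def should_alert(baseline_scores, live_scores, alpha=0.01):
    """Fire when the mean drops by more than one baseline std dev, or KS p < alpha."""
    drop = mean(baseline_scores) - mean(live_scores)
    d = ks_statistic(baseline_scores, live_scores)
    p = ks_pvalue(d, len(baseline_scores), len(live_scores))
    return drop > stdev(baseline_scores) or p < alpha


if __name__ == "__main__":
    baseline = [0.5 + 0.001 * i for i in range(100)]
    degraded = [s - 0.3 for s in baseline]
    print(should_alert(baseline, baseline))
    print(should_alert(baseline, degraded))
```

The two conditions catch different failures: the mean check flags an overall quality drop, while the KS test also flags a change in the shape of the score distribution even when the mean is unchanged.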

3. Triage and rollback

When an alert fires, the Meridian console shows you the 20 prompts with the largest score regressions, side-by-side baseline and live completions, and the rubric judge's reasoning for each. From there you have three options: roll back to a pinned model snapshot, adjust your prompt template, or accept the new behavior and refreeze the baseline. The whole loop runs without a human in the path until an alert fires.
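
The ranking behind that triage view can be approximated offline. The sketch below is an assumption about how such a ranking might work, not the console's actual logic: `largest_regressions` is a hypothetical helper, and scores are assumed to be available as dictionaries keyed by a prompt ID.

```python
def largest_regressions(baseline_scores, live_scores, top_n=20):
    """Return (prompt_id, baseline, live, delta) rows, biggest drop first."""
    rows = []
    for prompt_id, base in baseline_scores.items():
        if prompt_id not in live_scores:
            continue  # skip prompts that were not replayed
        delta = live_scores[prompt_id] - base
        if delta < 0:  # keep regressions only
            rows.append((prompt_id, base, live_scores[prompt_id], delta))
    rows.sort(key=lambda row: row[3])  # most negative delta first
    return rows[:top_n]


if __name__ == "__main__":
    baseline = {"p1": 0.9, "p2": 0.8, "p3": 0.7}
    live = {"p1": 0.4, "p2": 0.8, "p3": 0.65}
    for row in largest_regressions(baseline, live):
        print(row)
```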

Ready to wire this up? See the Quickstart guide or jump to the Baselines API reference.