Experimentation

A/B Testing Prompts + Models

Ship faster with data-driven prompt and model selection. Meridian deterministically buckets users, logs every variant, and surfaces statistically significant winners automatically.

Deterministic Bucketing

Every request is assigned a bucket using a stable hash of the user identifier. The same user always lands in the same bucket, ensuring consistent experiences across sessions.

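// Bucket assignment
// hash() stands for a stable hash that returns a non-negative integer,
// so the bucket is always in the range 0..99
const bucket = hash(user_id) % 100;

// Variant routing
let prompt_version, model;
if (bucket < 50) {
  prompt_version = "v2_concise";
  model = "meridian-7b";
} else {
  prompt_version = "v1_detailed";
  model = "meridian-3b";
}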
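The routing snippet leaves `hash` unspecified. The following is a minimal runnable sketch that assumes 32-bit FNV-1a over the UTF-8 bytes of the user ID. This page does not document Meridian's actual hash function, and `fnv1a32` and `assignVariant` are hypothetical names.

```javascript
// Hypothetical sketch: 32-bit FNV-1a is an assumption, not Meridian's documented hash.
function fnv1a32(str) {
  const bytes = new TextEncoder().encode(str); // hash UTF-8 bytes, not UTF-16 code units
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (const b of bytes) {
    h ^= b; // fold in one byte
    h = Math.imul(h, 0x01000193); // multiply by the FNV prime modulo 2^32
  }
  return h >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Route a user to one of the two variants from the example above.
function assignVariant(userId) {
  const bucket = fnv1a32(userId) % 100; // 0..99, since the hash is unsigned
  const variant = bucket < 50
    ? { prompt_version: "v2_concise", model: "meridian-7b" }
    : { prompt_version: "v1_detailed", model: "meridian-3b" };
  return { bucket, ...variant };
}

console.log(assignVariant("user-1234"));
```

The hash depends only on the user ID, so repeated calls with the same ID return the same bucket and the same variant. The `>>> 0` step matters: a signed hash could produce a negative remainder, which would silently push those users into the first branch.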

Variant Logging

Every inference request logs the assigned prompt_version and model alongside the user bucket. These fields are queryable in the dashboard for side-by-side metric comparison.

// Logged per-request fields
{
  "timestamp": "2026-05-26T14:32:01Z",
  "user_bucket": 42,
  "prompt_version": "v2_concise",
  "model": "meridian-7b",
  "latency_ms": 312,
  "tokens": 847
}
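Side-by-side comparison of records like the one above reduces to grouping by the (prompt_version, model) pair and averaging the numeric fields. The following is a minimal sketch that assumes the records are available as plain objects with the field names shown. `summarizeByVariant` is a hypothetical helper, not a Meridian API. The example record carries a bucket but no user ID, so this sketch counts requests rather than distinct users.

```javascript
// Hypothetical helper: groups logged records by variant and averages latency and tokens.
function summarizeByVariant(records) {
  const groups = new Map();
  for (const r of records) {
    const key = `${r.prompt_version} · ${r.model}`; // label format used in the results table
    const g = groups.get(key) ?? { requests: 0, latencySum: 0, tokenSum: 0 };
    g.requests += 1;
    g.latencySum += r.latency_ms;
    g.tokenSum += r.tokens;
    groups.set(key, g);
  }
  const summary = {};
  for (const [key, g] of groups) {
    summary[key] = {
      requests: g.requests,
      avg_latency_ms: g.latencySum / g.requests,
      avg_tokens: g.tokenSum / g.requests,
    };
  }
  return summary;
}

// Made-up records, for illustration only.
const sampleRecords = [
  { prompt_version: "v2_concise", model: "meridian-7b", latency_ms: 300, tokens: 800 },
  { prompt_version: "v1_detailed", model: "meridian-3b", latency_ms: 450, tokens: 1100 },
];
console.log(summarizeByVariant(sampleRecords));
```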

Comparing Metrics

Group results by variant in the dashboard to compare latency, token usage, user satisfaction scores, and conversion rates. Once each variant's sample size crosses the configured threshold, Meridian runs a two-tailed z-test on the experiment's success metric and highlights the winner.

Variant                     Users    Avg latency   Avg tokens   Satisfaction
v2_concise · meridian-7b    12,401   284 ms        612          94.2%
v1_detailed · meridian-3b   12,389   412 ms        1,034        87.1%

Winner: v2_concise + meridian-7b (p < 0.001, +7.1 pp satisfaction)
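This page does not specify which form of z-test Meridian applies. For a binary per-user outcome such as satisfied/not satisfied, a standard choice is the pooled two-proportion z-test. The following sketch assumes satisfaction is recorded as yes/no per user and approximates success counts from the rounded percentages in the table. `twoProportionZTest`, `erfc`, and the `minSamples` threshold are hypothetical.

```javascript
// Complementary error function for x >= 0, via Abramowitz & Stegun 7.1.26
// (absolute error below 1.5e-7, adequate for significance thresholds).
function erfc(x) {
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
    + t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// Pooled two-proportion z-test, two-tailed.
// Returns null until both variants reach the (hypothetical) minimum sample size.
function twoProportionZTest(successA, nA, successB, nB, minSamples = 1000) {
  if (nA < minSamples || nB < minSamples) return null;
  const pA = successA / nA;
  const pB = successB / nB;
  const pooled = (successA + successB) / (nA + nB); // rate under the null hypothesis
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  const z = (pA - pB) / se;
  const p = erfc(Math.abs(z) / Math.SQRT2); // equals 2 * (1 - Phi(|z|))
  return { diff: pA - pB, z, p };
}

// Counts approximated from the table: 94.2% of 12,401 and 87.1% of 12,389 users.
const result = twoProportionZTest(
  Math.round(0.942 * 12401), 12401,
  Math.round(0.871 * 12389), 12389,
);
console.log(result);
```

With these approximate counts, z comes out at roughly 19. That is far beyond the 3.29 two-tailed cutoff for p = 0.001 and consistent with the p < 0.001 reported in the table. Computing p via `erfc` directly avoids the cancellation that `1 - cdf(z)` suffers for large z.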

Configuration

Define experiments in the dashboard. Each experiment specifies the traffic split, variant definitions, and the metric used to declare a winner. Changes take effect on the next deployment.

// experiments/chat_summary_v2.json
{
  "id": "chat_summary_v2",
  "traffic_split": [50, 50],
  "variants": [
    { "prompt_version": "v2_concise", "model": "meridian-7b" },
    { "prompt_version": "v1_detailed", "model": "meridian-3b" }
  ],
  "success_metric": "user_satisfaction"
}
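The bucketing snippet earlier on this page hard-codes `bucket < 50`. One way to derive that routing from a configuration like the one above is to treat `traffic_split` as percentages and walk the cumulative boundaries. The following is a minimal sketch under that assumption. This page does not say how Meridian handles splits that fail to sum to 100, so this version simply rejects them. `validateExperiment` and `variantForBucket` are hypothetical.

```javascript
// Hypothetical: rejects configs whose split does not line up with the variants.
function validateExperiment(config) {
  const { traffic_split: split, variants } = config;
  if (split.length !== variants.length) {
    throw new Error("traffic_split must have one entry per variant");
  }
  if (split.some((s) => !Number.isInteger(s) || s < 0)) {
    throw new Error("traffic_split entries must be non-negative integers");
  }
  const total = split.reduce((sum, s) => sum + s, 0);
  if (total !== 100) {
    throw new Error(`traffic_split must sum to 100, got ${total}`);
  }
}

// Map a bucket in 0..99 to a variant using cumulative split boundaries.
// With [50, 50], buckets 0-49 get variants[0] and buckets 50-99 get variants[1].
function variantForBucket(config, bucket) {
  if (!Number.isInteger(bucket) || bucket < 0) {
    throw new RangeError(`bucket ${bucket} is outside 0..99`);
  }
  let upper = 0;
  for (let i = 0; i < config.variants.length; i++) {
    upper += config.traffic_split[i];
    if (bucket < upper) return config.variants[i];
  }
  throw new RangeError(`bucket ${bucket} is outside 0..99`);
}

const experiment = {
  id: "chat_summary_v2",
  traffic_split: [50, 50],
  variants: [
    { prompt_version: "v2_concise", model: "meridian-7b" },
    { prompt_version: "v1_detailed", model: "meridian-3b" },
  ],
  success_metric: "user_satisfaction",
};
validateExperiment(experiment);
console.log(variantForBucket(experiment, 42)); // { prompt_version: 'v2_concise', model: 'meridian-7b' }
```

Because the boundaries are cumulative, the same function covers uneven or multi-way splits such as `[20, 30, 50]` without changing the routing code.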

Ready to experiment?

Create your first A/B test in the dashboard and start collecting data on your next deploy.
