Bandit experiments

Multi-armed bandit infrastructure for continuous recipe optimization.

Overview

Meridian runs Thompson-sampling bandits across recipe variants to maximize conversion without manual A/B splits. Each recipe arm maintains a Beta posterior over its conversion rate, and the system allocates traffic to each arm in proportion to its posterior probability of being the best arm.
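
As a rough illustration of the allocation rule, the sketch below estimates each arm's probability of being best by repeatedly sampling from the Beta posteriors. It is not Meridian's implementation: the function name allocation_shares, the Beta(1, 1) prior, the draw count, and the example counts are all assumptions made for the example.

```python
import random


def allocation_shares(arms, draws=10_000, seed=0):
    """Estimate each arm's probability of being best by Monte Carlo.

    `arms` maps an arm name to (successes, failures).
    A uniform Beta(1, 1) prior is assumed for every arm.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in arms}
    for _ in range(draws):
        # Draw one plausible conversion rate per arm from its posterior.
        samples = {
            name: rng.betavariate(1 + successes, 1 + failures)
            for name, (successes, failures) in arms.items()
        }
        # The arm with the highest sampled rate wins this draw.
        wins[max(samples, key=samples.get)] += 1
    # The fraction of draws won approximates P(arm is best).
    return {name: count / draws for name, count in wins.items()}


# Hypothetical counts: (conversions, non-conversions) per recipe arm.
shares = allocation_shares({"recipe_a": (30, 970), "recipe_b": (45, 955)})
```

The returned shares sum to 1 and can be used directly as traffic weights. Sampling once per request and serving the winning arm gives the same allocation in expectation.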

Arm lifecycle

cold-start — uniform prior, equal traffic share
exploring — posterior variance still high, adaptive allocation
converged — winner declared, loser arms paused
pruned — arm removed after sustained underperformance
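
The first two stages above can be told apart from the posterior alone, which the following sketch illustrates. The ArmStage enum, the function names, the Beta(1, 1) prior, and the variance threshold of 1e-4 are hypothetical; the source does not state how Meridian measures "high" variance or when it declares convergence.

```python
from enum import Enum


class ArmStage(Enum):
    COLD_START = "cold-start"
    EXPLORING = "exploring"
    CONVERGED = "converged"
    PRUNED = "pruned"


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta) / (total ** 2 * (total + 1))


def classify_early_stage(successes, failures, variance_threshold=1e-4):
    """Return COLD_START or EXPLORING, or None once variance is low.

    Assumes a uniform Beta(1, 1) prior. The threshold is illustrative only.
    """
    if successes + failures == 0:
        # No observations yet: the posterior is still the uniform prior.
        return ArmStage.COLD_START
    if beta_variance(1 + successes, 1 + failures) > variance_threshold:
        return ArmStage.EXPLORING
    # Low variance: whether the arm is converged or pruned depends on
    # how it compares with the other arms, which this sketch does not model.
    return None
```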

Reward signals

Rewards are binary (a conversion within the attribution window), with optional continuous weighting for revenue. Delayed rewards are credited to the arm that was active at exposure time. Rewards that arrive after the 72-hour window are discarded as stale.
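
A minimal sketch of delayed-reward attribution follows. It assumes the 72-hour window is measured from the exposure timestamp and that exposures are kept in a lookup keyed by a hypothetical exposure ID; the name attribute_reward and the data layout are illustrative, not Meridian's API.

```python
from datetime import datetime, timedelta

# Assumption: the window is measured from the moment of exposure.
ATTRIBUTION_WINDOW = timedelta(hours=72)


def attribute_reward(exposures, exposure_id, converted_at):
    """Return the arm to credit, or None if the reward is unknown or stale.

    `exposures` maps an exposure ID to (arm_name, exposed_at).
    """
    record = exposures.get(exposure_id)
    if record is None:
        return None
    arm, exposed_at = record
    delay = converted_at - exposed_at
    # Discard conversions that precede the exposure or fall outside the window.
    if delay < timedelta(0) or delay > ATTRIBUTION_WINDOW:
        return None
    # Credit the arm recorded at exposure time, not whichever arm is live now.
    return arm


exposed = datetime(2024, 1, 1, 12, 0)
log = {"exp-1": ("recipe_a", exposed)}
credited = attribute_reward(log, "exp-1", exposed + timedelta(hours=30))
```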

Guardrails

A minimum traffic floor per arm prevents premature convergence. Regret-bound monitoring triggers automatic arm retirement if cumulative regret exceeds the configured threshold. All allocations are logged to the experiment audit trail.
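
One simple way to enforce a traffic floor is sketched below: every arm is given the floor first, and the remaining traffic is split in proportion to the bandit's shares. The function name apply_traffic_floor and the 5% default are assumptions; the source does not specify the floor value or how Meridian applies it.

```python
def apply_traffic_floor(shares, floor=0.05):
    """Guarantee every arm at least `floor` of traffic.

    `shares` maps an arm name to its unconstrained allocation share.
    """
    n = len(shares)
    if n * floor > 1:
        raise ValueError("floor is too large for the number of arms")
    total = sum(shares.values())
    # Traffic left over after every arm has received its floor.
    remaining = 1 - n * floor
    return {
        name: floor + remaining * share / total
        for name, share in shares.items()
    }


floored = apply_traffic_floor({"recipe_a": 0.99, "recipe_b": 0.01})
```

The result still sums to 1, and an arm the bandit has nearly abandoned keeps receiving enough traffic for its posterior to be corrected if early data was misleading.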
