Bandit experiments
Multi-armed bandit infrastructure for continuous recipe optimization.
Overview
Meridian runs Thompson-sampling bandits across recipe variants to maximize conversion without manual A/B splits. Each arm (one recipe variant) maintains a Beta posterior over its conversion rate, and the system allocates traffic in proportion to each arm's posterior probability of being the best arm.
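As an illustration only, the allocation rule can be sketched in a few lines of Python. This is not Meridian's implementation: the Arm class, choose_arm, and prob_best are hypothetical names, and the sketch assumes purely binary rewards with a uniform Beta(1, 1) prior.

```python
import random


class Arm:
    """Hypothetical arm holding a Beta(alpha, beta) posterior over conversion rate."""

    def __init__(self, name, alpha=1.0, beta=1.0):
        self.name = name
        self.alpha = alpha  # prior pseudo-count plus observed conversions
        self.beta = beta    # prior pseudo-count plus observed non-conversions

    def update(self, converted):
        # Conjugate Beta-Bernoulli update for one binary reward.
        if converted:
            self.alpha += 1
        else:
            self.beta += 1


def choose_arm(arms, rng):
    # Thompson sampling: draw one sample per posterior, serve the largest draw.
    return max(arms, key=lambda a: rng.betavariate(a.alpha, a.beta))


def prob_best(arms, rng, draws=10_000):
    # Monte Carlo estimate of each arm's probability of being best.
    wins = {a.name: 0 for a in arms}
    for _ in range(draws):
        wins[choose_arm(arms, rng).name] += 1
    return {name: count / draws for name, count in wins.items()}
```

Serving each request with choose_arm yields, in expectation, the traffic shares that prob_best estimates, so no separate allocation table is needed in this sketch.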
Arm lifecycle
- cold-start — uniform prior, equal traffic share
- exploring — posterior variance still high, adaptive allocation
- converged — winner declared, loser arms paused
- pruned — arm removed after sustained underperformance
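One way to picture the lifecycle is as a small state machine. The sketch below is hypothetical: the state names come from the list above, but the allowed transitions (in particular, treating converged and pruned as terminal) are assumptions and not documented Meridian behavior.

```python
from enum import Enum


class ArmState(Enum):
    COLD_START = "cold-start"
    EXPLORING = "exploring"
    CONVERGED = "converged"
    PRUNED = "pruned"


# Assumed transition graph; the real system may allow other moves.
ALLOWED_TRANSITIONS = {
    ArmState.COLD_START: {ArmState.EXPLORING},
    ArmState.EXPLORING: {ArmState.CONVERGED, ArmState.PRUNED},
    ArmState.CONVERGED: set(),
    ArmState.PRUNED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the move is not in the assumed graph."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```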
Reward signals
Rewards are binary (a conversion within the attribution window), optionally weighted by a continuous revenue value. A delayed reward is credited to the arm that was active when the user was exposed, not the arm active when the conversion arrives. Rewards that arrive after the 72-hour window has elapsed are treated as stale and discarded.
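A minimal sketch of delayed-reward attribution follows. It assumes the 72-hour window is measured from exposure time and that exposures are kept in a dictionary keyed by user ID; attribute_reward, the exposures mapping, and revenue_weight are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the window is measured from the moment of exposure.
ATTRIBUTION_WINDOW = timedelta(hours=72)


def attribute_reward(exposures, user_id, converted_at, revenue_weight=1.0):
    """Return (arm, reward) for a conversion, or None if it cannot be credited.

    exposures maps user_id -> (arm_name, exposed_at).
    """
    exposure = exposures.get(user_id)
    if exposure is None:
        return None  # no recorded exposure, nothing to credit
    arm, exposed_at = exposure
    delay = converted_at - exposed_at
    if delay < timedelta(0) or delay > ATTRIBUTION_WINDOW:
        return None  # conversion precedes exposure, or reward is stale
    # Credit the arm that served the exposure, weighted by revenue if given.
    return arm, revenue_weight
```

For example, with an exposure recorded at midnight on 1 January, a conversion two days later is credited to the exposing arm, while one four days later is dropped.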
Guardrails
A minimum traffic floor per arm prevents premature convergence. Regret-bound monitoring automatically retires an arm if its cumulative regret exceeds a configured threshold. All allocations are logged to the experiment audit trail.
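The traffic floor could be applied on top of the bandit's raw allocation roughly as follows. This is a sketch under stated assumptions: apply_traffic_floor is a hypothetical helper, and reserving the floor first and then splitting the remainder proportionally is one possible scheme, not necessarily the one Meridian uses.

```python
def apply_traffic_floor(probs, floor):
    """Return allocation shares where every arm receives at least `floor`.

    probs maps arm name -> raw allocation weight (for example, its
    estimated probability of being best).
    """
    n = len(probs)
    if floor * n > 1.0:
        raise ValueError("floor is too large for the number of arms")
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("raw allocation weights must sum to a positive value")
    # Reserve `floor` for each arm, then split what is left proportionally.
    remaining = 1.0 - floor * n
    return {arm: floor + remaining * p / total for arm, p in probs.items()}
```

With this scheme the shares still sum to 1, and an arm whose raw weight has dropped to zero keeps exactly the floor, so it continues to collect evidence.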