DPO Primer

Direct Preference Optimization (DPO) replaces the reward-model and PPO stages of RLHF with a single fine-tuning pass. No reward model, no PPO: just a pairwise loss over chosen and rejected completions.

Core Formula

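L_DPO = -E_{(x, y_w, y_l) ~ D}[ log σ( β log(π_θ(y_w|x)/π_ref(y_w|x))
         - β log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]

Here x is the prompt, y_w the chosen completion, y_l the rejected completion, σ the logistic sigmoid, π_θ the policy being trained, and π_ref the frozen reference model.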
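As a rough illustration, the formula can be computed from four numbers per preference pair: the summed token log-probabilities of the chosen and rejected completions under the policy and under the reference. The sketch below uses only the Python standard library; the function names `log_sigmoid` and `dpo_loss` and the list-of-floats interface are assumptions made for this example, not part of any library. A real training loop would do the same arithmetic on tensors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch (hypothetical helper for illustration).

    Each argument is a list with one summed log-probability log pi(y|x)
    per preference pair.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        chosen_logratio = pc - rc      # log(pi_theta(y_w|x) / pi_ref(y_w|x))
        rejected_logratio = pr - rr    # log(pi_theta(y_l|x) / pi_ref(y_l|x))
        margin = beta * (chosen_logratio - rejected_logratio)
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2 (about 0.693), which is a useful sanity check at step zero.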

Why DPO

RLHF pipelines are brittle. You train a reward model on human preferences, then use PPO to optimize the policy against it. That gives you two moving targets and many hyperparameters, and reward hacking is a constant risk. DPO collapses the whole thing into a binary cross-entropy objective applied directly to the preference pairs.

What You Need

A base model you want to align (the policy π_θ starts from it, and a frozen copy serves as π_ref)
A dataset of (prompt, chosen, rejected) triples
A β parameter controlling divergence from the reference
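One possible way to represent the (prompt, chosen, rejected) triples in code is sketched below. The `PreferencePair` class and the `validate_pairs` helper are hypothetical names introduced for this example; the filtering rule (non-empty fields, chosen different from rejected) is an assumption about what makes a pair usable, not a requirement stated by DPO itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One training example: a prompt with a preferred and a dispreferred completion."""
    prompt: str
    chosen: str
    rejected: str


def validate_pairs(pairs):
    """Keep pairs with non-empty fields and distinct completions (assumed rule)."""
    return [
        p for p in pairs
        if p.prompt.strip()
        and p.chosen.strip()
        and p.rejected.strip()
        and p.chosen != p.rejected
    ]


pairs = [
    PreferencePair("Summarize: ...", "A concise summary.", "An off-topic reply."),
    PreferencePair("Translate: ...", "Same text", "Same text"),  # dropped: no preference signal
]
usable = validate_pairs(pairs)
```

A pair whose completions are identical carries no preference signal, so filtering it out early avoids wasting a training step on it.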

The Intuition

DPO increases the log-probability of chosen responses relative to rejected ones, measured against the reference model and scaled by β. The reference model acts as a regularizer: without it, nothing stops the policy from pushing the margin ever higher and drifting arbitrarily far from the base distribution. Higher β keeps you closer to the base distribution; lower β lets preferences dominate.
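To make the role of β concrete, the sketch below treats σ(β · gap) as the preference probability implied by the loss, where gap is the chosen log-ratio minus the rejected log-ratio. The helper names `implied_preference_prob` and `gap_needed` are made up for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def implied_preference_prob(chosen_logratio: float, rejected_logratio: float, beta: float) -> float:
    """Probability that chosen beats rejected, as implied by the DPO objective."""
    return sigmoid(beta * (chosen_logratio - rejected_logratio))


def gap_needed(target_prob: float, beta: float) -> float:
    """Log-ratio gap at which the implied probability reaches target_prob."""
    return math.log(target_prob / (1.0 - target_prob)) / beta


# Same log-ratio gap of 2.0 nats, two values of beta.
for beta in (0.1, 0.5):
    print(beta, implied_preference_prob(1.0, -1.0, beta), gap_needed(0.9, beta))
```

The smaller β is, the larger the gap the policy must open up against the reference before the loss treats a pair as confidently ranked, which is one way to read "lower β lets preferences dominate".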

Practical Notes

β in the 0.1–0.5 range works well for most 7B-class models. Preference pairs should be hard: pairs the policy already separates by a wide margin contribute almost no gradient. If your loss sits near log 2 ≈ 0.693 from the first steps, the policy has barely moved from the reference (the loss is exactly log 2 when the two are identical); try a higher learning rate or lower β.
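Both notes above can be checked numerically. Differentiating the loss with respect to the log-ratio gap gives a per-pair weight of β · σ(−β · gap), so a pair that is already well separated receives a weight near zero. The sketch below computes that weight and adds a simple plateau check against log 2; `gradient_weight`, `looks_stuck`, and the 0.01 tolerance are assumptions chosen for this example, not recommended settings.

```python
import math

LOG2 = math.log(2.0)  # DPO loss when policy and reference are identical


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient_weight(gap: float, beta: float) -> float:
    """Magnitude of d(loss)/d(gap) for one pair: beta * sigmoid(-beta * gap)."""
    return beta * sigmoid(-beta * gap)


def looks_stuck(losses, tol: float = 0.01) -> bool:
    """True if every logged loss is within tol of log 2 (assumed heuristic)."""
    return bool(losses) and all(abs(loss - LOG2) < tol for loss in losses)


print(gradient_weight(0.0, 0.1))    # unseparated pair: full weight beta / 2
print(gradient_weight(50.0, 0.5))   # well-separated pair: weight close to zero
print(looks_stuck([0.693, 0.692, 0.694]))
```

The weight also shows that β itself scales the gradient, so it may be worth checking the learning rate and β together when diagnosing a flat loss curve.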

Next: Build the training loop with the DPO Trainer recipe.