DPO Primer
Direct Preference Optimization replaces RLHF with a single-stage fine-tuning pass. No reward model, no PPO — just a clean pairwise loss over chosen and rejected completions.
Core Formula
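L_DPO = -E[ log σ( β log(π_θ(y_w|x)/π_ref(y_w|x))
                 - β log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]
Here y_w is the chosen completion, y_l is the rejected one, σ is the logistic function, and the expectation runs over the preference dataset.
Why DPO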
RLHF pipelines are brittle. You train a reward model on human preferences, then use PPO to align the policy against it: two moving targets, lots of hyperparameters, and a constant risk of reward hacking. DPO collapses the whole thing into a binary cross-entropy objective directly on the preference pairs.
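As a minimal sketch of that objective for a single preference pair, assume you already have the summed token log-probabilities of each completion under the policy and under the reference. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices here, not taken from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, for illustration only
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # log(pi_theta(y_w|x) / pi_ref(y_w|x))
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    # log(pi_theta(y_l|x) / pi_ref(y_l|x))
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Binary cross-entropy with the label fixed to "chosen wins".
    logits = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(logits)
```

In a real training loop these four numbers would come from forward passes of the trainable policy and the frozen reference, and the loss would be averaged over a batch.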
What You Need
- A base model you want to align; a frozen copy of it serves as the reference (π_ref), and the trainable copy is the policy (π_θ)
- A dataset of (prompt, chosen, rejected) triples
- A β parameter controlling divergence from the reference
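One way to pin these three ingredients down in code is sketched below. The class names `PreferencePair` and `DPOSetup`, the field names, the helper `usable_pairs`, and the default β of 0.1 are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    """One (prompt, chosen, rejected) triple."""
    prompt: str
    chosen: str
    rejected: str


@dataclass(frozen=True)
class DPOSetup:
    """Hypothetical container for the reference model name and beta."""
    reference_model: str  # identifier of the frozen base model (pi_ref)
    beta: float = 0.1     # assumed default; controls divergence from pi_ref

    def __post_init__(self) -> None:
        if self.beta <= 0:
            raise ValueError("beta must be positive")


def usable_pairs(pairs: List[PreferencePair]) -> List[PreferencePair]:
    """Keep pairs with non-empty fields whose completions actually differ."""
    return [
        p for p in pairs
        if p.prompt and p.chosen and p.rejected and p.chosen != p.rejected
    ]
```

Filtering out pairs whose chosen and rejected completions are identical is a design choice of this sketch: such pairs carry no preference signal.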
The Intuition
DPO increases the log-probability of chosen responses relative to rejected ones, measured against the reference model and scaled by β. The reference model acts as a regularizer: without it, the policy could drift arbitrarily far from the base distribution just to widen the margin. Higher β keeps you closer to the base distribution; lower β lets preferences dominate.
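One way to read the role of β numerically: the loss models the probability that the chosen response wins as σ(β · margin), where margin is the chosen log-ratio minus the rejected log-ratio, both taken against the reference. Solving for the margin shows how far the policy has to move from the reference to reach a given preference probability. The function names below are assumptions made for this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(margin: float, beta: float) -> float:
    """sigma(beta * margin): modeled probability that chosen beats rejected."""
    return sigmoid(beta * margin)


def required_margin(target_prob: float, beta: float) -> float:
    """Log-ratio margin (in nats) where sigma(beta * margin) == target_prob."""
    if not 0.0 < target_prob < 1.0:
        raise ValueError("target_prob must be strictly between 0 and 1")
    if beta <= 0:
        raise ValueError("beta must be positive")
    return math.log(target_prob / (1.0 - target_prob)) / beta


for beta in (0.1, 0.5):
    margin = required_margin(0.9, beta)
    print(f"beta={beta}: margin for 90% preference = {margin:.2f} nats")
```

With β = 0.1 the policy needs a margin of about 22 nats to reach a 90% preference probability, versus about 4.4 nats with β = 0.5, so a higher β is satisfied with a much smaller departure from the reference.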
Practical Notes
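β in the 0.1–0.5 range is a reasonable starting point for most 7B-class models. Preference pairs should be hard: once the policy already separates a pair by a wide margin, that pair contributes almost no gradient. If your loss plateaus immediately near ln 2 ≈ 0.693, its value when the policy and reference are identical, the policy is barely moving away from the reference; try a higher learning rate or a lower β.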
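Both observations can be checked numerically. In the gradient of the DPO loss, each pair is weighted by σ(-β · margin), where margin is the chosen log-ratio minus the rejected log-ratio against the reference, so pairs the policy already separates widely are nearly ignored. A loss that stays at ln 2 ≈ 0.693 is what you get when the policy and reference still agree. The helper names and the 0.01 tolerance below are assumptions for illustration.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient_weight(margin: float, beta: float) -> float:
    """Per-pair weight sigma(-beta * margin) in the DPO gradient.

    Near 1 when the policy ranks the pair wrongly, 0.5 at margin 0,
    near 0 when the pair is already separated by a wide margin.
    """
    return sigmoid(-beta * margin)


def stuck_near_init(losses: Sequence[float], tol: float = 0.01) -> bool:
    """True if every logged loss is within tol of ln 2 (policy ~ reference)."""
    ln2 = math.log(2)
    return bool(losses) and all(abs(loss - ln2) < tol for loss in losses)
```

A check like `stuck_near_init` over the first few hundred logged steps is a cheap way to notice that the policy is not moving before spending more compute.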
Next: Build the training loop with the DPO Trainer recipe.