Recipe

RLHF Primer

Reinforcement Learning from Human Feedback (RLHF): the core loop that aligns raw model outputs with human preferences.

The Three-Act Structure

RLHF operates in three distinct phases. First, a base language model is fine-tuned on supervised demonstration data (SFT). Second, human labelers rank multiple model outputs for the same prompt, and those rankings train a reward model that predicts a scalar preference score. Third, the SFT model is optimized with PPO to maximize the reward model's score, while a KL penalty keeps the policy close to the SFT model's output distribution.

Reward Model Architecture

The reward model typically uses the same transformer architecture as the policy and is often initialized from the SFT checkpoint, but it is trained as a separate model. The hidden state at the final token is projected through a linear head to a scalar score. Training uses a pairwise comparison loss: given a prompt and two completions, the model learns to assign the higher score to the human-preferred response. Bradley-Terry modeling underpins this: the probability that completion A beats completion B is the sigmoid of the score difference.
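The Bradley-Terry relationship and the resulting pairwise loss can be sketched in a few lines of plain Python. This is a minimal illustration on scalar scores, not a training loop; the function names (`preference_probability`, `pairwise_loss`) are hypothetical and chosen here for readability. In a real setup the scores would come from the reward model's linear head and the loss would be averaged over a batch.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that completion A is preferred over B."""
    return sigmoid(score_a - score_b)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred completion wins.

    Equals -log(sigmoid(score_chosen - score_rejected)), computed in the
    stable form softplus(-d) = max(-d, 0) + log1p(exp(-|d|)).
    """
    d = score_chosen - score_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


# Usage: the loss shrinks as the preferred completion's score pulls ahead.
tied = pairwise_loss(0.0, 0.0)      # equal scores, probability 0.5
correct = pairwise_loss(2.0, 0.0)   # preferred completion scored higher
wrong = pairwise_loss(0.0, 2.0)     # preferred completion scored lower
```

Because the loss depends only on the score difference, the reward model's absolute scale is not pinned down by this objective; only relative ordering within a prompt is learned.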

PPO with KL Constraint

Proximal Policy Optimization (PPO) updates the policy to maximize reward while penalizing divergence from the SFT checkpoint. The KL penalty coefficient can be adapted during training: if the measured KL drifts above a target threshold, the coefficient increases to pull the policy back, and if it falls below the target, the coefficient relaxes. This limits reward hacking, where the model exploits quirks in the reward model rather than producing genuinely useful text.
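One common way to implement the dynamic coefficient is a clipped proportional controller, sketched below together with the KL-penalized reward it feeds into. This is an illustrative sketch under stated assumptions: the class and function names are hypothetical, the default values for `beta`, `target_kl`, and `horizon` are placeholders rather than recommendations, and the KL term is the simple per-sample estimate (policy log-probability minus reference log-probability).

```python
from dataclasses import dataclass


@dataclass
class AdaptiveKLController:
    """Proportional controller for the KL penalty coefficient (beta)."""

    beta: float = 0.1        # placeholder starting coefficient
    target_kl: float = 6.0   # placeholder target KL per sequence
    horizon: int = 10_000    # placeholder: samples over which beta adapts

    def update(self, observed_kl: float, n_samples: int) -> float:
        # Relative error, clipped so one noisy batch cannot swing beta far.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        # KL above target raises beta; KL below target lowers it.
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta


def penalized_reward(rm_score: float, logp_policy: float,
                     logp_ref: float, beta: float) -> float:
    """Reward-model score minus beta times a per-sample KL estimate."""
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate


# Usage: KL at twice the target nudges beta up; KL at half the target nudges it down.
controller = AdaptiveKLController(beta=0.1, target_kl=6.0, horizon=1000)
beta_up = controller.update(observed_kl=12.0, n_samples=100)
beta_down = controller.update(observed_kl=3.0, n_samples=100)
shaped = penalized_reward(rm_score=1.0, logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
```

The clipping bounds the per-update change in the coefficient, which trades responsiveness for stability when the batch-level KL estimate is noisy.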

Practical Considerations

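Human preference labels are noisy: inter-annotator agreement often hovers around 70%. Mitigate this by collecting multiple rankings per prompt and aggregating them, for example by majority vote or by averaging. Reward model overoptimization can set in after roughly 1,000–2,000 PPO steps in typical setups, though the onset depends on the reward model and the KL coefficient. To detect divergence early, monitor reward on held-out prompts alongside the KL from the SFT checkpoint, and spot-check samples by hand, because the reward model's own score can keep rising after real quality has started to fall. For production, consider Direct Preference Optimization (DPO) as a simpler alternative that skips the explicit reward model and the RL loop, training the policy directly on preference pairs.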
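To make the DPO alternative concrete, here is a minimal sketch of its per-pair loss on scalar sequence log-probabilities. It assumes the standard DPO formulation, in which the implicit reward is `beta` times the log-ratio between the policy and a frozen reference model (usually the SFT checkpoint). The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not values from this primer.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities.

    logp_*     : log-probabilities under the policy being trained
    ref_logp_* : log-probabilities under the frozen reference model
    """
    # Implicit rewards are beta-scaled log-ratios of policy to reference.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), in the stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: a policy identical to the reference sits at log(2); shifting
# probability toward the chosen completion lowers the loss.
at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
regressed = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The loss has the same sigmoid-of-a-difference shape as the reward model's pairwise loss, with the score difference replaced by a difference of log-ratios, which is why no separate reward model is needed.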

Key Takeaway

RLHF is not a one-shot process. It is a flywheel: better reward models enable better policies, which surface harder cases for human labelers, which in turn improve the reward model. The bottleneck is usually high-quality human feedback at scale.