RLHF (Ouyang et al., 2022 — the InstructGPT paper) is a three-stage alignment pipeline that teaches models to produce outputs that humans prefer: helpful, harmless, and honest. It is the origin of the instruction-following behaviour in GPT-4, Claude, and Llama's chat variants.
The Three Stages
Stage 1: SFT (Supervised Fine-Tuning)
Train the base model on high-quality demonstration data (prompt → ideal response pairs). This teaches the model the expected format and domain, and produces the SFT model.
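The SFT objective is ordinary next-token cross-entropy restricted to the response tokens. A minimal PyTorch sketch, assuming a Hugging Face-style causal LM whose forward pass exposes `.logits`, and with prompt positions already masked out of `labels` (the names here are illustrative, not from the paper):

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token cross-entropy over the demonstration response only.

    `labels` is a copy of `input_ids` with prompt positions set to -100,
    so the loss covers the response tokens and ignores the prompt.
    """
    logits = model(input_ids).logits                 # (batch, seq, vocab)
    shift_logits = logits[:, :-1, :]                 # token t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```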
Stage 2: Reward Model Training
Human labellers rank multiple model responses to the same prompt. A separate reward model is trained on these rankings to output a scalar score that predicts which response humans prefer. This is the "brain" of the RLHF loop.
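The rankings are typically reduced to chosen/rejected pairs and the reward model is trained with a Bradley-Terry pairwise loss. A minimal sketch, assuming `reward_model` maps a tokenised response to one scalar score per sequence:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: push the score of the human-preferred
    response above the score of the rejected one."""
    r_chosen = reward_model(chosen_ids)        # shape (batch,), one scalar per response
    r_rejected = reward_model(rejected_ids)
    # -log sigmoid(r_chosen - r_rejected) is minimised when chosen outscores rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```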
Stage 3: PPO (Proximal Policy Optimization)
The SFT model is fine-tuned via RL to maximise the reward model's scores while staying close to the SFT policy (KL divergence penalty). This is the most compute-intensive stage: four models sit in memory simultaneously (the policy being trained, the frozen SFT reference, the reward model, and the value/critic model).
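Two pieces of this stage can be sketched compactly: the KL-shaped reward that PPO actually optimises, and the standard clipped surrogate objective. This is an illustrative sketch, not the paper's implementation; values such as `beta` and `clip_eps` are placeholder defaults:

```python
import torch

def shaped_reward(rm_scores, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward optimised by PPO: reward-model score minus a KL penalty
    that keeps the policy close to the frozen SFT reference."""
    kl = policy_logprobs - ref_logprobs          # per-token log-ratio (KL estimate)
    return rm_scores - beta * kl

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimise)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```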
Why RLHF Is Complex
PPO training for LLMs is notoriously unstable: reward hacking (the model learns to game the reward model rather than becoming genuinely helpful), mode collapse, and KL-penalty tuning all require significant MLOps expertise. This is why DPO has largely replaced RLHF for small-team fine-tuning.
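For contrast, DPO replaces stages 2 and 3 with a single supervised loss over preference pairs; only the policy and a frozen reference model are involved. A minimal sketch, where the `*_logps` arguments are the summed log-probabilities of each response under the policy and the reference model (argument names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: widen the policy's margin on the chosen response relative to
    the reference model, with no explicit reward model and no RL loop."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```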
DPO vs RLHF Summary
| | RLHF | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| RL training loop | PPO (complex) | Supervised loss (simple) |
| Compute | 4× model copies | 2× model copies |
| Used by frontier labs | OpenAI, Google | Meta (Llama 3+), Mistral |