RLHF (Ouyang et al., 2022 — the InstructGPT paper) is a three-stage alignment pipeline that teaches models to produce outputs that humans prefer: helpful, harmless, and honest. It is the origin of the instruction-following behaviour in GPT-4, Claude, and Llama's chat variants.
The Three Stages
Stage 1: SFT (Supervised Fine-Tuning)
Train the base model on high-quality demonstration data (prompt → ideal response pairs). This teaches the model the expected format and domain, and produces the SFT model.
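The SFT objective is ordinary next-token cross-entropy restricted to the response tokens. A minimal PyTorch sketch, assuming a Hugging Face-style causal LM whose forward pass exposes `.logits`, and with prompt positions already masked out of `labels` (the names here are illustrative, not from the paper):

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token cross-entropy over the demonstration response only.

    `labels` is a copy of `input_ids` with prompt positions set to -100,
    so the loss covers the response tokens and ignores the prompt.
    """
    logits = model(input_ids).logits                 # (batch, seq, vocab)
    shift_logits = logits[:, :-1, :]                 # token t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```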
Stage 2: Reward Model Training
Human labellers rank multiple model responses to the same prompt. A separate reward model is trained on these rankings to output a scalar score that predicts which response humans prefer. This is the "brain" of the RLHF loop.
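The rankings are typically reduced to chosen/rejected pairs and the reward model is trained with a Bradley-Terry pairwise loss. A minimal sketch, assuming `reward_model` maps a tokenised response to one scalar score per sequence:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: push the score of the human-preferred
    response above the score of the rejected one."""
    r_chosen = reward_model(chosen_ids)        # shape (batch,), one scalar per response
    r_rejected = reward_model(rejected_ids)
    # -log sigmoid(r_chosen - r_rejected) is minimised when chosen outscores rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```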
Stage 3: PPO (Proximal Policy Optimization)
The SFT model is fine-tuned via RL to maximise the reward model's scores while staying close to the SFT policy (KL divergence penalty). This is the most compute-intensive stage: four models sit in memory simultaneously (the policy being trained, the frozen SFT reference, the reward model, and the value/critic model).
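Two pieces of this stage can be sketched compactly: the KL-shaped reward that PPO actually optimises, and the standard clipped surrogate objective. This is an illustrative sketch, not the paper's implementation; values such as `beta` and `clip_eps` are placeholder defaults:

```python
import torch

def shaped_reward(rm_scores, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward optimised by PPO: reward-model score minus a KL penalty
    that keeps the policy close to the frozen SFT reference."""
    kl = policy_logprobs - ref_logprobs          # per-token log-ratio (KL estimate)
    return rm_scores - beta * kl

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimise)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```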
Why RLHF Is Complex
PPO training for LLMs is notoriously unstable: reward hacking (the model learns to game the reward model rather than becoming genuinely helpful), mode collapse, and KL-penalty tuning all require significant MLOps expertise. This is why DPO has largely replaced RLHF for small-team fine-tuning.
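For contrast, DPO replaces stages 2 and 3 with a single supervised loss over preference pairs; only the policy and a frozen reference model are involved. A minimal sketch, where the `*_logps` arguments are the summed log-probabilities of each response under the policy and the reference model (argument names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: widen the policy's margin on the chosen response relative to
    the reference model, with no explicit reward model and no RL loop."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```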
DPO vs RLHF Summary
| | RLHF | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| RL training loop | PPO (complex) | Supervised loss (simple) |
| Compute | 4× model copies | 2× model copies |
| Used by frontier labs | OpenAI, Google | Meta (Llama 3+), Mistral |