RLHF (Reinforcement Learning from Human Feedback)

Training

The alignment technique used to train ChatGPT-style models by learning from human preferences — combining supervised fine-tuning, reward modelling, and PPO.

RLHF (Ouyang et al., 2022 — the InstructGPT paper) is a three-stage alignment pipeline that teaches models to produce outputs that humans prefer: helpful, harmless, and honest. It is the origin of the instruction-following behaviour in GPT-4, Claude, and Llama's chat variants.

The Three Stages

Stage 1: SFT (Supervised Fine-Tuning)

Train the base model on high-quality demonstration data (prompt → ideal response pairs). This teaches the model the expected response format and domain, and produces the SFT model that the later stages build on.
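A minimal sketch of what Stage 1 looks like in practice, assuming a HuggingFace-style causal LM; the model name, demonstration data, and hyperparameters below are placeholders, not the InstructGPT setup.

```python
# Stage 1 (SFT) sketch: standard causal-LM fine-tuning on demonstration pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any base causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical demonstration data: prompt -> ideal response pairs.
demos = [("Explain RLHF in one sentence.",
          "RLHF fine-tunes a language model to prefer outputs that humans rate highly.")]

model.train()
for prompt, response in demos:
    # Next-token prediction loss on the concatenated prompt + response.
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the loss is usually masked so only the response tokens contribute, but the objective is the same supervised one shown here.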

Stage 2: Reward Model Training

Human labellers rank multiple model responses to the same prompt. A separate reward model is then trained on these rankings to predict a scalar preference score for any (prompt, response) pair. This learned scorer is the "brain" of the RLHF loop.
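A sketch of Stage 2, assuming the rankings have been reduced to pairwise comparisons (chosen vs. rejected response for the same prompt) and a scalar-head reward model; the backbone, data, and helper names are illustrative.

```python
# Stage 2 sketch: pairwise (Bradley-Terry) reward model training.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "gpt2"  # stand-in backbone; num_labels=1 gives a scalar reward head
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def score(prompt, response):
    batch = tokenizer(prompt + "\n" + response, return_tensors="pt")
    return reward_model(**batch).logits.squeeze(-1)  # scalar reward

# One hypothetical labelled comparison.
prompt = "Explain RLHF in one sentence."
chosen = "RLHF aligns a model with human preferences via a learned reward model."
rejected = "RLHF is when robots learn."

# Pairwise loss: push the chosen response's reward above the rejected one's.
loss = -F.logsigmoid(score(prompt, chosen) - score(prompt, rejected)).mean()
loss.backward()
optimizer.step()
```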

Stage 3: PPO (Proximal Policy Optimization)

The SFT model is fine-tuned with RL to maximise the reward model's scores while staying close to the SFT policy via a KL-divergence penalty. This is the most compute-intensive stage: four model copies sit in memory simultaneously (the trainable policy, the frozen SFT reference, the reward model, and the value/critic model).
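A sketch of the reward signal used in Stage 3, assuming per-token log-probabilities from the current policy and the frozen SFT reference model; the full PPO machinery (advantages, clipping, value loss) is omitted, and the function and coefficient names are illustrative.

```python
# Stage 3 sketch: reward model score combined with a KL penalty to the SFT policy.
import torch

def rlhf_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Per-token reward: a KL penalty keeps the policy close to the SFT
    reference, and the reward model's scalar score lands on the final token."""
    kl = policy_logprobs - ref_logprobs   # per-token KL estimate
    rewards = -kl_coef * kl               # penalise drifting from the SFT model
    rewards[..., -1] += rm_score          # RM score applied at the last token
    return rewards

# Toy tensors standing in for one generated sequence of 5 tokens.
policy_logprobs = torch.tensor([[-1.0, -0.8, -1.2, -0.5, -0.9]])
ref_logprobs    = torch.tensor([[-1.1, -0.9, -1.0, -0.6, -1.0]])
print(rlhf_reward(2.0, policy_logprobs, ref_logprobs))
```

The KL coefficient is exactly the knob that makes this stage finicky: too low and the policy drifts into reward hacking, too high and it never improves on the SFT model.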

Why RLHF Is Complex

PPO training for LLMs is notoriously unstable: reward hacking (the model learns to game the reward model rather than become genuinely helpful), mode collapse, and tuning of the KL penalty coefficient all require significant MLOps expertise. This is why DPO (Direct Preference Optimization) has largely replaced RLHF for small-team fine-tuning.
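For contrast, a sketch of the DPO objective referenced above, assuming summed sequence log-probabilities for the chosen and rejected responses under the trainable policy and a frozen reference model; the tensors here are toy values.

```python
# DPO sketch: a purely supervised loss on preference pairs, no reward model or RL loop.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Implicitly rewards the chosen response over the rejected one,
    measured relative to the frozen reference policy."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```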

DPO vs RLHF Summary

|                       | RLHF            | DPO                      |
|-----------------------|-----------------|--------------------------|
| Reward model          | Required        | Not needed               |
| RL training loop      | PPO (complex)   | Supervised loss (simple) |
| Compute               | 4× model copies | 2× model copies          |
| Used by frontier labs | OpenAI, Google  | Meta (Llama 3+), Mistral |