DPO (Rafailov et al., 2023) reformulates alignment as a supervised learning problem. Instead of training a separate reward model and running PPO, you directly optimise the LLM policy to prefer "chosen" responses over "rejected" ones, using a simple classification-style loss derived in closed form from the RLHF objective.
## How It Works
The DPO loss implicitly defines a reward as the scaled log-probability ratio between the trained policy and a frozen reference model. The model is pushed to increase the probability of preferred completions and decrease the probability of rejected ones, all within an ordinary supervised training loop, with no rollouts, reward model, or RL infrastructure required.
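Concretely, for a preference triple of prompt $x$, chosen response $y_w$, and rejected response $y_l$, the loss from the paper is:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $\sigma$ is the logistic sigmoid and $\beta$ controls how far the policy may drift from the reference. A minimal PyTorch sketch of the batch loss, assuming you have already summed per-token log-probabilities for each completion (the function name and inputs are illustrative, not a specific library's API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Batch DPO loss from summed completion log-probabilities.

    Each argument is a tensor of shape (batch,) holding log pi(y | x)
    summed over the completion's tokens.
    """
    # Implicit rewards: scaled log-ratios against the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification on the reward margin: -log sigmoid(margin).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```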
## DPO vs RLHF
| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required (separate model) | Not needed |
| Training stability | Difficult (reward hacking) | Stable (supervised loss) |
| GPU memory | 4× model copies | 2× (policy + reference) |
| Data format | Prompts + scalar rewards | Preference pairs (chosen/rejected) |
| Output quality | Standard at frontier labs | Competitive for most tasks |
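The preference-pair format in the last row is simply a prompt with two ranked completions. A typical record looks like this (the field names follow the common Hugging Face TRL convention, which is an assumption about your stack):

```python
preference_pair = {
    "prompt": "Summarise our Q3 incident report in two sentences.",
    "chosen": "Two outages occurred in Q3, both traced to the same DNS misconfiguration...",  # preferred by raters
    "rejected": "Everything was fine in Q3.",  # dispreferred by raters
}
```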
## Newer Alternatives
### ORPO
Odds Ratio Preference Optimization (Hong et al., 2024). Combines SFT and preference alignment in a single training pass and requires no reference model at all, which makes it extremely memory-efficient.
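For reference, ORPO's objective adds an odds-ratio penalty, weighted by $\lambda$, to the standard SFT loss (notation follows the ORPO paper):

$$
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}, \qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right), \qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$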
### SimPO
Simple Preference Optimization. A length-normalised, reference-free variant that mitigates verbosity bias; it competes with DPO on standard benchmarks with a simpler implementation.
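SimPO's implicit reward is the average (length-normalised) log-probability of the response, combined with a target reward margin $\gamma$; per the SimPO paper, where $|y|$ is the response length in tokens:

$$
\mathcal{L}_{\text{SimPO}} = -\,\mathbb{E}\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)\right]
$$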
## Why It Matters for On-Premise
If you fine-tune an on-premise model on domain data, a DPO alignment pass, using pairs of good and bad responses your team rates, helps ensure the model behaves helpfully and refuses inappropriate requests, without the engineering overhead of PPO. Using QLoRA + DPO, you can run the entire alignment pipeline on a single 80 GB A100 for a 7B model, as the sketch below illustrates.
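A minimal sketch of such a pipeline using Hugging Face TRL and PEFT. The model name and dataset path are placeholders, the hyperparameters are illustrative, and the exact `DPOConfig`/`DPOTrainer` arguments vary across TRL versions, so treat this as a starting point rather than a drop-in script:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# Load the base model in 4-bit (QLoRA) so the policy plus adapters fit on one GPU.
model_name = "your-org/your-7b-sft-model"  # placeholder: your SFT'd base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA adapters: only these small low-rank matrices receive gradients.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Preference dataset with "prompt", "chosen", "rejected" columns (TRL's expected format).
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT config, TRL uses the frozen base weights as the reference
    args=DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=2),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

With 4-bit base weights and LoRA adapters, the quantised policy doubles as its own reference model (TRL simply disables the adapters to score reference log-probabilities), which is why the whole run fits comfortably on a single A100.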