DPO (Rafailov et al., 2023) reformulates alignment as a supervised learning problem. Instead of training a separate reward model and running PPO, you directly optimise the LLM policy to prefer "chosen" responses over "rejected" ones, using a simple classification-style loss derived in closed form from the RLHF objective.
## How It Works
The DPO loss implicitly defines a reward as the scaled log-probability ratio between the trained policy and a frozen reference model. The model is pushed to increase the probability of preferred completions and decrease the probability of rejected ones, all within an ordinary supervised training loop, with no rollouts, reward model, or RL infrastructure required.
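Concretely, for a preference triple of prompt $x$, chosen response $y_w$, and rejected response $y_l$, the loss from the paper is:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $\sigma$ is the logistic sigmoid and $\beta$ controls how far the policy may drift from the reference. A minimal PyTorch sketch of the batch loss, assuming you have already summed per-token log-probabilities for each completion (the function name and inputs are illustrative, not a specific library's API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Batch DPO loss from summed completion log-probabilities.

    Each argument is a tensor of shape (batch,) holding log pi(y | x)
    summed over the completion's tokens.
    """
    # Implicit rewards: scaled log-ratios against the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification on the reward margin: -log sigmoid(margin).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```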
## DPO vs RLHF
| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required (separate model) | Not needed |
| Training stability | Difficult (reward hacking) | Stable (supervised loss) |
| GPU memory | 4× model copies | 2× (policy + reference) |
| Data format | Prompts + scalar rewards | Preference pairs (chosen/rejected) |
| Output quality | Standard at frontier labs | Competitive for most tasks |
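The preference-pair format in the last row is simply a prompt with two ranked completions. A typical record looks like this (the field names follow the common Hugging Face TRL convention, which is an assumption about your stack):

```python
preference_pair = {
    "prompt": "Summarise our Q3 incident report in two sentences.",
    "chosen": "Two outages occurred in Q3, both traced to the same DNS misconfiguration...",  # preferred by raters
    "rejected": "Everything was fine in Q3.",  # dispreferred by raters
}
```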
## Newer Alternatives
### ORPO
Odds Ratio Preference Optimization (Hong et al., 2024). Combines SFT and preference alignment in a single training pass and requires no reference model at all, which makes it extremely memory-efficient.
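For reference, ORPO's objective adds an odds-ratio penalty, weighted by $\lambda$, to the standard SFT loss (notation follows the ORPO paper):

$$
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}, \qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right), \qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$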
### SimPO
Simple Preference Optimization. A length-normalised, reference-free variant that mitigates verbosity bias; it competes with DPO on standard benchmarks with a simpler implementation.
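SimPO's implicit reward is the average (length-normalised) log-probability of the response, combined with a target reward margin $\gamma$; per the SimPO paper, where $|y|$ is the response length in tokens:

$$
\mathcal{L}_{\text{SimPO}} = -\,\mathbb{E}\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)\right]
$$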
## Why It Matters for On-Premise
If you fine-tune an on-premise model on domain data, a DPO alignment pass, using pairs of good and bad responses your team rates, helps ensure the model behaves helpfully and refuses inappropriate requests, without the engineering overhead of PPO. Using QLoRA + DPO, you can run the entire alignment pipeline on a single 80 GB A100 for a 7B model, as the sketch below illustrates.
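A minimal sketch of such a pipeline using Hugging Face TRL and PEFT. The model name and dataset path are placeholders, the hyperparameters are illustrative, and the exact `DPOConfig`/`DPOTrainer` arguments vary across TRL versions, so treat this as a starting point rather than a drop-in script:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# Load the base model in 4-bit (QLoRA) so the policy plus adapters fit on one GPU.
model_name = "your-org/your-7b-sft-model"  # placeholder: your SFT'd base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA adapters: only these small low-rank matrices receive gradients.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Preference dataset with "prompt", "chosen", "rejected" columns (TRL's expected format).
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT config, TRL uses the frozen base weights as the reference
    args=DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=2),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

With 4-bit base weights and LoRA adapters, the quantised policy doubles as its own reference model (TRL simply disables the adapters to score reference log-probabilities), which is why the whole run fits comfortably on a single A100.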