DPO (Direct Preference Optimization)

Training

A simpler alignment technique than RLHF that directly fine-tunes a model on preferred vs rejected response pairs — no separate reward model needed.

DPO (Rafailov et al., 2023) reformulates alignment as a supervised learning problem. Instead of training a separate reward model and running PPO, you directly optimise the LLM policy to prefer "chosen" responses over "rejected" ones using a closed-form loss.
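
Concretely, the closed-form objective from the paper is a logistic loss on the margin between two log-probability ratios, where \(\pi_\theta\) is the policy being trained, \(\pi_{\mathrm{ref}}\) the frozen reference model, \(y_w\)/\(y_l\) the chosen/rejected responses, and \(\beta\) a temperature hyperparameter:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$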

How It Works

The DPO loss function implicitly defines a reward using the log-probability ratio between the trained model and a frozen reference model. Each update pushes the model to increase the probability of the preferred completion and decrease the probability of the rejected one, all within an ordinary supervised training loop: no sampling, reward model, or other RL infrastructure is required.
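
As a minimal sketch of this idea (plain PyTorch, not any particular library's implementation; the function name and the `beta` default are illustrative), the loss can be computed directly from per-sequence log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss from summed per-sequence log-probabilities.

    Each argument is a 1-D tensor of shape (batch,), holding the total
    log-probability the respective model assigns to the chosen or
    rejected completion (prompt tokens excluded).
    """
    # Implicit rewards: beta-scaled log-ratios against the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the reward margin: push chosen above rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The `beta` coefficient controls how far the policy may drift from the reference; values around 0.1 to 0.5 are common in the literature.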

DPO vs RLHF

| Dimension | RLHF (PPO) | DPO |
| --- | --- | --- |
| Reward model | Required (separate model) | Not needed |
| Training stability | Difficult (prone to reward hacking) | Stable (supervised loss) |
| GPU memory | 4× model copies (policy, reference, reward, value) | 2× (policy + reference) |
| Data format | Prompts + scalar rewards | Preference pairs (chosen/rejected) |
| Quality | Still standard for frontier models | Competitive for most tasks |
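
To make the data-format row concrete, one DPO training example is simply a prompt paired with a better and a worse response. The field names below follow the prompt/chosen/rejected convention used by common preference-tuning libraries, and the content is purely illustrative:

```python
# One preference pair; a DPO dataset is just a list of these.
preference_example = {
    "prompt": "Summarise our refund policy for a customer.",
    "chosen": "Refunds are available within 30 days of purchase...",  # rated better
    "rejected": "We never give refunds under any circumstances.",     # rated worse
}
```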

Newer Alternatives

ORPO

Odds Ratio Preference Optimization. Combines SFT and alignment in a single training pass. Requires no reference model at all — extremely memory-efficient.
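
A hedged sketch of the idea, assuming the formulation in Hong et al. (2024): the model's own length-normalised likelihoods are turned into log-odds, and an odds-ratio term is added to the ordinary SFT loss, so a single model and a single pass suffice. Variable names here are illustrative.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_avg_logps, rejected_avg_logps, sft_nll, lam=0.1):
    """Sketch of the ORPO objective (no reference model needed).

    chosen_avg_logps / rejected_avg_logps: average per-token log-probabilities
    of the chosen / rejected completions under the single model being trained.
    sft_nll: standard next-token cross-entropy on the chosen completion.
    """
    # log-odds of a completion: log(p / (1 - p)) with p = exp(avg log-prob).
    def log_odds(avg_logps):
        return avg_logps - torch.log1p(-torch.exp(avg_logps))

    # Odds-ratio term pushes the chosen completion's odds above the rejected one's.
    ratio_loss = -F.logsigmoid(log_odds(chosen_avg_logps) - log_odds(rejected_avg_logps))

    # SFT and preference signal combined in one pass.
    return (sft_nll + lam * ratio_loss).mean()
```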

SimPO

Simple Preference Optimization. Reference-free, length-regularised variant that reduces verbosity bias by normalising the implicit reward by response length. Competitive with DPO on common benchmarks, with a simpler setup.
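
A hedged sketch, assuming the formulation in Meng et al. (2024): the implicit reward is the policy's average per-token log-probability (again, no reference model), and a target margin `gamma` is enforced between chosen and rejected responses. The default values below are illustrative.

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lengths, rejected_lengths,
               beta=2.0, gamma=0.5):
    """Sketch of the SimPO objective from summed log-probabilities and lengths."""
    # Length-normalised rewards discourage the verbosity bias of raw sums.
    chosen_rewards = beta * chosen_logps / chosen_lengths
    rejected_rewards = beta * rejected_logps / rejected_lengths

    # Require the chosen reward to beat the rejected one by a margin gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```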

Why It Matters for On-Premise

If you fine-tune an on-premise model on domain data, a DPO alignment pass (using pairs of good/bad responses your team rates) helps ensure the model behaves helpfully and refuses inappropriate requests — without the engineering overhead of PPO. Using QLoRA + DPO, you can run the entire alignment pipeline on a single 80GB A100 for a 7B model.
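
As an illustration, a QLoRA + DPO pass might be wired up roughly as follows with Hugging Face TRL and PEFT. Treat this as an outline rather than a drop-in script: argument names differ between TRL versions, and the model name and dataset path are placeholders.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

model_name = "your-org/your-7b-sft-model"  # placeholder: your SFT'd base model

# 4-bit quantisation (the "Q" in QLoRA) keeps the 7B policy within one GPU.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA adapters are the only trainable weights; with a PEFT config, TRL can
# recover the frozen reference behaviour from the base weights instead of
# keeping a second full model copy in memory.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

# Preference pairs rated by your team, with prompt/chosen/rejected columns.
train_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

training_args = DPOConfig(output_dir="dpo-7b", beta=0.1,
                          per_device_train_batch_size=2)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # called `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```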