DPO Breaks the Mold: What Weight-Space Geometry Reveals about Reasoning Training

Six methods, one question: are they really different?

Anyone working with large language models knows there are many techniques for distilling reasoning abilities from larger teachers into smaller students. The focus is often solely on the final benchmark score, but one research group decided to look inside the weight updates and ask whether methods like RFT, DPO, or Offline GRPO are mechanistically distinct or converge to similar solutions. The experiment compared six approaches (SFT, RFT, DFT, RIFT, Offline GRPO, and DPO) by training a Qwen3-4B model with attention-only LoRA adapters on identical math rollouts. The analysis then moved to weight-space geometry, using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA (centered kernel alignment).
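
To make the first of these metrics concrete, here is a minimal sketch of cosine similarity between two weight deltas (fine-tuned weights minus base weights). It is an illustration, not the study's code: the module name, the dict layout, and the toy numbers are all assumptions, and a real analysis would operate on the merged LoRA updates of every attention module.

```python
import math


def flatten(delta):
    """Concatenate every module's update into one flat vector (sorted by module name)."""
    flat = []
    for name in sorted(delta):
        for row in delta[name]:
            flat.extend(row)
    return flat


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy deltas keyed by a made-up module name; values are illustrative only.
delta_sft = {"layer0.q_proj": [[0.10, -0.20], [0.05, 0.00]]}
delta_rft = {"layer0.q_proj": [[0.11, -0.19], [0.04, 0.01]]}
delta_other = {"layer0.q_proj": [[0.20, 0.10], [0.00, -0.30]]}

print(cosine_similarity(flatten(delta_sft), flatten(delta_rft)))    # nearly collinear
print(cosine_similarity(flatten(delta_sft), flatten(delta_other)))  # orthogonal
```

A value near 1 means two methods pushed the weights in almost the same direction; a value near 0 means the updates are unrelated in direction, regardless of their size.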

Reassuring collinearity: SFT, RFT, and RIFT travel in parallel

The first finding is a near-perfect overlap among the weight deltas produced by SFT, RFT, and RIFT. Their cosine similarity exceeds 0.97, and the median principal angle across modules is just 7 degrees. GSM8K accuracy is also statistically indistinguishable, ranging from 87% to 88%, with McNemar tests showing no significant difference. In practical terms, if the goal is a model that reasons reliably, these three techniques are interchangeable and share the same solution basin, which is valuable knowledge for anyone wanting to avoid unnecessary complexity.
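
For readers unfamiliar with the McNemar test: it compares two models on the same test items and looks only at the items where exactly one of them is right. The sketch below implements the exact (binomial) two-sided variant with the standard library. The correctness flags are invented toy data, and whether the study used the exact or the chi-square form is not stated in the source.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (b, c, p_value), where b counts items only model A got right
    and c counts items only model B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under the null hypothesis, each disagreement favors either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Hypothetical per-item results for two models on ten questions (toy data).
model_a = [True, True, True, True, False, False, False, False, False, True]
model_b = [True] * 10

print(mcnemar_exact(model_a, model_b))
```

Two models with nearly identical accuracy can still differ on many individual items, which is why a paired test is more informative than comparing the two accuracy figures alone.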

Intentional divergence: DFT and Offline GRPO chart different courses

DFT (Direct Feedback Training) moves further from SFT than any reward-weighted method, despite being trained on the same data. Offline GRPO, by contrast, adds a component that is clearly orthogonal to the SFT direction: the orthogonal fraction is around 67% across the whole model and rises to 86% in the deeper layers. Yet the model remains anchored in the SFT loss basin, which suggests that exploring new directions need not cause chaotic drift and can be kept under control. This matters for anyone seeking a balance between customization and stability.
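
One way to read an "orthogonal fraction" is as the share of an update's squared norm that is left over after projecting it onto the SFT direction. The sketch below follows that reading; it is an assumption, since the source does not give the exact definition (a ratio of norms rather than squared norms is equally plausible), and the vectors are toy values.

```python
def orthogonal_fraction(delta, reference):
    """Share of delta's squared norm that lies orthogonal to the reference direction.

    Assumed definition: project delta onto reference, subtract the projection,
    and divide the residual's squared norm by delta's squared norm.
    """
    if len(delta) != len(reference):
        raise ValueError("vectors must have the same length")
    ref_sq = sum(r * r for r in reference)
    delta_sq = sum(d * d for d in delta)
    if ref_sq == 0.0 or delta_sq == 0.0:
        raise ValueError("orthogonal fraction is undefined for a zero vector")
    scale = sum(d * r for d, r in zip(delta, reference)) / ref_sq
    residual = [d - scale * r for d, r in zip(delta, reference)]
    return sum(x * x for x in residual) / delta_sq


# Hypothetical flattened deltas (toy values, not from the study).
sft_direction = [1.0, 0.0, 0.0]
other_update = [3.0, 4.0, 0.0]

print(orthogonal_fraction(other_update, sft_direction))
```

Under this definition the fraction equals one minus the squared cosine similarity, so 0 means the update is purely along the SFT direction and 1 means it is entirely orthogonal to it.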

DPO breaks the mold — and gets top results

The most extreme case is DPO: its update sits in a subspace almost orthogonal to SFT's, it shows a barrier in the linear mode connectivity analysis, and its CKA similarity collapses to roughly 0.46 in the later layers. Yet under this experimental protocol DPO reaches the highest GSM8K accuracy (93.5%), and the McNemar test against SFT is highly significant. The price is a radically different weight update, which could affect robustness, transferability, or calibration, all aspects to weigh carefully when planning a deployment.
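
CKA compares the activations two models produce for the same inputs at a given layer. Below is a minimal sketch of the linear variant in plain Python, assuming each matrix holds one row per input example and one column per feature. The activation values are invented, and the source does not say which CKA kernel the study used.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA between activation matrices x (n x p) and y (n x q)."""
    if len(x) != len(y):
        raise ValueError("both matrices need one row per shared input example")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    if denom == 0.0:
        raise ValueError("CKA is undefined when a representation is constant")
    return cross_sq_norm(xc, yc) / denom


# Hypothetical activations for four inputs (toy values).
acts_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Same representation rotated by 90 degrees and scaled by 2.
acts_rotated = [[-2.0 * b, 2.0 * a] for a, b in acts_a]
acts_other = [[1.0], [0.0], [0.0], [1.0]]

print(linear_cka(acts_a, acts_rotated))
print(linear_cka(acts_a, acts_other))
```

Linear CKA is unchanged by rotating or uniformly rescaling a representation, so a low score signals a genuinely different internal representation rather than a mere change of basis.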

What this means for on-premise fine-tuning

For those managing local infrastructure and needing full control over data and models, these results provide a compass. When simplicity and reproducibility are priorities, collinear techniques like SFT or RIFT are safe choices that do not demand excessive hardware for training. If you are instead aiming for peak performance and have adequate compute resources, DPO becomes an option to consider, with the understanding that its update path is radically different and might interact unpredictably with quantization or with serving in constrained environments. Those evaluating on-premise deployment can also find analytical frameworks on AI-RADAR for comparing the trade-offs between fine-tuning strategies without relying on generic advice.