WizardLM is back on the scene with a new paper titled "Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models".
The Problem Addressed
The research questions the standard approach of improving Generative Reward Models (GRMs) by simply increasing the length of the generated reasoning chains. The authors argue that the structure of reasoning matters as much as its length, especially across different evaluation contexts.
- Subjective Preference (e.g., Chat): Requires Breadth (B-CoT): evaluating multiple dimensions like tone, format, and helpfulness simultaneously.
- Objective Correctness (e.g., Math/Code): Requires Depth (D-CoT): rigorous, step-by-step deductive verification.
Forcing a model to "think longer" on a subjective chat task often just accumulates noise, while applying broad multi-aspect evaluation to a math problem misses critical logical flaws.
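To make the Breadth-vs-Depth distinction concrete, here is a minimal sketch of the routing idea. All names (`Verdict`, `evaluate`, the task-type strings) are illustrative assumptions, not code from the paper: the point is only that a preference task fans out across several dimensions at once, while a correctness task walks through the response step by step.

```python
# Hypothetical sketch of B-CoT vs D-CoT routing (names are illustrative,
# not taken from the Mix-GRM paper).
from dataclasses import dataclass

@dataclass
class Verdict:
    mode: str              # "B-CoT" (breadth) or "D-CoT" (depth)
    rationale: list[str]   # the reasoning structure actually used

def evaluate(task_type: str, response: str) -> Verdict:
    if task_type == "preference":
        # Breadth: assess several dimensions of the response in parallel.
        dims = ["tone", "format", "helpfulness"]
        return Verdict("B-CoT", [f"assess {d}" for d in dims])
    # Depth: verify each step of the response sequentially.
    steps = [s for s in response.split(". ") if s]
    return Verdict("D-CoT", [f"verify step {i + 1}" for i in range(len(steps))])

print(evaluate("preference", "Sure, happy to help!").mode)   # B-CoT
print(evaluate("correctness", "x = 2. So x^2 = 4.").mode)    # D-CoT
```

In the actual paper this choice is learned by the model itself, not hard-coded; the snippet only shows the two reasoning shapes being selected per task type.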
Mix-GRM: The Proposed Solution
The WizardLM team designed a framework called Mix-GRM that equips the GRM with both Breadth (B-CoT) and Depth (D-CoT) reasoning capabilities. The model was trained with reinforcement learning with verifiable rewards (RLVR), relying exclusively on final verdict supervision, with zero explicit routing labels. Remarkably, the model's structural alignment surged to 95%: it autonomously learned to polarize its reasoning, dynamically selecting Breadth for Preference and Depth for Correctness.
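The key training signal described above is verdict-only: the reward depends solely on whether the final judgment matches the ground truth, and nothing supervises which reasoning structure was used. A minimal sketch of such a reward function, under the assumption that the model emits a final line of the form `Verdict: A` (the format and function name are hypothetical):

```python
# Assumed sketch of a verdict-only reward for RLVR-style training.
# Only the final verdict is checked; the reasoning that precedes it
# (B-CoT or D-CoT) receives no direct supervision.

def verdict_reward(model_output: str, gold_verdict: str) -> float:
    final = None
    for line in model_output.splitlines():
        if line.lower().startswith("verdict:"):
            final = line.split(":", 1)[1].strip()
    return 1.0 if final == gold_verdict else 0.0

print(verdict_reward("Checked tone, format.\nVerdict: A", "A"))  # 1.0
print(verdict_reward("Step 1 fails.\nVerdict: B", "A"))          # 0.0
```

Because the reward never mentions routing, the 95% structural alignment the authors report emerges purely from the model discovering that breadth helps on preference tasks and depth helps on correctness tasks.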
Furthermore, Mix-GRM achieves superior performance while keeping token consumption within the exact same order of magnitude as standard single-pass reasoning, unlike length-scaling baselines that burn massive amounts of tokens.
For those evaluating on-premise deployments, there are trade-offs to consider. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these aspects.
๐ฌ Comments (0)
๐ Log in or register to comment on articles.
No comments yet. Be the first to comment!