Advanced Evaluation of Reward Models
Reward models (RMs) are central to aligning large language models (LLMs), but their effectiveness depends on how well they generalize to unseen prompts and shifting distributions. Most existing RM evaluations rely on static, pre-annotated preference datasets, which offer limited coverage and often fail to assess generalization accurately in open-world settings.
Pairwise Maximum Discrepancy Competition (PMDC)
Pairwise Maximum Discrepancy Competition (PMDC) is a dynamic, annotation-efficient framework for evaluating RM generalization over a large, unlabeled, open-domain prompt pool. PMDC actively selects prompt-response pairs that maximize the disagreement between two RMs, yielding a compact set of highly contentious test cases. An oracle adjudicates these cases, and the outcomes are aggregated with a Bradley-Terry model to produce a global ranking and a pairwise win-rate landscape of the RMs.
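To make the pipeline concrete, here is a minimal Python sketch of the two core steps: maximum-discrepancy case selection and Bradley-Terry aggregation. The interfaces (`rm_a` and `rm_b` as prompt-response scoring functions), the discrepancy measure, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def select_max_discrepancy_cases(cases, rm_a, rm_b, k):
    """Rank candidate test cases by how strongly two reward models disagree.

    `cases` is an iterable of (prompt, response_1, response_2) triples;
    rm_a / rm_b are assumed callables (prompt, response) -> scalar score.
    The discrepancy measure (difference of preference margins) is an
    illustrative choice, not necessarily the paper's exact objective.
    """
    scored = []
    for prompt, r1, r2 in cases:
        # Preference margin of each RM: positive means the RM prefers r1.
        margin_a = rm_a(prompt, r1) - rm_a(prompt, r2)
        margin_b = rm_b(prompt, r1) - rm_b(prompt, r2)
        # Disagreement peaks when the RMs confidently prefer opposite responses.
        scored.append((abs(margin_a - margin_b), prompt, r1, r2))
    scored.sort(key=lambda c: c[0], reverse=True)
    return scored[:k]  # the k most contentious cases, sent to the oracle


def bradley_terry_strengths(wins, iters=500, tol=1e-9):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of contentious cases the oracle adjudicated in
    favor of RM i against RM j. Uses the standard MM (minorization-
    maximization) update; assumes every RM wins at least one comparison.
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        p_new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new[i] = total_wins / denom if denom > 0 else p[i]
        p_new /= p_new.sum()  # fix the scale; strengths are identifiable only up to a constant
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p  # sorting RMs by strength gives the global ranking
```

Under the fitted model, the expected win rate of RM i over RM j is p_i / (p_i + p_j), which yields the pairwise win-rate landscape described above.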
Results and Analysis
Applying PMDC to 10 representative RMs revealed substantial rank reshuffling compared with conventional benchmarks. Qualitative analyses further uncovered systematic generalization failures, offering actionable insights for improving reward modeling.