Bias in Language Reward Models: Analysis and Mitigation
Fine-tuning language models against reward models (RMs) is prone to producing undesirable behaviors. New research identifies persistent biases in several high-quality RMs related to length, sycophancy, overconfidence, and model-specific style. An intervent...