Adaptive Alignment of LLMs with Best-of-Tails

A recent study introduces Best-of-Tails (BoT), an adaptive framework for aligning large language models (LLMs) at inference time. The goal is to overcome the limitations of existing strategies, which typically commit to either "optimistic" approaches (such as Best-of-$N$ sampling) or regularized "pessimistic" methods.

The Optimistic-Pessimistic Dilemma

Optimistic strategies tend to suffer from reward hacking: the selected response exploits weaknesses of the reward model rather than genuinely improving quality. Pessimistic methods, on the other hand, can be overly conservative and limit the exploration of high-quality responses. BoT addresses this trade-off by analyzing the distribution of rewards and dynamically adapting the selection strategy.
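
To make the trade-off concrete, here is a minimal sketch of the two extremes. The `generate` and `reward` arguments are hypothetical stand-ins for a sampler and a reward model, and the softmax-style relaxation is just one common way to make selection pessimistic, not the paper's exact formulation.

```python
import math
import random

def best_of_n(prompt, generate, reward, n=16):
    """Optimistic rule: sample n candidates and keep the single highest-reward one.
    Prone to reward hacking when the reward model over-scores flawed outputs."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

def regularized_best_of_n(prompt, generate, reward, n=16, beta=1.0):
    """Pessimistic rule: replace the hard argmax with sampling from a softmax over
    rewards. A smaller beta trusts the reward model less, at the cost of sometimes
    discarding genuinely strong responses."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(beta * (s - m)) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]
```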

How Best-of-Tails Works

BoT uses the Tsallis divergence as a tunable regularizer that interpolates between the optimistic and pessimistic regimes. The framework estimates the heaviness of the reward tail for each prompt and adjusts the selection rule accordingly. This dynamic balancing of exploration and alignment aims to improve LLM performance across different settings, such as mathematics, multiple-choice reasoning, and human-preference evaluations.
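
For reference, one standard form of the Tsallis divergence of order $q$ between distributions $p$ and $\pi$ is $D_q(p \,\|\, \pi) = \frac{1}{q-1}\left(\sum_x p(x)^q \pi(x)^{1-q} - 1\right)$, which recovers the KL divergence as $q \to 1$; the paper's exact parameterization may differ. The sketch below illustrates the adaptive idea rather than the paper's algorithm: it estimates per-prompt tail heaviness with a Hill estimator over the top rewards and switches the order of a Tsallis $q$-exponential weighting based on that estimate. The thresholds, the mapping from tail heaviness to $q$, and the helper functions are all assumptions.

```python
import math
import random

def hill_tail_index(rewards, k=None):
    """Hill estimator of the right-tail index from the top-k order statistics.
    Smaller values indicate a heavier tail. Shifting rewards to be positive is an
    illustrative choice, not something prescribed by the paper."""
    xs = sorted(rewards, reverse=True)
    if len(xs) < 3:
        return float("inf")
    if k is None:
        k = max(2, len(xs) // 4)          # use the top quarter by default
    k = min(k, len(xs) - 1)
    if xs[-1] <= 0:                        # shift so all values are positive
        shift = 1e-6 - xs[-1]
        xs = [x + shift for x in xs]
    threshold = xs[k]
    inv_alpha = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / inv_alpha if inv_alpha > 0 else float("inf")

def q_exp(x, q):
    """Tsallis q-exponential; reduces to exp(x) as q -> 1."""
    if abs(q - 1.0) < 1e-9:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

def adaptive_select(prompt, generate, reward, n=32,
                    alpha_threshold=2.0, q_heavy=1.5, q_light=0.5, beta=1.0):
    """Best-of-Tails-style sketch: when the per-prompt reward tail looks heavy
    (extreme scores are more likely to be reward-model artifacts), use a flatter,
    more pessimistic weighting; otherwise select more greedily. The threshold and
    the q values are hypothetical."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    alpha = hill_tail_index(scores)
    q = q_heavy if alpha < alpha_threshold else q_light
    m = max(scores)
    weights = [q_exp(beta * (s - m), q) for s in scores]
    if sum(weights) == 0:                  # safety fallback to plain argmax
        return candidates[scores.index(m)]
    return random.choices(candidates, weights=weights, k=1)[0]
```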