Reference-Guided LLM Evaluators

The research addresses the challenge of aligning large language models (LLMs) in domains where objective verification is not possible. It proposes bridging this gap with reference-guided LLM evaluators: judges that, supported by reference outputs, can act as indirect "verifiers".

Evaluation Protocols and Results

Specific evaluation protocols have been developed to help LLM-based evaluators leverage reference outputs. Experiments show that the reference-guided approach significantly improves the accuracy of less capable LLM judges when they are given references produced by frontier models. Even the most capable LLM judges benefit from high-quality references, such as those written by humans.
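As a concrete illustration of the idea (not the paper's exact protocol; the prompt template and the scoring rule here are assumptions), a reference-guided evaluation embeds a trusted reference answer in the judge's prompt. The stub below stands in for a real LLM call so the sketch runs offline:

```python
import re

# Sketch of a reference-guided evaluation protocol. The prompt template
# and scoring rule are assumptions, not the paper's exact wording;
# `stub_judge` is an offline stand-in for a real LLM judge.

def build_judge_prompt(question: str, candidate: str, reference: str) -> str:
    """Embed a trusted reference answer in the judge prompt so the judge
    can act as an indirect verifier."""
    return (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answer (trusted): {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with 1 if the candidate agrees with the reference, else 0."
    )

def reference_guided_score(question, candidate, reference, judge_fn):
    return int(judge_fn(build_judge_prompt(question, candidate, reference)))

def stub_judge(prompt: str) -> str:
    # Offline stand-in for an LLM judge: crude word overlap with the reference.
    ref = prompt.split("Reference answer (trusted): ")[1].splitlines()[0]
    cand = prompt.split("Candidate answer: ")[1].splitlines()[0]
    ref_words = set(re.findall(r"\w+", ref.lower()))
    cand_words = set(re.findall(r"\w+", cand.lower()))
    return "1" if len(ref_words & cand_words) >= len(ref_words) // 2 else "0"

print(reference_guided_score(
    "What is the capital of France?", "Paris is the capital.",
    "The capital of France is Paris.", stub_judge))  # → 1
```

The key design point is that the judge never needs to verify the answer from scratch; it only compares the candidate against the reference, which is the easier task that makes weaker judges more accurate.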

Guided Self-Improvement

The study also demonstrates the utility of high-quality references in alignment tuning: LLMs, guided by references, act as judges over their own outputs to drive self-improvement. This reference-guided self-improvement outperforms both direct supervised fine-tuning (SFT) on the reference outputs and reference-free self-improvement, and is comparable to training with ArmoRM, a strong finetuned reward model.

Specifically, the method achieved 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B. This corresponds to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard.