AlpacaEval

Benchmark

Automated instruction-following benchmark: 805 prompts judged by GPT-4. Measures a model's win rate against a fixed baseline (text-davinci-003 in v1.0, GPT-4 Turbo in v2.0). Fast, cheap, and highly correlated with Chatbot Arena Elo ratings.

AlpacaEval (Li et al., 2023) evaluates instruction following using 805 diverse prompts drawn from Self-Instruct, Dolly, Vicuna, Koala, and HH-RLHF. A GPT-4 judge compares each model's response against a reference output and reports the win rate. In the original benchmark the reference was text-davinci-003; AlpacaEval 2.0 switched the baseline to GPT-4 Turbo and introduced a length-controlled win rate metric to correct for verbosity bias.
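The mechanics are simple enough to sketch. The snippet below is a minimal, runnable illustration of the win-rate computation, not the real pipeline (which lives in the alpaca_eval package); the Pair dataclass and the injected judge callable are assumptions of this sketch, and the stub judge deliberately mimics the verbosity bias discussed in the next section.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pair:
    instruction: str
    model_output: str     # response from the model under test
    baseline_output: str  # response from the fixed baseline model

# The judge is injected as a callable that returns True when it prefers
# the model's output; in the real benchmark this role is played by a
# GPT-4 prompt, stubbed out here so the sketch stays runnable.
Judge = Callable[[Pair], bool]

def win_rate(pairs: list[Pair], judge: Judge) -> float:
    """Fraction of prompts on which the judge prefers the model under test."""
    return sum(judge(p) for p in pairs) / len(pairs)

# Toy usage with a stub judge that, like an uncorrected GPT-4 judge,
# simply prefers the longer answer.
pairs = [
    Pair("Name a color.", "Blue, a calming primary color.", "Blue."),
    Pair("What is 2+2?", "4", "2 + 2 equals 4."),
]
longer_wins: Judge = lambda p: len(p.model_output) > len(p.baseline_output)
print(win_rate(pairs, longer_wins))  # 0.5
```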

Why Length Control Matters

Early AlpacaEval results were gamed by models that produced verbose answers, since GPT-4 (the judge) tended to label longer responses as better even when the extra content was filler. AlpacaEval 2.0 corrects for this by fitting a regression of the judge's preference that includes output length as a covariate, then reporting the predicted win rate at zero length difference. This length-controlled (LC) win rate better reflects quality rather than quantity.
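A simplified sketch of that correction, assuming the judge's per-prompt preferences and output lengths are available: fit a logistic regression of preference on a bounded length-difference feature, then report the predicted win probability when the length difference is zero. The actual LC metric (Dubois et al., 2024) uses a richer generalized linear model with per-instruction difficulty terms; lc_win_rate and the simulated data here are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lc_win_rate(prefs, model_lens, baseline_lens):
    """Simplified length-controlled win rate.

    prefs         : 0/1 array, 1 if the judge preferred the model's output.
    model_lens    : token lengths of the model's outputs.
    baseline_lens : token lengths of the baseline's outputs.
    """
    # Bounded, standardized length difference, echoing the tanh feature
    # in the LC paper (without its per-instruction difficulty terms).
    diff = np.asarray(model_lens, float) - np.asarray(baseline_lens, float)
    x = np.tanh(diff / diff.std())

    # Regress the judge's preference on the length feature.
    clf = LogisticRegression().fit(x.reshape(-1, 1), prefs)

    # Counterfactual win rate: the predicted preference probability when
    # the length difference is forced to zero, so only the intercept
    # (i.e., model quality) contributes.
    return clf.predict_proba([[0.0]])[0, 1]

# Toy usage: 1,000 simulated comparisons in which longer answers win
# more often, a bias the LC correction should largely discount.
rng = np.random.default_rng(0)
m_lens = rng.integers(50, 400, size=1000)
b_lens = rng.integers(50, 400, size=1000)
prefs = (rng.random(1000) < 0.40 + 0.30 * (m_lens > b_lens)).astype(int)
print(lc_win_rate(prefs, m_lens, b_lens))
```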

Correlation with Chatbot Arena

AlpacaEval 2.0's LC win rate correlates with Chatbot Arena Elo at a Spearman correlation of roughly 0.98, among the highest reported for an automated benchmark. This makes AlpacaEval a cheap proxy for Chatbot Arena before a model has accumulated enough battle votes.
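As a concrete illustration of how such a figure is computed, the snippet below applies a Spearman rank correlation to a handful of paired scores. The LC win rates are taken from the table below; the Elo values are invented for the example, so the printed value is not the published 0.98.

```python
from scipy.stats import spearmanr

# LC win rates from the table below, paired with invented Elo values
# (illustrative only; see the live leaderboards for real numbers).
lc_win_rates = [57.5, 60.9, 38.1, 23.7]  # GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B, Mistral 7B
arena_elos   = [1285, 1290, 1210, 1070]  # hypothetical Elo ratings, same model order

rho, _ = spearmanr(lc_win_rates, arena_elos)
print(f"Spearman r = {rho:.2f}")  # 1.00 on this toy data; ~0.98 on real leaderboards
```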

Scores (AlpacaEval 2.0 LC Win Rate)

Model                       LC Win Rate
GPT-4o                      57.5%
Claude 3.5 Sonnet           60.9%
Llama 3.1 70B Instruct      38.1%
Mistral 7B Instruct v0.3    23.7%
GPT-4 Turbo (baseline)      50.0% (by definition)