Chatbot Arena (Elo)

Benchmark

LMSYS Chatbot Arena — crowd-sourced side-by-side LLM battles judged by real users. Elo ratings derived from millions of human preference votes. Widely treated as the reference leaderboard for real-world human preference.

Chatbot Arena (LMSYS, 2023–ongoing) is an open platform where users chat with two anonymous models simultaneously and vote for the better response. Elo ratings are computed from millions of pairwise comparisons made by real users on real tasks. It is widely considered the most reliable signal of real-world human preference: unlike static multiple-choice benchmarks, it is difficult to game through targeted training, since prompts arrive live and model identities are hidden until after the vote.

How Elo Ratings Work Here

Each battle updates both models' Elo scores based on the outcome. The Elo update gives small rating changes for expected outcomes and large changes for upsets, which stabilises ratings over time. With 10M+ battles, confidence intervals are tight and the aggregate rankings are hard for any small group of voters to move.
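The update rule is simple enough to sketch. Below is a minimal Python version of the standard online Elo update; the K-factor of 32 and the function names are illustrative assumptions, not LMSYS's exact implementation (the live leaderboard has also used a Bradley-Terry-style fit rather than sequential updates).

```python
# Minimal sketch of a standard online Elo update. K=32 and the function
# names are illustrative assumptions, not LMSYS's production code.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Expected outcomes move ratings a little; upsets move them a lot.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Example: a 1300-rated model upsets a 1400-rated one.
print(elo_update(1300, 1400, score_a=1.0))  # ~(1320.5, 1379.5)
```

The asymmetry is visible in the example: the underdog's expected score is only ~0.36, so its win shifts both ratings by ~20 points, whereas a win by the favourite would have shifted them by only ~11.5.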

Why It's the Most Trusted Signal

  • Real user traffic (not curated prompt sets)
  • Anonymous matchmaking prevents model-specific prompt optimisation
  • Aggregates across millions of different task types
  • Correlates strongly with commercial preference and downstream adoption

Elo Tiers (approximate, as of Q1 2026)

Tier         Elo Range   Representative Models
Elite        1350+       o3, Gemini 2.5 Pro, Claude 3.7 Sonnet
Frontier     1280–1349   GPT-4o, Claude 3.5 Sonnet, Grok 3
Strong       1200–1279   Llama 3.1 70B, Mistral Large 2
Competitive  1100–1199   Llama 3.1 8B, Gemma 2 27B
Entry        <1100       Mistral 7B v0.1, Llama 2 7B
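
Rating gaps between these tiers translate directly into expected head-to-head win rates via the Elo expected-score formula. The short sketch below is illustrative; the helper name is hypothetical.

```python
# Illustrative: convert an Elo gap into the higher-rated model's expected
# head-to-head win rate under the standard Elo model.
def win_probability(elo_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

for gap in (50, 100, 150, 250):
    print(f"{gap:>4}-point gap -> {win_probability(gap):.0%}")
# Roughly 57%, 64%, 70%, 81%
```

Even a full tier of separation therefore implies a favourite, not a guaranteed winner: a 100-point gap means the stronger model is expected to win only about two battles in three.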

Limitations

Elo ratings reflect average user preference across all tasks: a model that is exceptional at code but mediocre at creative writing may rank below a well-rounded generalist. Task-specific capability is better measured with targeted benchmarks such as HumanEval or MATH. Users can also be swayed by formatting and a confident tone rather than correctness, so a polished but wrong answer is sometimes preferred over a terse correct one.