Chatbot Arena (LMSYS, 2023–ongoing) is an open platform where users chat with two anonymous models side by side and vote for the better response. Elo ratings are computed from millions of pairwise comparisons contributed by real users across real tasks. It is widely regarded as one of the most reliable signals of real-world human preference: unlike static multiple-choice benchmarks, it is difficult to game through targeted training.
How Elo Ratings Work Here
Each battle updates both models' Elo scores based on the outcome. The Elo formula gives smaller rating changes for expected outcomes and larger changes for upsets, which stabilises ratings over time. With millions of battles, confidence intervals are tight and rankings are resistant to adversarial gaming.
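The update rule above can be sketched with the classic Elo formulas. This is a minimal illustration with a conventional K-factor of 32; the Arena's live leaderboard uses a more elaborate statistical fit over all battles, but the intuition is the same:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Return both models' new ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    The loser's change mirrors the winner's, so total rating is conserved.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# A 1200-rated underdog beating a 1300-rated favourite gains roughly
# 20.5 points at K=32, while the favourite winning the same matchup
# (the expected outcome) would gain only about 11.5.
```

Because `expected_score` already encodes the rating gap, surprising results move ratings more than predictable ones, which is exactly what keeps the leaderboard stable at high battle counts.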
Why It's the Most Trusted Signal
- Real user traffic (not curated prompt sets)
- Anonymous matchmaking prevents model-specific prompt optimisation
- Aggregates across millions of different task types
- Correlates strongly with commercial preference and downstream adoption
Elo Tiers (approximate, as of Q1 2026)
| Tier | Elo Range | Representative Models |
|---|---|---|
| Elite | 1350+ | o3, Gemini 2.5 Pro, Claude 3.7 Sonnet |
| Frontier | 1280–1349 | GPT-4o, Claude 3.5 Sonnet, Grok 3 |
| Strong | 1200–1279 | Llama 3.1 70B, Mistral Large 2 |
| Competitive | 1100–1199 | Llama 3.1 8B, Gemma 2 27B |
| Entry | <1100 | Mistral 7B v0.1, Llama 2 7B |
Limitations
Elo ratings reflect average user preference across all tasks: a model that is exceptional at code but mediocre at creative writing may rank below a well-rounded generalist. Task-specific capability is better measured with targeted benchmarks such as HumanEval or MATH. Users may also be swayed by formatting and confidence rather than correctness, so polished but incorrect answers are sometimes preferred.