Chatbot Arena (LMSYS, 2023–ongoing) is an open platform where users chat with two anonymous models side by side and vote for the better response. Elo ratings are computed from millions of pairwise comparisons contributed by real users across real tasks. It is widely regarded as one of the most reliable signals of real-world human preference: unlike static multiple-choice benchmarks, it is difficult to game through targeted training.
How Elo Ratings Work Here
Each battle updates both models' Elo scores based on the outcome. The Elo formula gives smaller rating changes for expected outcomes and larger changes for upsets, which stabilises ratings over time. With millions of battles, confidence intervals are tight and rankings are resistant to adversarial gaming.
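The update rule above can be sketched with the classic Elo formulas. This is a minimal illustration with a conventional K-factor of 32; the Arena's live leaderboard uses a more elaborate statistical fit over all battles, but the intuition is the same:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Return both models' new ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    The loser's change mirrors the winner's, so total rating is conserved.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# A 1200-rated underdog beating a 1300-rated favourite gains roughly
# 20.5 points at K=32, while the favourite winning the same matchup
# (the expected outcome) would gain only about 11.5.
```

Because `expected_score` already encodes the rating gap, surprising results move ratings more than predictable ones, which is exactly what keeps the leaderboard stable at high battle counts.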
Why It's the Most Trusted Signal
- Real user traffic (not curated prompt sets)
- Anonymous matchmaking prevents model-specific prompt optimisation
- Aggregates across millions of different task types
- Correlates strongly with commercial preference and downstream adoption
Elo Tiers (approximate, as of Q1 2026)
| Tier | Elo Range | Representative Models |
|---|---|---|
| Elite | 1350+ | o3, Gemini 2.5 Pro, Claude 3.7 Sonnet |
| Frontier | 1280–1349 | GPT-4o, Claude 3.5 Sonnet, Grok 3 |
| Strong | 1200–1279 | Llama 3.1 70B, Mistral Large 2 |
| Competitive | 1100–1199 | Llama 3.1 8B, Gemma 2 27B |
| Entry | <1100 | Mistral 7B v0.1, Llama 2 7B |
Limitations
Elo ratings reflect average user preference across all tasks: a model that is exceptional at code but mediocre at creative writing may rank below a well-rounded generalist. Task-specific capability is better measured with targeted benchmarks such as HumanEval or MATH. Users may also be swayed by formatting and confidence rather than correctness, so polished but incorrect answers are sometimes preferred.