LMSYS Chatbot Arena
Crowdsourced human preference Elo ratings from millions of real user comparisons.
What It Tests
Chatbot Arena is the world's largest crowdsourced AI evaluation platform, where real users submit any prompt they want and vote on which of two randomly selected anonymous models gives the better response. Operated by LMSYS Org (UC Berkeley, UCSD, CMU) and launched in May 2023, the platform has collected over 6 million human preference votes across commercial and open-source models.
The evaluation methodology uses the Bradley-Terry statistical model to compute Elo-like scores from pairwise comparisons. Unlike most benchmarks, Arena reflects genuine user preferences across the full diversity of real-world tasks — from creative writing to technical coding to emotional support — weighted by what users actually want to do with AI.
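The Bradley-Terry fit described above can be sketched in a few lines. This is a minimal illustrative implementation on made-up vote counts, not Arena's actual pipeline; the model names, vote counts, and the 1000-point base with a 400-point log scale are assumptions chosen to produce Elo-like numbers.

```python
import math

# Hypothetical vote counts: wins[(i, j)] = number of battles model i won against j.
models = ["model_a", "model_b", "model_c"]
wins = {
    ("model_a", "model_b"): 60, ("model_b", "model_a"): 40,
    ("model_a", "model_c"): 70, ("model_c", "model_a"): 30,
    ("model_b", "model_c"): 55, ("model_c", "model_b"): 45,
}

def bradley_terry(models, wins, iters=200):
    """Fit Bradley-Terry strengths by iterative maximum likelihood (Zermelo updates)."""
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += n_ij / (p[i] + p[j])
            new_p[i] = total_wins / denom if denom else p[i]
        # Normalize so the geometric mean is 1 (strengths are only identified up to scale).
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / g for m, v in new_p.items()}
    return p

def to_elo_like(p, base=1000, scale=400):
    """Map Bradley-Terry strengths onto an Elo-like scale."""
    return {m: base + scale * math.log10(v) for m, v in p.items()}

ratings = to_elo_like(bradley_terry(models, wins))
```

The ranking recovered from the pairwise wins (model_a above model_b above model_c) mirrors how Arena turns millions of individual votes into a single leaderboard score per model.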
Arena includes sub-leaderboards for specific capabilities: Coding, Math, Instruction Following, Longer Query, Multilingual, and Vision. Each sub-leaderboard uses only prompts from users who engaged with that category, giving a more precise capability signal.
Arena's primary strength is ecological validity: it reflects what real users value, not what researchers think they should value. Its primary weakness is that 'what users value' includes style, formatting, and sycophancy — models can game human preferences by being verbose, agreeable, and aesthetically polished without being more accurate.
Task Anatomy
How a single task is structured.
Example Tasks
2 real examples from the benchmark.
Why Arena matters: Real user diversity
Problem / Input
The key difference from static benchmarks: tasks are not predetermined. Arena measures what real users value across the full distribution of AI use cases.
Style control experiment
Problem / Input
Researchers asked: do users prefer longer responses regardless of quality? Style-controlled analysis strips length and formatting signals out of the votes to measure the underlying quality preference. This methodological concern is why Arena critics argue it measures 'appeal' rather than 'quality.' The Arena team now provides style-controlled sub-leaderboards.
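The style-control idea can be sketched as a logistic regression on battle outcomes, where a length covariate absorbs the verbosity bias and the remaining model term is the style-controlled quality gap. Everything below is synthetic: the bias strength, the length distribution, and the two-model setup are assumptions for illustration, not Arena's published method.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical simulation: model_a and model_b have EQUAL true quality,
# but model_a writes longer answers and voters have a mild length bias.
true_length_bias = 0.5
battles = []
for _ in range(5000):
    len_gap = random.gauss(1.0, 1.0)  # model_a tends to be longer
    p_a = sigmoid(0.0 + true_length_bias * len_gap)  # quality gap = 0
    battles.append((len_gap, 1 if random.random() < p_a else 0))

# Fit P(a wins) = sigmoid(delta + gamma * len_gap) by gradient descent:
# delta is the style-controlled quality gap; gamma absorbs the length bias.
delta, gamma, lr = 0.0, 0.0, 0.05
for _ in range(500):
    g_delta = g_gamma = 0.0
    for len_gap, a_won in battles:
        err = sigmoid(delta + gamma * len_gap) - a_won
        g_delta += err
        g_gamma += err * len_gap
    delta -= lr * g_delta / len(battles)
    gamma -= lr * g_gamma / len(battles)

raw_win_rate = sum(won for _, won in battles) / len(battles)
```

In this sketch model_a wins well over half of raw battles purely through verbosity, yet the fitted quality gap `delta` stays near zero once the length covariate soaks up the bias, which is exactly the separation a style-controlled leaderboard is after.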
Leaderboard Results
Model scores sorted by performance.
9 results
| # | Model | Score |
|---|---|---|
| 1 | Opus 4.6 | 1549 Elo (Coding) |
| 2 | Opus 4.6 | ~1503 Elo |
| 3 | GPT-5 | ~1490 Elo |
| 4 | Gemini 2.5 Pro | ~1480 Elo |
| 5 | Grok 4 | ~1465 Elo |
| 6 | Sonnet 4 | ~1430 Elo |
| 7 | Llama 4 Maverick | ~1400 Elo |
| 8 | GPT-4o | ~1380 Elo |
| 9 | DeepSeek V3 | ~1350 Elo |
V = self-reported by the model's creator, not independently verified
Score Over Time
Performance progression across model generations.
Key Findings
6 million+ human preference votes make Arena the largest behavioral evaluation dataset in AI history — no other benchmark has this scale of real human judgment.
Arena Elo correlates with other benchmark scores but not perfectly: models that score similarly on MMLU or GPQA can differ by 50+ Arena Elo points, suggesting distinct user-valued capabilities not captured by academic benchmarks.
Sub-leaderboard analysis reveals specialization: the Coding sub-leaderboard ranking differs meaningfully from the General ranking. Models like Kimi K2.6 that excel at coding may rank lower on general Arena than their coding ability would suggest.
The voting process revealed that human preferences are inconsistent: the same model pair produces different voting outcomes on similar prompts, so thousands of votes must be statistically aggregated to obtain a reliable signal.
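One standard way to quantify how many noisy votes are needed for a reliable signal is a bootstrap confidence interval over the observed win rate; Arena's published leaderboards report intervals of this general kind. The sketch below uses a simulated vote stream and a plain percentile bootstrap, so the vote probability and sample sizes are assumptions.

```python
import random

random.seed(1)

# Hypothetical noisy votes for one model pair: 1 = model_a wins, 0 = model_b wins.
votes = [1 if random.random() < 0.55 else 0 for _ in range(2000)]

def bootstrap_ci(votes, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for model_a's win rate."""
    means = []
    for _ in range(n_boot):
        resample = random.choices(votes, k=len(votes))  # sample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(votes)
```

With a few thousand votes the interval is narrow enough to separate nearby models; with only a few dozen votes the same procedure yields an interval too wide to rank them, which is why Arena aggregates at scale before reporting Elo differences.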
Controversies & Caveats
Known limitations and criticisms.
Style bias: users prefer longer, formatted, and more agreeable responses even when they are less accurate. Models can improve Arena scores by being verbose and sycophantic without improving actual quality.
The Llama 4 'slop' incident (April 2025): Meta's Llama 4 debuted at high Arena scores but received community criticism for 'answer farming' — optimizing for what users vote for rather than actual quality. Arena scores temporarily conflated style optimization with genuine capability.
Sampling disparity: a 2024 paper found that top models were sampled into Arena battles disproportionately more often than mid-tier models, creating statistical artifacts in the Elo calculations.
Gameable by labs: if a lab knows the Arena voting patterns, they could fine-tune specifically to appeal to Arena voters rather than to be genuinely better.
Not task-specific: a model that is excellent at coding but poor at roleplay might lose Arena battles dominated by roleplay users, making general Elo a poor signal for specific use cases.