Human Preference · Active

LMSYS Chatbot Arena

Crowdsourced human preference Elo ratings from millions of real user comparisons.

Tasks: 6,000,000
Year: 2023
Creator: Lianmin Zheng, Wei-Lin Chiang, et al.
Metric: Arena Score (Elo-like; ~1200 is competitive, ~1500 is top tier)

What It Tests

Chatbot Arena is the world's largest crowdsourced AI evaluation platform, where real users submit any prompt they want and vote on which of two randomly selected anonymous models gives the better response. Operated by LMSYS Org (UC Berkeley, UCSD, CMU) and launched in May 2023, the platform has collected over 6 million human preference votes across commercial and open-source models.

The evaluation methodology uses the Bradley-Terry statistical model to compute Elo-like scores from pairwise comparisons. Unlike most benchmarks, Arena reflects genuine user preferences across the full diversity of real-world tasks — from creative writing to technical coding to emotional support — weighted by what users actually want to do with AI.
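To make the mechanics concrete, here is a minimal sketch of a Bradley-Terry fit over a battle log. It assumes a simple list of (winner, loser) pairs, ignores ties, uses the classic minorization-maximization fixed-point update, and anchors the Elo-like scale at an arbitrary 1000 points; it illustrates the technique, not the LMSYS implementation.

    import math
    from collections import defaultdict

    def fit_bradley_terry(battles, iters=200):
        """battles: list of (winner, loser) model-name pairs; ties are ignored here.
        Returns an Elo-like score per model (the 1000-point anchor is arbitrary)."""
        wins = defaultdict(float)       # total wins per model
        pairs = defaultdict(float)      # number of battles per unordered model pair
        models = set()
        for winner, loser in battles:
            wins[winner] += 1.0
            pairs[frozenset((winner, loser))] += 1.0
            models.update((winner, loser))

        strengths = {m: 1.0 for m in models}    # Bradley-Terry strengths p_i
        for _ in range(iters):                  # minorization-maximization updates
            new = {}
            for i in models:
                denom = sum(pairs[frozenset((i, j))] / (strengths[i] + strengths[j])
                            for j in models if j != i)
                new[i] = max(wins[i], 1e-3) / denom if denom > 0 else strengths[i]
            # renormalize so the geometric mean of strengths stays at 1
            g = math.exp(sum(math.log(v) for v in new.values()) / len(new))
            strengths = {m: v / g for m, v in new.items()}

        return {m: 400.0 * math.log10(v) + 1000.0 for m, v in strengths.items()}

Fed the full vote log, the fixed point recovers a ranking; the production pipeline additionally handles ties, deduplication, and anonymity checks omitted in this sketch.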

Arena includes sub-leaderboards for specific capabilities: Coding, Math, Instruction Following, Longer Query, Multilingual, and Vision. Each sub-leaderboard uses only prompts from users who engaged with that category, giving a more precise capability signal.

Arena's primary strength is ecological validity: it reflects what real users value, not what researchers think they should value. Its primary weakness is that 'what users value' includes style, formatting, and sycophancy — models can game human preferences by being verbose, agreeable, and aesthetically polished without being more accurate.

Task Anatomy

How a single task is structured.

Input: Any user prompt (the user writes whatever they want to ask). Two anonymous models respond simultaneously. The user votes for the better response (Win A, Win B, Tie, or Both Bad).
Output: A vote for which model response was better.
Evaluation: A Bradley-Terry model fit to all pairwise votes gives each model an Elo-like score. 95% confidence intervals are computed via bootstrap resampling. Models with overlapping confidence intervals are not reliably distinguishable.
Metric: Arena Score (Elo-like; ~1200 is competitive, ~1500 is top tier)
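The bootstrap step in the Evaluation row can be sketched the same way: resample the vote log with replacement, refit, and read off percentile bounds per model. This relies on the hypothetical fit_bradley_terry helper sketched earlier, and the 100-resample default is illustrative, not the number the Arena pipeline uses.

    import random
    from collections import defaultdict

    def bootstrap_intervals(battles, n_boot=100, alpha=0.05):
        """Percentile bootstrap over the battle log, refitting Bradley-Terry each time.
        Relies on the fit_bradley_terry sketch above."""
        samples = defaultdict(list)
        for _ in range(n_boot):
            resampled = [random.choice(battles) for _ in range(len(battles))]
            for model, score in fit_bradley_terry(resampled).items():
                samples[model].append(score)
        intervals = {}
        for model, scores in samples.items():
            scores.sort()
            lo = scores[int((alpha / 2) * (len(scores) - 1))]        # approximate 2.5th percentile
            hi = scores[int((1 - alpha / 2) * (len(scores) - 1))]    # approximate 97.5th percentile
            intervals[model] = (lo, hi)
        return intervals

Two models whose intervals overlap cannot be reliably ordered, which is why adjacent leaderboard entries are often reported as statistically tied.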

Example Tasks

2 real examples from the benchmark.

#1

Why Arena matters: Real user diversity

N/A — varies by user

Problem / Input

Users submit prompts like:
  • 'Write a cover letter for a software engineering job'
  • 'Explain quantum entanglement like I'm 5'
  • 'Debug this Python code: [500 lines of code]'
  • 'What's a good recipe for someone who hates vegetables?'
  • 'Help me write a breakup text that's kind but clear'

Answer: User votes Win A, Win B, Tie, or Both Bad

The key difference from static benchmarks: tasks are not predetermined. Arena measures what real users value across the full distribution of AI use cases.

#2

Style control experiment

N/A — methodological

Problem / Input

Researchers asked: do users prefer longer responses, regardless of quality? Style-controlled analysis stripped out length and formatting signals from votes to measure raw quality preference.
Answer: Style-controlled scores differ from raw Elo scores, sometimes reshuffling model rankings.

This methodological concern is why Arena critics argue it measures 'appeal' rather than 'quality.' The Arena team now provides style-controlled sub-leaderboards.
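One way to "strip out" style statistically is to encode the Bradley-Terry model as a logistic regression over +1/-1 model indicator columns and add style covariates, so that any preference explained by, say, response length loads on the style coefficient rather than on the models. The sketch below assumes a hypothetical record format and uses a single length-difference feature; the Arena team's published style control uses a richer feature set (length, markdown headers, lists, and so on).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def style_controlled_scores(battles, model_names):
        """battles: dicts with 'model_a', 'model_b', 'a_wins' (0/1), 'len_a', 'len_b'
        (a hypothetical record format). Returns style-adjusted Elo-like scores."""
        idx = {m: i for i, m in enumerate(model_names)}
        X, y = [], []
        for b in battles:
            row = np.zeros(len(model_names) + 1)
            row[idx[b['model_a']]] = 1.0      # Bradley-Terry as logistic regression
            row[idx[b['model_b']]] = -1.0
            # the lone "style" feature: normalized response-length difference
            row[-1] = (b['len_a'] - b['len_b']) / max(b['len_a'] + b['len_b'], 1)
            X.append(row)
            y.append(b['a_wins'])
        clf = LogisticRegression(fit_intercept=False, C=1e6)   # large C ≈ no regularization
        clf.fit(np.array(X), np.array(y))
        scale = 400.0 / np.log(10.0)          # convert log-odds to the Elo scale
        # clf.coef_[0][-1] captures the length effect; model columns are style-adjusted
        return {m: clf.coef_[0][idx[m]] * scale + 1000.0 for m in model_names}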

Leaderboard Results

Model scores sorted by performance.

9 results

#   Model             Score
1   Opus 4.6          1549 Elo (Coding)
2   Opus 4.6          ~1503 Elo
3   GPT-5             ~1490 Elo
4   Gemini 2.5 Pro    ~1480 Elo
5   Grok 4            ~1465 Elo
6   Sonnet 4          ~1430 Elo
7   Llama 4 Maverick  ~1400 Elo
8   GPT-4o            ~1380 Elo
9   DeepSeek V3       ~1350 Elo

V = Self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • 6 million+ human preference votes make Arena the largest behavioral evaluation dataset in AI history — no other benchmark has this scale of real human judgment.

  • Arena Elo correlates with other benchmark scores but not perfectly: models that score similarly on MMLU or GPQA can differ by 50+ Arena Elo points, suggesting distinct user-valued capabilities not captured by academic benchmarks.

  • Sub-leaderboard analysis reveals specialization: the Coding sub-leaderboard ranking differs meaningfully from the General ranking. Models like Kimi K2.6 that excel at coding may rank lower on general Arena than their coding ability would suggest.

  • The voting process revealed that human preferences are not consistent — the same model pair produces different voting outcomes on similar prompts, requiring statistical aggregation over thousands of votes to get reliable signal.
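A back-of-envelope calculation (not part of the Arena methodology) shows why thousands of votes are needed to separate closely matched models: under the standard Elo win-probability formula, small rating gaps translate into win rates barely above 50%, and a normal approximation then gives the number of head-to-head votes required to distinguish that from a coin flip.

    import math

    def votes_to_separate(elo_gap, z=1.96):
        """Rough count of head-to-head votes needed before an Elo gap is
        distinguishable from 50/50 at ~95% confidence (normal approximation)."""
        p_win = 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))   # expected win rate of the stronger model
        margin = p_win - 0.5
        # need z * sqrt(p * (1 - p) / n) < margin  =>  n > z^2 * p * (1 - p) / margin^2
        return math.ceil(z ** 2 * p_win * (1.0 - p_win) / margin ** 2)

    for gap in (100, 50, 20):
        print(gap, votes_to_separate(gap))   # roughly 50, 200, and 1200 votes respectively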

Controversies & Caveats

Known limitations and criticisms.

Style bias: users prefer longer, formatted, and more agreeable responses even when they are less accurate. Models can improve Arena scores by being verbose and sycophantic without improving actual quality.

The Llama 4 'slop' incident (April 2025): Meta's Llama 4 debuted at high Arena scores but received community criticism for 'answer farming' — optimizing for what users vote for rather than actual quality. Arena scores temporarily conflated style optimization with genuine capability.

Sampling disparity: a 2024 paper found that top models were sampled into Arena battles proportionally more often than mid-tier models, creating statistical artifacts in the Elo calculations.

Gameable by labs: if a lab knows the Arena voting patterns, they could fine-tune specifically to appeal to Arena voters rather than to be genuinely better.

Not task-specific: a model that is excellent at coding but poor at roleplay might lose Arena battles dominated by roleplay users, making general Elo a poor signal for specific use cases.

Links