Leaderboard
All models × all benchmarks. Green cells mark scores near the maximum for that benchmark.
Legend (cell shading): top score, mid range, lower score. — = no data.

| Model | SWE-bench | HumanEval | SWE-Lancer | LiveCodeBench | MATH | GSM8K | GPQA | MMLU | AIME | BIG-Bench | TruthfulQA | ARC-Challenge | HellaSwag | GAIA | WebArena | AgentBench | τ-bench | TheAgentCompany | LMSYS | LiveBench | TheAgentCompany | ARC-Challenge | HLE (new) | ARC-AGI-2 (new) | FrontierMath (new) | OSWorld (new) | BigCodeBench (new) | Video-MME (new) | MMMU-Pro (new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 (OpenAI) | 74.9%v | 96.9%v | 66.3%v | ~79%v | ~95%+v | 99.7%v | 85.7%v | 93.0%v | 94.6%v | ~91%v | ~75%v | ~97%v | 96.4%v | 67.0%v | 58.1% | 6.8 | ~79% | ~24% | ~1426 Elo | ~82% | — | — | — | — | — | — | — | — | — |
| Opus 4.5 (Anthropic) | 80.9% | 95.7%v | — | 87.1% | ~97%v | ~98.5%v | 87.0%v | 90.8%v | 92.77%v | 93.1% | ~78%v | ~96.5%v | ~96%v | ~57% | 71.6% | ~6.2 | ~56% | — | ~1468 Elo | ~79% | — | — | — | — | — | — | — | — | — |
| GPT-4o (OpenAI) | — | — | ~9% | — | 76.6%v | 96.1%v | 74.8%v | 88.7%v | ~11%v | ~83% | ~69% | ~96%v | 95.7%v | ~15% | 14.41% | 4.27 | ~25% | 8.6% | ~1345 Elo | ~53% | — | — | — | — | — | — | — | — | — |
| Gemini 2.5 Pro (Google) | 67.2% (multi)v | 94.2%v | — | — | — | ~98%v | 86.4%v | 89.2%v | 88.0%v | 89.2% | — | — | — | ~52% | ~48% | — | ~48% | 30.3% | 1448 Elo | ~76% | — | — | — | — | — | — | — | — | — |
| Llama 4 Maverick (Meta, OSS) | — | 88.5%v | — | — | 88.9%v | ~95%v | 69.8%v | 87.8%v | — | ~85%v | ~62%v | 92.3%v | ~95%v | — | — | — | — | ~7.4% | ~1417 Elo | ~67% | — | — | — | — | — | — | — | — | — |
| Sonnet 4 (Anthropic) | 72.7%v | — | 26.2% | — | 97.2%v | — | 75.4%v | — | — | 93.1% | ~72%v | — | — | — | ~42% | ~5.2 | 84.7% | 26.3% | ~1410 Elo | — | — | — | — | — | — | — | — | — | — |
| DeepSeek V3 (DeepSeek, OSS) | 42.0%v | 90.2%v | — | — | 90.2%v | — | 59.1%v | 88.2%v | — | ~87%v | ~60% | 95.2%v | 95.8%v | — | — | ~4.5 | — | — | ~1363 Elo | — | — | — | — | — | — | — | — | — | — |
| Qwen 3 72B (Alibaba, OSS) | — | 92.7%v | — | 83.6%v | 96.4%v | ~97%v | ~78%v | 88.7%v | 79.2%v | ~88%v | — | ~94%v | ~95%v | — | — | — | — | — | — | ~73% | — | — | — | — | — | — | — | — | — |
| o3 (OpenAI) | 69.1%v | — | 37.3%v | — | ~97%v | 99.2%v | 87.7%v | — | 96.7%v | — | — | 98.1%v | — | — | — | — | — | — | ~1411 Elo | — | — | — | — | — | — | — | — | — | — |
| DeepSeek R1 (DeepSeek, OSS) | 49.2%v | — | — | 87.1%v | 97.3%v | 97.7%v | 71.5%v | 90.8%v | 79.8%v | — | — | — | — | — | — | — | — | — | — | ~72% | — | — | — | — | — | — | — | — | — |
| o4-mini (OpenAI) | 68.1%v | 97.6%v | — | 85.9%v | ~96%v | — | 81.4%v | — | 93.4%v | — | — | — | — | — | — | — | — | — | — | 87.3% | — | — | — | — | — | — | — | — | — |
| Opus 4.6 (Anthropic) | 80.8% | — | — | — | ~97–98%v | — | 91.3%v | ~90.8%v | 99.79%v | — | — | — | — | — | — | — | — | — | 1549 Elo (Coding) | — | — | — | — | — | — | — | — | — | — |
| Gemini 3 Pro (Google) | — | — | — | 91.7% | 96.4%v | — | 94.3% | — | ~93%v | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| Mistral Large 3 (Mistral) | — | — | — | — | 93.6%v | ~93%v | — | 87.1%v | — | — | ~65%v | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| Grok 4 (xAI) | — | — | — | — | — | — | 87.5–88.9%v | 92.7%v | 100% (Heavy)v | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| Kimi K2.6 (Moonshot, OSS) | 80.2%v | — | — | 89.6%v | — | — | — | ~86%v | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| M2.5 (MiniMax, OSS) | 80.2%v | — | — | — | — | — | — | ~86.5%v | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
| Grok 3 (xAI) | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 1402 Elo | — | — | — | — | — | — | — | — | — | — |

Scores reflect the best available result per model per benchmark. v = vendor self-reported. Benchmark names are truncated.
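
The cell notation used in the table (a leading `~`, a trailing `v`, ranges such as `87.5–88.9%`, units such as `Elo`, and parenthetical notes) can be hard to compare by eye. The following is a minimal Python sketch of one way to split a cell into its parts. The `Cell` class, the `parse_cell` function, and all field names are hypothetical and not part of the leaderboard. Only the `v` and `—` markers are defined by the page; reading `~` as "approximate" and a trailing `+` as "or higher" is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional

NUM = r"\d+(?:\.\d+)?"
# Pieces, in order: optional "~", a number or a range, optional unit,
# optional "+", optional "(note)", optional trailing "v".
CELL_RE = re.compile(
    rf"^(?P<approx>~)?(?P<low>{NUM})(?:[–-](?P<high>{NUM}))?"
    r"\s*(?P<unit>%|Elo)?(?P<plus>\+)?"
    r"\s*(?:\((?P<note>[^)]*)\))?\s*(?P<vendor>v)?$"
)


@dataclass
class Cell:
    low: float              # the score, or the lower end of a range
    high: float             # equals low unless the cell gives a range
    unit: str               # "%", "Elo", or "" for a bare number
    approximate: bool       # cell starts with "~" (assumed meaning)
    has_plus: bool          # cell carries a "+", as in "~95%+" (assumed "or higher")
    vendor_reported: bool   # cell ends with "v" (defined in the footnote)
    note: str               # parenthetical such as "multi" or "Heavy"


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one table cell; return None for the no-data marker."""
    text = text.strip()
    if text in ("—", ""):
        return None
    m = CELL_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized cell: {text!r}")
    low = float(m["low"])
    high = float(m["high"]) if m["high"] else low
    return Cell(
        low=low,
        high=high,
        unit=m["unit"] or "",
        approximate=m["approx"] is not None,
        has_plus=m["plus"] is not None,
        vendor_reported=m["vendor"] is not None,
        note=m["note"] or "",
    )
```

For example, `parse_cell("67.2% (multi)v")` yields a vendor-reported percentage with the note `multi`, while `parse_cell("—")` returns `None`. Keeping `low` and `high` separate avoids collapsing a range to a single number, which the table itself does not do.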