Leaderboard
All models × all benchmarks. Green cells = near max score for that benchmark.
Cell colors: top score, mid range, lower score. — = no data.

| Model | SWE-bench | HumanEval | SWE-Lancer | LiveCodeBench | MATH | GSM8K | GPQA | MMLU | AIME | BIG-Bench | TruthfulQA | ARC-Challenge | HellaSwag | GAIA | WebArena | AgentBench | τ-bench | TheAgentCompany | LMSYS | LiveBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 (OpenAI) | 74.9%v | 96.9%v | 66.3%v | ~79%v | ~94%v | 99.7%v | ~75%v | 91.4%v | 94.6%v | ~91%v | ~75%v | ~97%v | 96.4%v | 67.0%v | 58.1% | 6.8 | ~79% | ~24% | ~1490 Elo | ~82% |
| GPT-4o (OpenAI) | — | — | ~9% | — | — | 96.1%v | ~53%v | 88.7%v | ~20%v | ~83% | ~69% | ~96%v | 95.7%v | ~15% | 14.41% | 4.27 | ~25% | 8.6% | ~1380 Elo | ~53% |
| Opus 4.5 (Anthropic) | 80.9% | 95.7%v | — | 87.1% | ~90%v | ~98.5%v | — | 91.8%v | 80.0%v | 93.1% | ~78%v | ~96.5%v | ~96%v | ~57% | 71.6% | ~6.2 | ~56% | — | — | ~79% |
| Gemini 2.5 Pro (Google) | 78.0% | 94.2%v | — | — | — | ~98%v | ~83%v | ~90%v | 86.7%v | 89.2% | — | — | — | ~52% | ~48% | — | ~48% | 30.3% | ~1480 Elo | ~76% |
| Llama 4 Maverick (Meta, OSS) | — | 88.5%v | — | — | ~85%v | ~95%v | ~60%v | 87.8%v | — | ~85%v | ~62%v | 92.3%v | ~95%v | — | — | — | — | ~7.4% | ~1400 Elo | ~67% |
| Qwen 3 72B (Alibaba, OSS) | — | 92.7%v | — | 83.6%v | 96.4%v | ~97%v | — | 88.7%v | ~72%v | ~88%v | — | ~94%v | ~95%v | — | — | — | — | — | — | ~73% |
| Sonnet 4 (Anthropic) | — | — | 26.2% | — | — | — | ~68%v | — | — | 93.1% | ~72%v | — | — | — | ~42% | ~5.2 | 84.7% | 26.3% | ~1430 Elo | — |
| DeepSeek V3 (DeepSeek, OSS) | — | 90.2%v | — | — | 97.3%v | — | — | 88.2%v | — | ~87%v | ~60% | 95.2%v | 95.8%v | — | — | ~4.5 | — | — | ~1350 Elo | — |
| DeepSeek R1 (DeepSeek, OSS) | 73.3% | — | — | 87.1%v | 97.3%v | 97.7%v | ~71%v | 90.8%v | 79.8%v | — | — | — | — | — | — | — | — | — | — | ~72% |
| o3 (OpenAI) | — | — | 37.3%v | — | ~91.5%v | 99.2%v | 87.7%v | — | 96.7%v | — | — | 98.1%v | — | — | — | — | — | — | — | — |
| o4-mini (OpenAI) | — | 97.6%v | — | 85.9%v | ~93%v | — | 81.4%v | — | 93.4%v | — | — | — | — | — | — | — | — | — | — | 87.3% |
| Gemini 3 Pro (Google) | — | — | — | 91.7% | 96.4%v | — | 94.3% | — | ~93%v | — | — | — | — | — | — | — | — | — | — | — |
| Grok 4 (xAI) | — | — | — | — | — | — | ~85%v | 92.7%v | ~88%v | — | — | — | — | — | — | — | — | — | ~1465 Elo | — |
| Mistral Large 3 (Mistral) | — | — | — | — | ~80%v | ~93%v | — | 87.1%v | — | — | ~65%v | — | — | — | — | — | — | — | — | — |
| Opus 4.6 (Anthropic) | 80.8% | — | — | — | — | — | ~88%v | — | — | — | — | — | — | — | — | — | — | — | 1549 Elo (Coding) | — |
| Kimi K2.6 (Moonshot, OSS) | 80.2%v | — | — | 89.6%v | — | — | — | ~86%v | — | — | — | — | — | — | — | — | — | — | — | — |
| M2.5 (MiniMax, OSS) | 80.2%v | — | — | — | — | — | — | ~86.5%v | — | — | — | — | — | — | — | — | — | — | — | — |
| Grok 3 (xAI) | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |

Scores reflect the best available result per model per benchmark. v = vendor self-reported.
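
For readers who want to work with the table programmatically, the cell notation can be decoded with a small parser. The following is a minimal sketch, not part of the leaderboard itself: the `Cell` type and `parse_cell` function are hypothetical names, and treating a leading `~` as "approximate" is an assumption, since the page only defines `v` and `—`.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: float
    unit: str              # "%", "Elo", or "" for bare scores such as AgentBench
    approximate: bool      # leading "~" (assumed to mean an approximate figure)
    vendor_reported: bool  # trailing "v" (vendor self-reported)


# Optional "~", a number, an optional unit, an optional parenthetical
# qualifier such as "(Coding)", and an optional trailing "v".
_CELL = re.compile(r"^(~?)(\d+(?:\.\d+)?)\s*(%|Elo)?\s*(?:\([^)]*\))?\s*(v?)$")


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one table cell; return None for the no-data marker."""
    text = text.strip()
    if text in ("—", ""):
        return None
    match = _CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognized cell: {text!r}")
    approx, number, unit, vendor = match.groups()
    return Cell(float(number), unit or "", approx == "~", vendor == "v")
```

Keeping the `approximate` and `vendor_reported` flags separate from the numeric value makes it possible to filter out self-reported or rounded figures before comparing models.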