benchmark.darvinyi.com

Leaderboard

All models × all benchmarks. Green cells mark scores near the maximum for that benchmark.

Legend: Top score · Mid range · Lower score · — = no data
Model | SWE-bench | HumanEval | SWE-Lancer | LiveCodeBench | MATH | GSM8K | GPQA | MMLU | AIME | BIG-Bench | TruthfulQA | ARC-Challenge | HellaSwag | GAIA | WebArena | AgentBench | τ-bench | TheAgentCompany | LMSYS | LiveBench | TheAgentCompany | ARC-Challenge | HLE (NEW) | ARC-AGI-2 (NEW) | FrontierMath (NEW) | OSWorld (NEW) | BigCodeBench (NEW) | Video-MME (NEW) | MMMU-Pro (NEW)
GPT-5 (OpenAI)
74.9%v
96.9%v
66.3%v
~79%v
~95%+v
99.7%v
85.7%v
93.0%v
94.6%v
~91%v
~75%v
~97%v
96.4%v
67.0%v
58.1%
6.8
~79%
~24%
~1426 Elo
~82%
Opus 4.5 (Anthropic)
80.9%
95.7%v
87.1%
~97%v
~98.5%v
87.0%v
90.8%v
92.77%v
93.1%
~78%v
~96.5%v
~96%v
~57%
71.6%
~6.2
~56%
~1468 Elo
~79%
GPT-4o (OpenAI)
~9%
76.6%v
96.1%v
74.8%v
88.7%v
~11%v
~83%
~69%
~96%v
95.7%v
~15%
14.41%
4.27
~25%
8.6%
~1345 Elo
~53%
Gemini 2.5 Pro (Google)
67.2% (multi)v
94.2%v
~98%v
86.4%v
89.2%v
88.0%v
89.2%
~52%
~48%
~48%
30.3%
1448 Elo
~76%
Llama 4 Maverick (Meta, OSS)
88.5%v
88.9%v
~95%v
69.8%v
87.8%v
~85%v
~62%v
92.3%v
~95%v
~7.4%
~1417 Elo
~67%
Sonnet 4 (Anthropic)
72.7%v
26.2%
97.2%v
75.4%v
93.1%
~72%v
~42%
~5.2
84.7%
26.3%
~1410 Elo
DeepSeek V3 (DeepSeek, OSS)
42.0%v
90.2%v
90.2%v
59.1%v
88.2%v
~87%v
~60%
95.2%v
95.8%v
~4.5
~1363 Elo
Qwen 3 72B (Alibaba, OSS)
92.7%v
83.6%v
96.4%v
~97%v
~78%v
88.7%v
79.2%v
~88%v
~94%v
~95%v
~73%
o3 (OpenAI)
69.1%v
37.3%v
~97%v
99.2%v
87.7%v
96.7%v
98.1%v
~1411 Elo
DeepSeek R1 (DeepSeek, OSS)
49.2%v
87.1%v
97.3%v
97.7%v
71.5%v
90.8%v
79.8%v
~72%
o4-mini (OpenAI)
68.1%v
97.6%v
85.9%v
~96%v
81.4%v
93.4%v
87.3%
Opus 4.6 (Anthropic)
80.8%
~97–98%v
91.3%v
~90.8%v
99.79%v
1549 Elo (Coding)
Gemini 3 Pro (Google)
91.7%
96.4%v
94.3%
~93%v
Mistral Large 3 (Mistral)
93.6%v
~93%v
87.1%v
~65%v
Grok 4 (xAI)
87.5–88.9%v
92.7%v
100% (Heavy)v
Kimi K2.6 (Moonshot, OSS)
80.2%v
89.6%v
~86%v
M2.5 (MiniMax, OSS)
80.2%v
~86.5%v
Grok 3 (xAI)
1402 Elo

Scores reflect the best available result per model per benchmark. v = vendor self-reported. Benchmark names are truncated.
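
The cells above mix several notations: a leading "~" for approximate values, a trailing "v" for vendor self-reported results, "%" or "Elo" units, ranges such as "87.5–88.9%", and parenthetical qualifiers such as "(Heavy)". The following is a minimal Python sketch of how such a cell might be parsed into structured fields. The `Score` class, its field names, and the regular expression are illustrative assumptions, not part of the leaderboard; reading a trailing "+" as "at least" is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Score:
    """Hypothetical container for one leaderboard cell (names are assumptions)."""
    low: float                # lower bound; equals `high` for a single value
    high: float               # upper bound of a range such as "87.5–88.9%"
    unit: str                 # "%", "Elo", or "" when the cell has no unit
    approximate: bool         # cell starts with "~"
    at_least: bool            # cell carries a "+" (assumed to mean "or higher")
    vendor_reported: bool     # cell ends with the "v" marker
    note: Optional[str]       # parenthetical qualifier, e.g. "Heavy" or "multi"


# Pattern for the cell notations that appear in the table above.
CELL = re.compile(
    r"""^(?P<approx>~)?
        (?P<low>\d+(?:\.\d+)?)
        (?:[–-](?P<high>\d+(?:\.\d+)?))?
        (?P<pct>%)?
        (?P<plus>\+)?
        (?:\s*(?P<elo>Elo))?
        (?:\s*\((?P<note>[^)]*)\))?
        (?P<vendor>v)?$""",
    re.VERBOSE,
)


def parse_cell(cell: str) -> Score:
    """Parse one score cell; raise ValueError if the notation is not recognised."""
    m = CELL.match(cell.strip())
    if m is None:
        raise ValueError(f"unrecognised score cell: {cell!r}")
    low = float(m["low"])
    high = float(m["high"]) if m["high"] else low
    unit = "%" if m["pct"] else ("Elo" if m["elo"] else "")
    return Score(
        low=low,
        high=high,
        unit=unit,
        approximate=bool(m["approx"]),
        at_least=bool(m["plus"]),
        vendor_reported=bool(m["vendor"]),
        note=m["note"],
    )


if __name__ == "__main__":
    for raw in ["74.9%v", "~1426 Elo", "87.5–88.9%v", "100% (Heavy)v", "6.8"]:
        print(raw, "->", parse_cell(raw))
```

Keeping `approximate` and `vendor_reported` as separate flags lets a consumer filter out self-reported or rounded figures before comparing models. Cells without a unit, such as "6.8", are left unitless because their scale is not stated here.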