benchmark.darvinyi.com

Leaderboard

All models × all benchmarks. Green cells = near max score for that benchmark.

Top score
Mid range
Lower score
— = no data
Model | SWE-bench | HumanEval | SWE-Lancer | LiveCodeBench | MATH | GSM8K | GPQA | MMLU | AIME | BIG-Bench | TruthfulQA | ARC-Challenge | HellaSwag | GAIA | WebArena | AgentBench | τ-bench | TheAgentCompany | LMSYS | LiveBench
GPT-5 (OpenAI)
74.9%v
96.9%v
66.3%v
~79%v
~94%v
99.7%v
~75%v
91.4%v
94.6%v
~91%v
~75%v
~97%v
96.4%v
67.0%v
58.1%
6.8
~79%
~24%
~1490 Elo
~82%
GPT-4o (OpenAI)
~9%
96.1%v
~53%v
88.7%v
~20%v
~83%
~69%
~96%v
95.7%v
~15%
14.41%
4.27
~25%
8.6%
~1380 Elo
~53%
Opus 4.5 (Anthropic)
80.9%
95.7%v
87.1%
~90%v
~98.5%v
91.8%v
80.0%v
93.1%
~78%v
~96.5%v
~96%v
~57%
71.6%
~6.2
~56%
~79%
Gemini 2.5 Pro (Google)
78.0%
94.2%v
~98%v
~83%v
~90%v
86.7%v
89.2%
~52%
~48%
~48%
30.3%
~1480 Elo
~76%
Llama 4 Maverick (Meta, OSS)
88.5%v
~85%v
~95%v
~60%v
87.8%v
~85%v
~62%v
92.3%v
~95%v
~7.4%
~1400 Elo
~67%
Qwen 3 72B (Alibaba, OSS)
92.7%v
83.6%v
96.4%v
~97%v
88.7%v
~72%v
~88%v
~94%v
~95%v
~73%
Sonnet 4 (Anthropic)
26.2%
~68%v
93.1%
~72%v
~42%
~5.2
84.7%
26.3%
~1430 Elo
DeepSeek V3 (DeepSeek, OSS)
90.2%v
97.3%v
88.2%v
~87%v
~60%
95.2%v
95.8%v
~4.5
~1350 Elo
DeepSeek R1 (DeepSeek, OSS)
73.3%
87.1%v
97.3%v
97.7%v
~71%v
90.8%v
79.8%v
~72%
o3 (OpenAI)
37.3%v
~91.5%v
99.2%v
87.7%v
96.7%v
98.1%v
o4-mini (OpenAI)
97.6%v
85.9%v
~93%v
81.4%v
93.4%v
87.3%
Gemini 3 Pro (Google)
91.7%
96.4%v
94.3%
~93%v
Grok 4 (xAI)
~85%v
92.7%v
~88%v
~1465 Elo
Mistral Large 3 (Mistral)
~80%v
~93%v
87.1%v
~65%v
Opus 4.6 (Anthropic)
80.8%
~88%v
1549 Elo (Coding)
Kimi K2.6 (Moonshot, OSS)
80.2%v
89.6%v
~86%v
M2.5 (MiniMax, OSS)
80.2%v
~86.5%v
Grok 3 (xAI)

Scores reflect the best available result per model per benchmark. A trailing v marks a vendor self-reported score; ~ marks an approximate value. Benchmark names are truncated.