benchmark.darvinyi.com

Benchmarks

Every major LLM benchmark explained — what it tests, how tasks work, and where models stand.

 

Agent Tasks · Active
2023

AgentBench

LLM agents across 8 interactive environments: OS, databases, web, games, and more.

1,091 tasks
6.8
Math · Active
2024

AIME

Annual olympiad-level math competition whose newly released problems serve as a fresh, contamination-resistant AI benchmark.

30 tasks
100% (Heavy)
Reasoning · NEW
2025

ARC-AGI-2

Second-generation abstract visual reasoning benchmark testing symbolic interpretation, compositional rule application, and context-aware reasoning. Pure LLMs score 0%, while every task has been solved by at least two humans in two attempts or fewer.

Pending curation
Reasoning · Saturated
2018

ARC-Challenge

Grade-school science questions that simple retrieval systems can't answer.

2,590 tasks
98.1%
classic

ARC-Challenge + HellaSwag

Two classic saturated benchmarks: ARC-Challenge tests grade-school science reasoning (2,590 questions), and HellaSwag tests commonsense sentence completion (10,000+ questions). Frontier models now reach ~95–98% accuracy on both.

Pending curation
Reasoning · Nearing Saturation
2022

BIG-Bench Hard

23 hard reasoning tasks where chain-of-thought is required to exceed human performance.

6,511 tasks
93.1%
Coding · NEW
2024

BigCodeBench

1,140 practical coding tasks that require calling diverse APIs across 139 libraries; models score ≤60% against a 97% human baseline. Accepted as an ICLR 2025 oral.

Pending curation
Math · NEW
2024

FrontierMath

Hundreds of novel, research-grade mathematics problems across number theory, real analysis, algebraic geometry, and category theory, authored by expert mathematicians. Current best AI solves <2%.

Pending curation
Agent Tasks · Active
2023

GAIA

Multi-step real-world tasks that are conceptually simple for humans but require tool-using agents.

466 tasks
67.0%
Reasoning · Active
2023

GPQA Diamond

PhD-level science questions so hard that skilled non-experts with unrestricted web access still struggle, and even domain experts fall well short of perfect scores.

198 tasks
94.3%
Math · Saturated
2021

GSM8K

Grade-school math word problems requiring 2 to 8 steps of arithmetic reasoning.

8,500 tasks
99.7%
Reasoning · Saturated
2019

HellaSwag

Commonsense reasoning — pick the most plausible continuation of an everyday activity.

70,000 tasks
96.4%
Reasoning · NEW
2025

HLE

2,500 expert-crafted academic questions spanning math, natural sciences, and humanities, designed to be Google-proof and exceed frontier model capabilities. Published in Nature (Jan 2025).

Pending curation
Coding · Saturated
2021

HumanEval / HumanEval+

Python function completion from docstrings, evaluated by test execution.

164 tasks
97.6%
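
As a rough illustration of "evaluated by test execution", the sketch below runs a candidate completion and then its unit tests in a shared namespace, counting the task as solved only if nothing raises. The function name `passes_tests` and the toy `add` task are hypothetical, made up for this example. A real harness would also sandbox the code and enforce timeouts, which this sketch does not attempt.

```python
def passes_tests(completion: str, test_code: str) -> bool:
    """Return True if the completion defines code that passes the tests.

    Assumption: both arguments are plain Python source strings. Untrusted
    model output should never be exec'd outside a sandbox; this is only
    a minimal illustration of the scoring idea.
    """
    namespace: dict = {}
    try:
        exec(completion, namespace)  # define the candidate function
        exec(test_code, namespace)   # run assertions against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False
    return True


# Hypothetical toy task: one correct and one buggy completion.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"

print(passes_tests(good, tests))   # True
print(passes_tests(buggy, tests))  # False
```

A benchmark score such as the one above is then the fraction of tasks for which the check succeeds.
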
Contamination-Resistant · Active
2024

LiveBench

Contamination-resistant benchmark refreshed monthly from recent sources with no LLM judge.

1,000 tasks
87.3%
Coding · Active
2024

LiveCodeBench

Contamination-resistant coding benchmark using freshly released competition problems.

1,055 tasks
91.7%
Human Preference · Active
2023

LMSYS Chatbot Arena

Crowdsourced human preference Elo ratings from millions of real user comparisons.

6,000,000 tasks
1549 Elo (Coding)
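
To give a feel for how pairwise votes turn into a rating, here is a minimal sketch of the textbook Elo update. It assumes the classic formula with a 400-point scale and a K-factor of 32; both are conventional defaults chosen for illustration, and the leaderboard's actual fitting procedure may differ. The function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (an assumed default) controls how far a single vote moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


# Two equally rated models; a user prefers model A.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result, which is why ratings stabilize as votes accumulate.
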
Math · Nearing Saturation
2021

MATH Benchmark

Competition-level math problems across 7 subjects, from AMC to AIME difficulty.

12,500 tasks
~97–98%
Knowledge · Saturated
2020

MMLU / MMLU-Pro

Broad academic knowledge across 57 subjects — the standard knowledge benchmark.

14,042 tasks
93.0%
Multimodal · NEW
2024

MMMU-Pro

Hardened extension of MMMU with 10-option multiple-choice questions, vision-only perceptual tasks, and multi-image questions. Models score 16–27%, compared with ~56% on the original MMMU.

Pending curation
Agent · NEW
2024

OSWorld

369 real-world computer-use tasks on Ubuntu/Windows/macOS requiring GUI navigation, desktop app control, and multi-app workflows. Humans complete about 72% of tasks; the best model completed about 12% at launch (NeurIPS 2024).

Pending curation
Coding · Contaminated
2023

SWE-bench

Can AI resolve real GitHub issues on production codebases?

2,294 tasks
80.9%
Coding · Active
2025

SWE-Lancer

Real Upwork freelance software tasks mapped to $1M in economic value.

1,488 tasks
66.3%
Agent Tasks · Active
2024

TheAgentCompany

A simulated software company with 16 AI colleagues testing real office work tasks.

175 tasks
30.3%
agentic
2024

TheAgentCompany

175 professional workplace tasks in a simulated software company environment with 16 AI colleagues, 4 integrated platforms, and 6 job role domains. Best agent: ~43%. Published at NeurIPS 2025.

Pending curation
Reasoning · Active
2021

TruthfulQA

Can AI avoid repeating common myths and falsehoods that pervade its training data?

817 tasks
~78%
Multimodal · NEW
2024

Video-MME

Comprehensive multi-modal LLM video benchmark spanning 30 subfields and video durations from 11 seconds to 1 hour, with subtitle and audio-integrated questions. Accepted at CVPR 2025; adopted by OpenAI as industry standard.

Pending curation
Agent Tasks · Active
2023

WebArena / VisualWebArena

Autonomous browser agents completing realistic tasks on functional sandboxed websites.

812 tasks
71.6%
Agent Tasks · Active
2024

τ-bench (tau-bench)

AI customer service agents that must follow policy while solving real customer problems.

200 tasks
84.7%