The AI Benchmark Explorer
Deep-dives into every major LLM benchmark — what they test, how they work, real task examples, and where the models actually stand.
Benchmarks: 20
Agent Evals: 6
Curated Models: 18
Required Evals: 4 (APEX · GDPval · RLI · HAPI)
The Real Work Gap
Models that ace structured benchmarks often fail dramatically on real end-to-end professional work. The gap is larger than most people expect.
SWE-bench Pro: 58.6% resolving real GitHub issues (a contamination-resistant coding benchmark)
GAIA (top agent): 67.0% on real-world multi-step tasks with tool access; the human baseline is 92%
RLI automation rate: 3.75% of real Upwork freelance projects ($143,991 of actual paid work across 240 projects)
The same models that score 80%+ on structured benchmarks complete fewer than 4% of real freelance projects to client-acceptable quality.
Browse by Category
20 benchmarks across 7 categories.
Coding & Software Engineering (4 benchmarks): Code generation, bug fixing, and real-world software engineering tasks.
Mathematical Reasoning (3 benchmarks): From grade-school word problems to olympiad-level competition mathematics.
Reasoning & Knowledge (5 benchmarks): Logic, commonsense inference, expert science, and broad academic knowledge.
Knowledge (1 benchmark): Broad academic knowledge across dozens of subjects and domains.
Agent Tasks (5 benchmarks): Multi-turn autonomous tasks spanning web browsing, tool use, OS interaction, and customer service.
Human Preference (1 benchmark): Crowdsourced Elo ratings from real users in pairwise model comparisons; a sketch of the rating update follows this list.
Contamination-Resistant (1 benchmark): Designed to stay fresh, with monthly rotating questions that models can't memorize.
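To make the Human Preference mechanics concrete, here is a minimal sketch of the classic Elo update in Python. The K-factor of 32 and the 400-point logistic scale are illustrative assumptions, not this leaderboard's published settings; production arenas often fit a Bradley-Terry model to all votes at once rather than updating ratings match by match.

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model:
    # a 400-point rating gap corresponds to roughly 10:1 odds.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    # Return updated (r_a, r_b) after one pairwise comparison.
    # k is an illustrative assumption; real systems tune it or skip
    # online updates entirely in favor of a batch Bradley-Terry fit.
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: a 1200-rated model beats a 1300-rated one;
# the winner gains about 20.5 points and the loser drops the same.
print(elo_update(1200.0, 1300.0, a_won=True))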
Benchmark Status
Which benchmarks still differentiate frontier models?
For some, training data leakage has been detected; those are no longer reliable for frontier model comparison.
4 Required Agentic Evaluations
These four systems evaluate AI on real, economically grounded work rather than synthetic tasks. They represent the most rigorous tests of whether AI can actually replace human labor.