The AI Benchmark Explorer
Deep-dives into every major LLM benchmark — what they test, how they work, real task examples, and where the models actually stand.
Benchmarks: 20
Agent Evals: 6
Curated Models: 18
Required Evals: 4 (APEX · GDPval · RLI · HAPI)
The Real Work Gap
Models that ace structured benchmarks often fail dramatically on real end-to-end professional work. The gap is larger than most people expect.
SWE-bench Pro: 58.6% resolving real GitHub issues (a contamination-resistant coding benchmark)
GAIA (top agent): 67.0% on real-world multi-step tasks with tool access; the human baseline is 92%
RLI automation rate: 3.75% of real Upwork freelance projects ($143,991 of actual paid work across 240 projects)
The same models that score 80%+ on structured benchmarks complete fewer than 4% of real freelance projects to client-acceptable quality.
Browse by Category
20 benchmarks across 7 categories.
Coding & Software Engineering (4 benchmarks): Code generation, bug fixing, and real-world software engineering tasks.
Mathematical Reasoning (3 benchmarks): From grade-school word problems to olympiad-level competition mathematics.
Reasoning & Knowledge (5 benchmarks): Logic, commonsense inference, expert science, and broad academic knowledge.
Knowledge (1 benchmark): Broad academic knowledge across dozens of subjects and domains.
Agent Tasks (5 benchmarks): Multi-turn autonomous tasks spanning web browsing, tool use, OS interaction, and customer service.
Human Preference (1 benchmark): Crowdsourced Elo ratings from real users in pairwise model comparisons; a sketch of the rating update follows this list.
Contamination-Resistant (1 benchmark): Designed to stay fresh, with monthly rotating questions that models can't memorize.
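To make the Human Preference mechanics concrete, here is a minimal sketch of the classic Elo update in Python. The K-factor of 32 and the 400-point logistic scale are illustrative assumptions, not this leaderboard's published settings; production arenas often fit a Bradley-Terry model to all votes at once rather than updating ratings match by match.

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that model A beats model B under the Elo model:
    # a 400-point rating gap corresponds to roughly 10:1 odds.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    # Return updated (r_a, r_b) after one pairwise comparison.
    # k is an illustrative assumption; real systems tune it or skip
    # online updates entirely in favor of a batch Bradley-Terry fit.
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: a 1200-rated model beats a 1300-rated one;
# the winner gains about 20.5 points and the loser drops the same.
print(elo_update(1200.0, 1300.0, a_won=True))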
Benchmark Status
Which benchmarks still differentiate frontier models?
For some, training data leakage has been detected; those are no longer reliable for frontier model comparison.
4 Required Agentic Evaluations
These four systems evaluate AI on real, economically grounded work rather than synthetic tasks. They represent the most rigorous tests of whether AI can actually replace human labor.