benchmark.darvinyi.com
Updated 2026

The AI Benchmark Explorer

Deep-dives into every major LLM benchmark — what they test, how they work, real task examples, and where the models actually stand.

Benchmarks: 20
Agent Evals: 6
Curated Models: 18
Required Evals: 4 (APEX · GDPval · RLI · HAPI)

The Real Work Gap

Models that ace structured benchmarks often fail dramatically on real end-to-end professional work. The gap is larger than most people expect.

SWE-bench Pro: 58.6%
Resolving real GitHub issues (contamination-resistant coding benchmark)

GAIA (Top Agent): 67.0%
Real-world multi-step tasks with tool access (human baseline: 92%)

RLI Automation Rate: 3.75%
Real Upwork freelance projects: $143,991 of actual paid work across 240 projects

The same models that score 80% or higher on structured benchmarks complete fewer than 4% of real freelance projects to a client-acceptable standard.
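In concrete terms, if the automation rate is counted per project, 3.75% of the 240 RLI projects works out to roughly nine projects delivered to a standard a paying client would accept.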

Browse by Category

20 benchmarks across 8 categories.

Benchmark Status

Which benchmarks still differentiate frontier models?

4 Required Agentic Evaluations

These four systems evaluate AI on real, economically grounded work rather than synthetic tasks. They represent the most rigorous tests of whether AI can actually replace human labor.

Explore all agent evaluations →