Agent Evaluations

Benchmarks that test AI on real human work — not synthetic tasks. How much economic value can AI agents actually deliver?

Why real-work benchmarks are different

Standard benchmarks test isolated capabilities — a math problem, a code snippet, a multiple-choice question. Real-work benchmarks test complete, end-to-end tasks: drafting a legal memo, completing a freelance animation project, or analyzing patient records and producing a care plan. The gap between the two is enormous. Models scoring 80%+ on SWE-bench Verified complete fewer than 4% of real Upwork projects to client-acceptable quality.

4 Required Evaluations

Mandatory coverage

Required

Mercor (San Francisco)

2025

Mercor APEX + APEX-Agents

Professional knowledge work benchmark across investment banking, consulting, law, and medicine.

World-building infrastructure: 33 environments containing ~166 real files each in authentic Google Workspace + email + calendar + file storage — no other benchmark deploys this level of workplace fidelity.

GDPval

AI on real professional tasks from 44 occupations covering $3 trillion in annual wages.

GDP-grounded occupational selection: the only benchmark that uses Federal Reserve GDP data + BLS wage data to determine which occupations and tasks to include — a scientifically defensible sampling frame.

1320 tasks

47.6% win+tie
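As a rough illustration of how a win+tie figure like the one above can be computed, the sketch below assumes each task yields one grader verdict comparing the model's deliverable with the human expert's. The function name `win_tie_rate`, the verdict labels, and the sample data are hypothetical and are not taken from GDPval itself.

```python
from collections import Counter


def win_tie_rate(verdicts):
    """Return the share of comparisons the model won or tied.

    `verdicts` is a list of strings, each "win", "tie", or "loss",
    describing the model's deliverable relative to the human expert's.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["win"] + counts["tie"]) / len(verdicts)


# Hypothetical verdicts for illustration only, not GDPval data.
sample = ["win", "loss", "tie", "loss", "win", "loss", "loss", "tie"]
print(f"{win_tie_rate(sample):.1%}")
```

Counting ties alongside wins treats "as good as the expert" as a success, which is why the headline number reads higher than a pure win rate would.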

Required

CAIS / Scale AI

2025

Remote Labor Index (RLI)

Real Upwork freelance projects: $143,991 of actual paid work, with a maximum automation rate of 2.5%.

True economic grounding: every project has a real dollar value set by the real professional who did the work — no synthetic cost estimates.

Upwork HAPI

The first benchmark measuring how human expertise amplifies AI agent performance.

The only benchmark that measures human+agent collaboration as its primary signal, not agent performance in isolation.

322 tasks

93% (with human)

Additional Evaluations

METR (nonprofit, Berkeley-based)

2025

METR Time Horizon

How long can AI agents work autonomously? The 50%-success time horizon, doubling every 3 months.

Single interpretable number: 'This model completes, with 50% reliability, tasks that take a human expert X hours' is more meaningful to policymakers than '74.2% on MMLU'.
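One way to read the time-horizon number is as the point where a fitted success curve crosses 50%. The sketch below assumes success probability follows a logistic curve in the base-2 logarithm of human task time; the parameter values, function names, and the projection helper are hypothetical, and the doubling period is passed in as a parameter rather than asserted.

```python
import math


def success_probability(minutes, intercept, slope):
    """Logistic success curve over log2 of the human task time (assumed form)."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept, slope, p=0.5):
    """Task length in minutes at which the curve predicts success probability p."""
    logit = math.log(p / (1.0 - p))
    # Solve intercept + slope * log2(t) = logit for t.
    return 2 ** ((logit - intercept) / slope)


def projected_horizon(current_minutes, months_ahead, doubling_months):
    """Extrapolate a horizon forward under a constant doubling period."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


# Hypothetical fitted parameters, chosen only to make the arithmetic easy.
intercept, slope = 6.0, -1.0
horizon = time_horizon(intercept, slope)
print(f"50% horizon: {horizon:.0f} minutes")
print(f"check: p = {success_probability(horizon, intercept, slope):.2f}")

# Using the 3-month doubling period quoted in the card above as an input.
print(f"in 6 months: {projected_horizon(horizon, 6, 3):.0f} minutes")
```

A negative slope encodes the expectation that longer tasks are harder; with a different reliability target, such as `p=0.8`, the same curve gives a shorter horizon.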

BigLaw Bench

Real legal work quality — what percent of a lawyer-quality deliverable does AI produce?

Negative-point scoring: hallucinating legal citations actively hurts your score — unlike most benchmarks where wrong answers simply fail to earn points.

tasks

89.22%
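The negative-point scoring described above can be sketched as a rubric in which requirements carry positive points and errors carry negative ones. This is a minimal sketch under stated assumptions: the `RubricItem` type, the point values, and the normalization by total positive points are illustrative choices, not BigLaw Bench's published rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    description: str
    points: float    # > 0 for a requirement, < 0 for a penalty
    triggered: bool  # requirement met, or error actually present


def rubric_score(items):
    """Net points earned divided by the positive points available."""
    available = sum(item.points for item in items if item.points > 0)
    if available <= 0:
        raise ValueError("rubric needs at least one positive-point item")
    earned = sum(item.points for item in items if item.triggered)
    return earned / available


# Hypothetical rubric for a legal memo, for illustration only.
memo = [
    RubricItem("identifies the governing statute", 3, True),
    RubricItem("applies the facts to each element", 2, True),
    RubricItem("flags the limitations period", 1, False),
    RubricItem("cites a case that does not exist", -2, True),
]
print(f"{rubric_score(memo):.1%}")
```

Without the hallucinated citation this memo would earn 5 of 6 points; with it, the score drops to 3 of 6, so a wrong answer costs more than an omitted one.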