Agent Evaluations
Benchmarks that test AI on real human work — not synthetic tasks. How much economic value can AI agents actually deliver?
Why real-work benchmarks are different
Standard benchmarks test isolated capabilities — a math problem, a code snippet, a multiple-choice question. Real-work benchmarks test complete, end-to-end tasks: drafting a legal memo, completing a freelance animation project, or analyzing patient records and producing a care plan. The gap between the two is enormous. Models scoring 80%+ on SWE-bench Verified complete fewer than 4% of real Upwork projects to client-acceptable quality.
4 Required Evaluations
Mercor APEX + APEX-Agents
Professional knowledge work benchmark across investment banking, consulting, law, and medicine.
World-building infrastructure: 33 environments containing ~166 real files each in authentic Google Workspace + email + calendar + file storage — no other benchmark deploys this level of workplace fidelity.
GDPval
Real professional tasks from 44 occupations representing $3 trillion in annual wages.
GDP-grounded occupational selection: the only benchmark that uses Federal Reserve GDP data + BLS wage data to determine which occupations and tasks to include — a scientifically defensible sampling frame.
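To make that sampling frame concrete: rank sectors by GDP contribution, then select the occupations with the largest total wage bills inside the top sectors. A minimal sketch of the idea follows; every number and field name in it is invented for illustration and is not GDPval's actual input data.

```python
# Illustrative sketch of a GDP/wage-grounded sampling frame: pick the sectors
# that contribute most to GDP, then the highest-wage-bill occupations within
# them. All figures and names are made up; this is not GDPval's actual data.

sectors = {"Finance": 4.1, "Healthcare": 3.3, "Retail": 1.5}  # GDP contribution, $T (illustrative)
occupations = [                                               # (sector, occupation, total wages $B)
    ("Finance", "Financial analysts", 65.0),
    ("Finance", "Loan officers", 30.0),
    ("Healthcare", "Registered nurses", 280.0),
    ("Retail", "Cashiers", 90.0),
]

top_sectors = sorted(sectors, key=sectors.get, reverse=True)[:2]   # Finance, Healthcare
frame = sorted((o for o in occupations if o[0] in top_sectors),
               key=lambda o: o[2], reverse=True)
for sector, occupation, wages in frame:
    print(f"{occupation:<22} {sector:<12} ${wages:>6.1f}B in wages")
```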
Remote Labor Index (RLI)
Real Upwork freelance projects: $143,991 of actual paid work, with a maximum automation rate of 2.5%.
True economic grounding: every project has a real dollar value set by the real professional who did the work — no synthetic cost estimates.
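Because each project carries the price the original freelancer was actually paid, automation can be reported both by project count and by dollar value. The sketch below shows the arithmetic; the field names and acceptance flag are assumptions for illustration, not RLI's released evaluation code.

```python
# Illustrative sketch: dollar-weighted automation rate for freelance projects.
# Field names ("value_usd", "accepted") are assumptions, not RLI's schema.

def automation_rate(projects: list[dict]) -> dict:
    total_value = sum(p["value_usd"] for p in projects)
    automated_value = sum(p["value_usd"] for p in projects if p["accepted"])
    return {
        "by_count": sum(p["accepted"] for p in projects) / len(projects),
        "by_value": automated_value / total_value,  # dollar-weighted
    }

projects = [
    {"value_usd": 1200, "accepted": False},  # e.g. an animation project
    {"value_usd": 300,  "accepted": True},   # e.g. a small data-cleaning job
    {"value_usd": 950,  "accepted": False},
]
print(automation_rate(projects))  # by_count ≈ 0.33, by_value ≈ 0.12
```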
Upwork HAPI
The first benchmark measuring how human expertise amplifies AI agent performance.
The only benchmark that measures human+agent collaboration as its primary signal, not agent performance in isolation.
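One simple way to express that collaboration signal is as an uplift over the agent-alone baseline. The sketch below is a hypothetical illustration of the idea, not Upwork HAPI's published methodology.

```python
# Illustrative only: human-amplification expressed as relative uplift over the
# agent-alone baseline. The metric and names are assumptions, not HAPI's.

def uplift(agent_alone_score: float, human_plus_agent_score: float) -> float:
    """Relative improvement when a human expert steers the agent."""
    if agent_alone_score == 0:
        return float("inf") if human_plus_agent_score > 0 else 0.0
    return (human_plus_agent_score - agent_alone_score) / agent_alone_score

print(uplift(0.40, 0.62))  # 0.55 -> human guidance adds ~55% over the agent alone
```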
Additional Evaluations
METR Time Horizon
How long can AI agents work autonomously? The 50%-success time horizon, which METR estimates has been doubling roughly every seven months, with recent models trending faster.
Single interpretable number: 'This model can complete tasks that take a human expert X hours, with 50% reliability' is more meaningful to policymakers than '74.2% on MMLU'.
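The time horizon itself comes from relating success probability to human task duration and reading off where the fit crosses 50%. Below is a minimal sketch of that idea, assuming a simple logistic fit on (human minutes, success) pairs; the data and fitting choices are illustrative, not METR's published pipeline.

```python
# Simplified reconstruction of a 50%-success time horizon estimate: fit success
# probability against log(human task duration) and solve for the duration where
# the fitted probability crosses 0.5. Illustrative data, not METR's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (human minutes to complete the task, did the agent succeed?)
tasks = [(2, 1), (5, 1), (15, 1), (30, 1), (60, 0), (120, 1),
         (240, 0), (480, 0), (960, 0), (1920, 0)]
X = np.log([[minutes] for minutes, _ in tasks])
y = np.array([success for _, success in tasks])

fit = LogisticRegression().fit(X, y)
# p = 0.5 where the linear term is zero: coef * log(t) + intercept = 0
horizon_minutes = np.exp(-fit.intercept_[0] / fit.coef_[0][0])
print(f"50% time horizon ≈ {horizon_minutes:.0f} human-minutes")
```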
BigLaw Bench
Real legal work quality — what percent of a lawyer-quality deliverable does AI produce?
Negative-point scoring: hallucinating legal citations actively hurts your score — unlike most benchmarks where wrong answers simply fail to earn points.
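A toy version of that scoring rule makes the difference concrete: rubric credit accrues as a fraction of the ideal deliverable, and each fabricated citation subtracts points. The weights and categories below are invented for illustration, not BigLaw Bench's actual rubric.

```python
# Illustrative sketch of rubric scoring with negative points for hallucinated
# citations. Penalty weight and inputs are assumptions, not BigLaw Bench's.

def score_deliverable(rubric_items_met: int, rubric_items_total: int,
                      hallucinated_citations: int,
                      penalty_per_hallucination: float = 0.15) -> float:
    """Fraction of a lawyer-quality deliverable, minus citation penalties."""
    base = rubric_items_met / rubric_items_total
    return base - penalty_per_hallucination * hallucinated_citations

# Meeting 7 of 10 rubric items but inventing two citations nets 0.40, not 0.70.
print(round(score_deliverable(7, 10, hallucinated_citations=2), 2))  # 0.4
```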