Mercor APEX + APEX-Agents
Professional knowledge work benchmark across investment banking, consulting, law, and medicine.
Overview
APEX (AI Productivity Index) is a family of benchmarks from Mercor — an AI recruiting and expert-data company — designed to measure AI performance on economically valuable professional knowledge work. The benchmark explicitly connects AI performance to the question: 'Can these models autonomously complete tasks that would otherwise require expensive human professionals?'
APEX (non-agentic) provides long-form professional tasks to models without tool access: a prompt, supporting documents, and a grading rubric created by industry experts with 7+ years of experience at top firms (Goldman Sachs, McKinsey, Cravath, UPenn Medicine). Models generate deliverables that a panel of LLM judges scores against per-criterion rubrics.
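A minimal sketch of what rubric-based LLM-judge scoring can look like. The `Criterion` fields, judge interface, and aggregation below are illustrative assumptions, not Mercor's published implementation; in a real harness each judge would wrap an LLM call with a grading prompt.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # e.g. "valuation deck reflects the updated EBITDA figures"
    weight: float = 1.0

def score_deliverable(deliverable: str, rubric: list[Criterion], judges: list) -> float:
    """Weighted fraction of rubric criteria satisfied, averaged over judges.

    Each judge is a callable (deliverable, criterion_description) -> bool;
    here a stand-in for an LLM grading call.
    """
    total = sum(c.weight for c in rubric)
    per_judge = []
    for judge in judges:
        met = sum(c.weight for c in rubric if judge(deliverable, c.description))
        per_judge.append(met / total)
    return sum(per_judge) / len(per_judge)
```

Scores like "67.2% of rubric criteria satisfied" in the results below are averages of exactly this kind of per-criterion fraction across tasks.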
APEX-Agents (launched January 2026) is the agentic extension: AI agents are deployed into fully simulated workplace environments ('worlds') containing hundreds of realistic files spread across Google Workspace, Box, email, and calendar, plus code execution. Each world was built by VPs and MDs who spent 5-10 days on it. Agents must navigate these environments autonomously to complete multi-hour professional tasks.
The benchmark includes a striking post-training result: Applied Compute trained GLM-4.7 on fewer than 2,000 Mercor-labeled expert tasks and achieved a 20-percentage-point improvement across all domains, jumping from 17th to 4th place overall and nearly doubling the base model's performance. The gains generalized to benchmarks the model was never trained on, including GDPval and Toolathlon.
How It Works
Task setup, inputs, outputs, and evaluation.
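Concretely, each task can be thought of as a structured record that the harness hands to the model (APEX) or uses to initialize a world (APEX-Agents). The sketch below is an assumed shape inferred from the task descriptions on this page; the field names are hypothetical, not the published APEX schema.

```python
from dataclasses import dataclass, field

@dataclass
class ApexTask:
    domain: str                  # "investment_banking", "consulting", "law", "medicine"
    prompt: str                  # the assignment, as a professional would receive it
    world_id: str | None = None  # APEX-Agents only: which simulated workspace to load
    attachments: list[str] = field(default_factory=list)  # supporting documents (APEX)
    rubric: list[str] = field(default_factory=list)       # per-criterion grading items
    deliverable: str = "document"  # expected output: "deck", "memo", "model", ...

# Non-agentic APEX passes prompt + attachments straight to the model; in
# APEX-Agents the agent is dropped into world_id and must locate inputs itself.
```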
Example Tasks
Real tasks from this evaluation system.
Investment Banking — M&A DCF Update
Agent is embedded in a 172-file M&A dataroom for a fictitious European energy company acquisition. Must find the comparable companies spreadsheet, update the DCF model with new EBITDA figures found in a client email, and produce a revised valuation summary deck.
Management Consulting — Cost Reduction Deck
Week-long consulting project for a European oil & gas company. Agent must review client documents, synthesize research PDFs, draft a slide deck with cost-cutting recommendations, and update the project timeline in the calendar.
Corporate Law — Risk Memo
Agent reviews a share purchase agreement, cross-references regulatory guidance PDFs, drafts a risk memo identifying compliance gaps, and sends a summary email to the partner.
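All three examples share the same skeleton: an agent loops over tool calls against the simulated workspace until it has produced the deliverable. A minimal sketch of such a loop follows; the tool names and the `llm` policy interface are invented for illustration and are not the Archipelago harness's actual API.

```python
# Hypothetical tool surface for a simulated workplace world.
TOOLS = {
    "search_files": lambda query: ["dataroom/comps.xlsx"],  # locate documents
    "read_file":    lambda path: "file contents...",        # open docs, emails
    "write_file":   lambda path, content: None,             # draft the deliverable
    "send_email":   lambda to, body: None,                  # e.g. summary to partner
}

def run_agent(task_prompt: str, llm, max_steps: int = 50) -> list:
    """Drive an LLM policy through the world until it declares the task done.

    `llm` maps the interaction history to either {"done": True} or
    {"tool": name, "args": {...}} -- a stand-in for a real agent policy.
    """
    history = [{"role": "task", "content": task_prompt}]
    for _ in range(max_steps):
        action = llm(history)
        if action.get("done"):
            break
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"tool": action["tool"], "result": result})
    return history
```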
Results
Model performance on both evaluations. APEX scores are the fraction of rubric criteria satisfied; APEX-Agents scores are Pass@1 task completion.
APEX (non-agentic)

| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 67.2% ± 2.4% |
| 2 | Opus 4.6 | 65.7% ± 2.6% |
| 3 | Gemini 3 Pro | 65.1% ± 2.4% |
| 4 | Grok 4 | 63.5% |

APEX-Agents

| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 38.4% ± 3.9% |
| 2 | Gemini 3 Pro | 33.5% ± 3.6% |
| 3 | Opus 4.6 | ~30% |
| 4 | Opus 4.5 | 18.4% |
Key Findings
No model currently meets the production bar for autonomous task completion: even the best models satisfy only ~67% of rubric criteria on APEX and complete only ~38% of APEX-Agents tasks on the first attempt.
Post-training on fewer than 2,000 expert-labeled tasks nearly doubled GLM-4.7's performance on APEX-Agents (17th → 4th place), with gains generalizing to other benchmarks; this is the strongest evidence yet that expert-data fine-tuning transfers.
Corporate law tasks are consistently the hardest across both APEX and APEX-Agents (law scores run ~5-10 points below consulting and banking), suggesting legal reasoning requires particularly specialized training.
Pass@8 (best of 8 attempts) runs ~15-20 points higher than Pass@1 for top models, showing that agents can complete these tasks but do so inconsistently, a key reliability problem for production deployment.
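For reference, Pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021) from n sampled attempts per task. The sketch below shows that standard formula; APEX's published harness may compute it differently.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: attempts sampled per task, c: attempts that succeeded,
    k: attempt budget being estimated. Returns P(>=1 success in k draws).
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so k draws must hit a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# A task solved on 3 of 8 sampled attempts:
print(pass_at_k(8, 3, 1))  # 0.375 -> contributes to Pass@1
print(pass_at_k(8, 3, 8))  # 1.0   -> contributes to Pass@8
```

The per-task gap between these two numbers, averaged over the suite, is exactly the ~15-20 point reliability gap described above.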
The Applied Compute experiment showed that improved performance came from better process (preserving details, sanity-checking, revising) rather than domain knowledge — suggesting these skills transfer across job categories.
What Makes It Unique
- ✓ World-building infrastructure: 33 environments containing ~166 real files each in authentic Google Workspace + email + calendar + file storage; no other benchmark deploys this level of workplace fidelity.
- ✓ Economic alignment: task distributions explicitly mirror how professionals spend their time (e.g., 30% of IB tasks are DCF modeling because IB analysts spend 30% of their time on DCF modeling); see the weighting sketch after this list.
- ✓ Expert pedigree: tasks created by VPs and MDs with 5-10 years at Goldman Sachs, McKinsey, Latham & Watkins, not junior annotators or crowdworkers.
- ✓ Fully open-source: all 480 APEX-Agents tasks and the Archipelago evaluation infrastructure are CC-BY 4.0. No other professional-services benchmark is this open.
- ✓ Training signal validation: the Applied Compute result demonstrates that the benchmark's data teaches generalizable professional skills, not benchmark-specific patterns.
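The economic-alignment point implies a time-weighted aggregate rather than a uniform average over tasks. A sketch under assumed numbers follows; the category names and time shares are invented for illustration, not APEX's actual distribution.

```python
# Hypothetical share of an IB analyst's week spent per task category.
TIME_SHARE = {
    "dcf_modeling":    0.30,
    "comps_analysis":  0.25,
    "deck_production": 0.25,
    "dataroom_review": 0.20,
}

def weighted_score(per_category: dict[str, float]) -> float:
    """Aggregate score in which each category counts in proportion to the
    real-world time professionals spend on it."""
    return sum(TIME_SHARE[cat] * score for cat, score in per_category.items())

print(weighted_score({
    "dcf_modeling": 0.70, "comps_analysis": 0.65,
    "deck_production": 0.55, "dataroom_review": 0.60,
}))  # 0.63
```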
Controversies & Caveats
Mercor is a data company that sells expert training data; the benchmark validates its commercial product, creating a potential conflict of interest.
Web search is disabled for reproducibility, which limits real-world validity (real professionals search constantly); performance with web access would likely be higher.
Most worlds use fictional entities; some researchers question whether fictional scenarios fully replicate the cognitive complexity of real client engagements.
The fully open-source devset (480 APEX-Agents tasks) can be trained on. Applied Compute's result demonstrates that this works, but it also means devset scores are only trustworthy for models verified not to have trained on those tasks.