GDPval
Evaluating AI on real professional tasks from 44 occupations covering $3 trillion in annual wages.
Overview
GDPval was created by OpenAI and published in September 2025 to measure AI performance on tasks drawn directly from occupations that collectively produce the majority of US economic output. The name reflects its design philosophy: tasks come from jobs in the top industries contributing to US GDP (per Federal Reserve data), ensuring coverage of economically important work rather than academically convenient tasks.
The benchmark covers 44 occupations across 9 major industries: Finance & Insurance, Professional/Technical Services, Healthcare, Manufacturing, Information Technology, Real Estate, Wholesale Trade, Retail, and Education. Occupations were selected based on total wages (BLS May 2024 data) within each sector, filtered to those where at least 60% of O*NET tasks are 'knowledge work.' These 44 occupations collectively earn approximately $3 trillion annually.
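This sampling frame can be made concrete with a short sketch. Everything below is illustrative: the data structures, field names, and the per-industry cut are assumptions for exposition, not OpenAI's actual selection pipeline.

```python
# Illustrative sketch of a GDPval-style occupation sampling frame.
# Inputs are assumed: `industries` is a list ranked by GDP contribution, and
# each occupation record carries a BLS wage total plus an O*NET-derived
# knowledge-work share. Field names are hypothetical.

def select_occupations(industries, occupations, top_n_industries=9,
                       knowledge_threshold=0.60, per_industry=5):
    """Pick high-wage knowledge-work occupations from the top GDP industries."""
    selected = []
    for sector in industries[:top_n_industries]:
        candidates = [
            occ for occ in occupations
            if occ["industry"] == sector
            # Keep occupations where >= 60% of O*NET tasks are knowledge work.
            and occ["knowledge_task_share"] >= knowledge_threshold
        ]
        # Rank by total annual wages within the sector and take the top few.
        candidates.sort(key=lambda occ: occ["total_wages"], reverse=True)
        selected.extend(candidates[:per_industry])
    return selected
```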
1,320 tasks (30 per occupation) were built from actual work products created by professionals with an average of 14 years of experience. Evaluation uses blind pairwise comparison by expert graders: AI deliverable vs. human-produced deliverable, with 9 comparisons per prompt (3 samples × 3 graders).
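To make the grading scheme concrete, here is a minimal win+tie computation over those 9 verdicts per prompt. The function name and verdict labels are illustrative, not from the GDPval codebase.

```python
from collections import Counter

def win_tie_rate(grades):
    """Compute a model's win+tie rate from blind pairwise grades.

    `grades` holds one verdict per (sample, grader) pair -- 9 per task
    prompt (3 samples x 3 graders) -- each being 'model', 'human', or
    'tie', depending on which deliverable the expert preferred.
    """
    counts = Counter(grades)
    total = sum(counts.values())
    return (counts["model"] + counts["tie"]) / total

# Example: 3 wins, 2 ties, 4 losses across 9 comparisons -> 5/9 ~= 0.56
print(win_tie_rate(["model"] * 3 + ["tie"] * 2 + ["human"] * 4))
```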
Key findings: Claude Opus 4.1 leads overall at a 47.6% win+tie rate, driven primarily by the aesthetic quality of formatted outputs such as PDFs and slides. GPT-5 leads on accuracy and instruction-following. Performance has more than doubled from GPT-4o (spring 2024) to GPT-5 (summer 2025), following a near-linear improvement trend. GDPval-AA (an Artificial Analysis extension) provides an Elo-based agentic leaderboard with 200+ models.
How It Works
Task setup, inputs, outputs, and evaluation.
Example Tasks
Real tasks from this evaluation system.
Legal — Motion to Dismiss Brief
Lawyer
Draft a legal brief for a motion to dismiss a breach-of-contract lawsuit. Must include proper legal formatting, correct citation of applicable precedents, and compelling argumentation.
Nursing — Care Plan
Registered Nurse
Based on patient history, vitals, and physician notes, create a comprehensive nursing care plan with SMART goals, nursing diagnoses, and planned interventions.
Engineering — Technical Review Report
Civil/Mechanical Engineer
Review engineering blueprints and technical specifications for a structural component. Identify code compliance issues and prepare a written report with specific recommendations.
Results
Model performance on this evaluation.
| # | Model | Win+Tie Rate |
|---|---|---|
| 1 | Claude Opus 4.1 | 47.6% |
| 2 | GPT-5 | ~40% |
| 3 | o3 | 35.2% |
| 4 | o4-mini | 29.1% |
| 5 | GPT-4o | 12.5% |
Key Findings
Performance has more than doubled from GPT-4o (~12.5%) to Claude Opus 4.1 (47.6%) in roughly one year, following a near-linear improvement trend across 4 model generations.
Claude Opus 4.1 leads on aesthetics (formatting, slide layout, PDF/Excel/PPT output quality) while GPT-5 leads on accuracy (correct calculations, domain-specific knowledge, instruction-following).
Models complete GDPval tasks approximately 1.4x faster and 1.6x cheaper than human experts under realistic 'review and fix' workflows, significantly less than the frequently cited '100x' figure, which uses naïve inference timing (see the toy arithmetic after this list).
An experimental fine-tuned GPT-5 trained specifically on GDPval-style tasks showed further improvement, validating that targeted domain fine-tuning works for professional knowledge work.
GDPval-AA (Artificial Analysis's Elo extension) enables agentic evaluation with shell access and web browsing. Its Elo leaderboard shows GPT-5.5 and Claude Opus 4.7 competing at the top (~1771 Elo); a sketch of how such a leaderboard is computed appears below.
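The toy arithmetic referenced above, with assumed numbers (not figures from the paper), shows why the naive '100x' collapses once expert review and occasional redos are added back in:

```python
# Toy arithmetic (assumed numbers, not from the paper) showing why
# "100x faster" shrinks to ~1.4x under a review-and-fix workflow.

human_hours = 7.0      # expert does the task end to end (assumption)
model_minutes = 4.0    # naive wall-clock inference time (assumption)

naive_speedup = human_hours * 60 / model_minutes  # ~105x

# Realistic workflow: the expert still reviews the model output,
# fixes errors, and sometimes redoes the task entirely.
review_hours = 3.5     # reading, checking, and patching the draft
redo_rate = 0.2        # fraction of tasks the expert redoes from scratch
workflow_hours = model_minutes / 60 + review_hours + redo_rate * human_hours

realistic_speedup = human_hours / workflow_hours  # ~1.4x
print(f"naive: {naive_speedup:.0f}x, realistic: {realistic_speedup:.1f}x")
```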
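For reference, the sketch below applies the standard Elo update to head-to-head model matchups; the actual GDPval-AA methodology (K-factor, initialization, tie handling, aggregation) may differ.

```python
def expected(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, score_a, k=16):
    """Update both ratings after one matchup.

    score_a is 1.0 if A's deliverable wins, 0.5 for a tie, 0.0 if it loses.
    """
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical models start at 1500 and diverge as matchups accumulate.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
update(ratings, "model_a", "model_b", score_a=1.0)  # model_a wins one task
print(ratings)  # {'model_a': 1508.0, 'model_b': 1492.0}
```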
What Makes It Unique
- ✓ GDP-grounded occupational selection: the only benchmark that uses Federal Reserve GDP data plus BLS wage data to determine which occupations and tasks to include, a scientifically defensible sampling frame.
- ✓ $3 trillion coverage: tasks represent work from occupations collectively earning $3T/year, the largest economic footprint of any benchmark by far.
- ✓ Real expert work product: tasks built from actual deliverables by professionals with 14-year average experience, not constructed by researchers.
- ✓ Multi-format deliverables: models must produce .pdf, .xlsx, .ppt, and text outputs, making this the only benchmark testing realistic multi-format professional output.
- ✓ Double-blind human expert graders: the primary metric uses independent domain experts in a blind comparison, not LLM judges.
Controversies & Caveats
Self-evaluation bias: OpenAI created and published the benchmark, and featured it prominently while GPT-5 performed well. Questions about objectivity remain.
The claim that AI is '100x faster/cheaper' than humans is based on naïve inference timing. Under realistic review-and-iterate workflows, the advantage is 1.4x faster and 1.6x cheaper — still significant but far less dramatic.
Blind comparisons were not perfectly blind: Claude outputs used first-person phrasing; Grok referred to itself by name. Graders may have inferred model identity.
With only 30 tasks per occupation ('a limited, initial cut,' per the paper), results are not statistically robust for comparing models within specific occupations.
Physical and manual work is excluded by design (the 60% knowledge-work threshold), so the benchmark covers only a subset of GDP-contributing occupations.