Upwork HAPI
The first benchmark measuring how human expertise amplifies AI agent performance.
Overview
HAPI (Human+Agent Productivity Index) is unique among AI benchmarks in that it explicitly measures the value human expertise adds when working alongside AI agents, rather than measuring agents in isolation. Created by Upwork Inc. and announced on November 13, 2025, it was accepted to a NeurIPS 2025 workshop under the title 'UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI.'
Upwork built the benchmark from 322 real fixed-price jobs that actual clients had posted and verified freelancers had completed on its platform. Expert evaluators with 100% Job Success Scores (Upwork's highest rating) and 96,000+ collective hours on the platform created 5-20 pass/fail rubric criteria per job based on the original project description. AI agents first attempted each job autonomously, then expert freelancers provided structured feedback, and the agents made additional attempts.
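The evaluation loop can be pictured as a two-pass rubric grader: score the agent's autonomous attempt, inject one round of expert feedback, then score again. The sketch below is a simplified illustration of that protocol; the names (`RubricCriterion`, `grade`, `evaluate_job`) and types are invented for this example and are not Upwork's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricCriterion:
    """One pass/fail check derived from the original job post (5-20 per job)."""
    description: str
    passed_by: Callable[[str], bool]  # judges a submitted deliverable

def grade(deliverable: str, rubric: List[RubricCriterion]) -> float:
    """Fraction of rubric criteria the deliverable satisfies."""
    return sum(c.passed_by(deliverable) for c in rubric) / len(rubric)

def evaluate_job(run_agent: Callable[[str], str],
                 expert_review: Callable[[str], str],
                 job_brief: str,
                 rubric: List[RubricCriterion]) -> dict:
    """Two-stage scoring: grade before and after one human feedback cycle,
    so the delta is attributable to the expert's input."""
    first_attempt = run_agent(job_brief)                    # agent alone
    feedback = expert_review(first_attempt)                 # ~20-minute review
    second_attempt = run_agent(f"{job_brief}\n\nReviewer feedback:\n{feedback}")
    return {
        "agent_alone": grade(first_attempt, rubric),
        "with_human": grade(second_attempt, rubric),
    }
```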
The core finding: human+agent collaboration increased project completion rates by up to 70% compared to agents working alone. A single ~20-minute expert review cycle produced dramatic improvements. Data Science & Analytics: 64% (agent alone) → 93% (with human). Engineering & Architecture: 30% → 50%. Sales & Marketing: 17% → 31%.
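Those per-category figures work out to the following absolute and relative gains; this is plain arithmetic on the numbers quoted above.

```python
# Reported completion rates (agent alone, with human), in percent.
categories = {
    "Data Science & Analytics": (64, 93),
    "Engineering & Architecture": (30, 50),
    "Sales & Marketing": (17, 31),
}

for name, (alone, with_human) in categories.items():
    pp_gain = with_human - alone                   # gain in percentage points
    rel_gain = 100 * (with_human - alone) / alone  # gain relative to agent-alone
    print(f"{name}: +{pp_gain}pp ({rel_gain:.0f}% relative)")

# Data Science & Analytics: +29pp (45% relative)
# Engineering & Architecture: +20pp (67% relative)
# Sales & Marketing: +14pp (82% relative)
```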
HAPI is strategically tied to Upwork's Uma platform — a meta-orchestration agent that pairs AI systems with expert freelancers. The benchmark validates Upwork's thesis that human-AI pairs, not AI alone, are the future of professional work.
How It Works
Task setup, inputs, outputs, and evaluation.
Example Tasks
Real tasks from this evaluation system.
Data Science — Customer Churn Analysis
Data Science & Analytics
Analyze a CSV dataset of customer transactions, identify churn patterns, build a classification model, and produce a report with visualizations.
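For a sense of what a passing deliverable involves, here is a minimal sketch of the modeling step with scikit-learn. The file name and columns (`customers.csv`, `customer_id`, `churned`) are hypothetical placeholders, not the benchmark's actual data, and a real submission would also need the exploratory analysis, visualizations, and written report the rubric asks for.

```python
# Minimal churn-classification sketch (illustrative only; hypothetical schema).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")                 # hypothetical transactions export
X = pd.get_dummies(df.drop(columns=["customer_id", "churned"]))  # encode categoricals
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Report precision/recall on held-out customers; the full job would pair this
# with churn-pattern findings and charts keyed to the client's brief.
print(classification_report(y_test, model.predict(X_test)))
```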
Engineering — Structural Load Calculation
Engineering & Architecture
Create a structural load calculation spreadsheet for a single-story residential addition based on given dimensions and local building codes.
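The core of such a deliverable is tabulating areas against code-specified unit loads. The sketch below shows that arithmetic with placeholder numbers; the footprint, the 12 psf dead load, and the 40 psf residential live load are example values only, and real work must follow the governing local code.

```python
# Simplified floor-load tabulation for a single-story addition
# (example values only; actual loads and combinations come from the local code).
length_ft, width_ft = 20.0, 14.0   # hypothetical addition footprint
dead_load_psf = 12.0               # assumed floor dead load (example)
live_load_psf = 40.0               # typical residential live load (example)

area_sqft = length_ft * width_ft
total_dead_lb = dead_load_psf * area_sqft
total_live_lb = live_load_psf * area_sqft

# Simple D + L combination for illustration; real spreadsheets cover all
# required load combinations and member checks.
design_load_lb = total_dead_lb + total_live_lb
print(f"Area: {area_sqft:.0f} sqft")
print(f"Dead: {total_dead_lb:.0f} lb, Live: {total_live_lb:.0f} lb")
print(f"Design load (D + L): {design_load_lb:.0f} lb")
```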
Writing — SEO Blog Posts
Writing
Write 5 SEO-optimized blog posts for a dental practice targeting specific keywords with a professional but approachable tone.
Results
Model performance on this evaluation.
| # | Model | Completion Rate |
|---|---|---|
| 1 | Sonnet 4 | 93% (with human) |
| 2 | Sonnet 4 | 64% (alone) |
| 3 | GPT-5 | ~50% (with human) |
| 4 | Gemini 2.5 Pro | ~31% (with human) |
| 5 | GPT-5 | ~30% (alone) |
| 6 | Gemini 2.5 Pro | 17% (alone) |
Key Findings
Human+agent collaboration increased project completion by up to 70% vs. agent alone — the single largest documented amplification of AI performance through human guidance.
The 20-minute human review insight: even a brief, focused expert review cycle (~20 minutes) dramatically elevated completion rates — validating the economic viability of human-in-the-loop workflows.
The collaboration benefit is uneven across categories: Data Science benefited most (+29pp), suggesting that domain-specific judgment (choosing the right model, validating statistical assumptions) is where human expertise amplifies AI most.
Design & Creative tasks showed the most variance — some creative jobs benefited enormously from human aesthetic feedback while others showed less improvement.
The benchmark validates Upwork's business thesis: human+agent pairs outperform agents alone on the very tasks where AI is expected to replace human workers.
What Makes It Unique
- ✓ The only benchmark that measures human+agent collaboration as its primary signal, not agent performance in isolation.
- ✓ Dynamic benchmark: will be refreshed regularly with new real Upwork jobs to prevent overfitting — addressing a key weakness of static benchmarks.
- ✓ Two-stage evaluation: grades before AND after human feedback, directly quantifying the 'value of human oversight' per feedback iteration.
- ✓ Real economics: jobs were previously paid for by real clients at real market rates.
- ✓ NeurIPS 2025 Workshop acceptance gives academic credibility alongside commercial platform scale.
Controversies & Caveats
Jobs were explicitly selected for being simple (the bottom 6% of Upwork's gross services volume by complexity), so the results may not generalize to typical Upwork project complexity.
Upwork has financial incentive to show human-in-the-loop superiority — it preserves their human talent marketplace against pure-AI alternatives.
Only 3 models were tested in the initial release, so model coverage is narrower than in evaluations such as APEX or METR.
Rubrics are created by the same people who evaluate — potential for unconscious bias in choosing criteria that AI can/cannot meet.
100% JSS evaluators are extreme outliers among Upwork freelancers — may not represent typical client quality expectations.