Required Evaluation · Upwork Inc. · 2025

Upwork HAPI

The first benchmark measuring how human expertise amplifies AI agent performance.

Tasks: 322
Public: 322
Primary Metric: Completion rate (% of jobs where 100% of rubric criteria met); Human+Agent delta vs. Agent alone
Avg Human Time: Varies by job (9 hours to 100+ days per original completion)
Economic Value: 90% of jobs priced $10-$200; all under $500

Overview

HAPI (Human+Agent Productivity Index) is unique among AI benchmarks in that it explicitly measures the value human expertise adds when working alongside AI agents, rather than measuring agents in isolation. Created by Upwork Inc. and announced on November 13, 2025, it was accepted to a NeurIPS 2025 workshop under the title 'UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI.'

Upwork used 322 real fixed-price jobs previously posted by actual clients and completed by verified freelancers on its platform. Expert evaluators with 100% Job Success Scores (Upwork's highest rating) and 96,000+ collective hours on the platform created 5-20 pass/fail rubric criteria per job based on the original project description. AI agents first attempted each job autonomously, then expert freelancers provided structured feedback, and the agents made additional attempts.

The core finding: human+agent collaboration increased project completion rates by up to 70% compared to agents working alone. A single ~20-minute expert review cycle produced dramatic improvements. Data Science & Analytics: 64% (agent alone) → 93% (with human). Engineering & Architecture: 30% → 50%. Sales & Marketing: 17% → 31%.
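
The category numbers can be read either as percentage-point gains or as relative improvements over the agent-alone rate; the sketch below computes both from the figures stated above (how Upwork derives the headline "up to 70%" is not specified here, so this calculation is illustrative only):

```python
# Completion rates reported per category: (agent alone, with human feedback), in percent.
categories = {
    "Data Science & Analytics": (64, 93),
    "Engineering & Architecture": (30, 50),
    "Sales & Marketing": (17, 31),
}

for name, (alone, with_human) in categories.items():
    pp_gain = with_human - alone                    # absolute gain, percentage points
    rel_gain = 100 * (with_human - alone) / alone   # gain relative to the agent-alone rate
    print(f"{name}: +{pp_gain}pp ({rel_gain:.0f}% relative improvement)")
```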

HAPI is strategically tied to Upwork's Uma platform — a meta-orchestration agent that pairs AI systems with expert freelancers. The benchmark validates Upwork's thesis that human-AI pairs, not AI alone, are the future of professional work.

How It Works

Task setup, inputs, outputs, and evaluation.

Setup: Real fixed-price jobs from the Upwork marketplace. An expert freelancer with a 100% Job Success Score creates a job-specific rubric (5-20 criteria). The AI agent attempts the job autonomously, the expert provides structured feedback, and the agent attempts the job again with that feedback. The process measures standalone performance AND the value of human guidance.
Input: The original Upwork job posting (as written by the real client) plus any files the client provided. The expert freelancer creates the rubric based on the job description.
Output: The complete job deliverable — varies by category: code, designs, written content, translations, data analysis reports, spreadsheets, marketing materials.
Evaluation: An expert grader checks each criterion pass/fail. Completion = 100% of criteria met. The score before human feedback measures standalone agent performance; the score after feedback measures human+agent performance; the delta is the value of human expertise (see the scoring sketch below).
Metric: Completion rate (% of jobs meeting 100% of rubric criteria); Human+Agent improvement over Agent alone.
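
To make the two-stage scoring concrete, here is a minimal sketch of how per-job rubric results could be aggregated into the benchmark's metrics. The data shapes and function names are assumptions for illustration, not Upwork's actual evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class JobResult:
    """Pass/fail rubric outcomes for one job, before and after expert feedback."""
    criteria_before: list[bool]   # one bool per rubric criterion, agent alone
    criteria_after: list[bool]    # same criteria, after one expert feedback cycle

def completed(criteria: list[bool]) -> bool:
    # A job counts as completed only if 100% of its rubric criteria pass.
    return all(criteria)

def completion_rate(jobs: list[JobResult], after_feedback: bool) -> float:
    done = sum(
        completed(j.criteria_after if after_feedback else j.criteria_before)
        for j in jobs
    )
    return 100 * done / len(jobs)

def human_agent_delta(jobs: list[JobResult]) -> float:
    # The benchmark's core signal: completion-rate gain attributable to human feedback.
    return completion_rate(jobs, True) - completion_rate(jobs, False)

# Example: two jobs with 5-criterion rubrics.
jobs = [
    JobResult([True, True, False, True, True], [True] * 5),
    JobResult([True, False, False, True, True], [True, True, True, True, False]),
]
print(completion_rate(jobs, after_feedback=False))  # 0.0  (neither job fully passes)
print(completion_rate(jobs, after_feedback=True))   # 50.0 (one job now fully passes)
print(human_agent_delta(jobs))                      # 50.0
```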

Example Tasks

Real tasks from this evaluation system.

#1

Data Science — Customer Churn Analysis

$150
8-12 hours (original human completion)

Data Science & Analytics

Analyze a CSV dataset of customer transactions, identify churn patterns, build a classification model, and produce a report with visualizations.

What the Agent Receives

CSV dataset (customer_id, transaction_history, demographic_data), job description specifying required model type and visualization format.

What It Must Produce

Jupyter notebook or Python script with exploratory analysis, trained classification model, confusion matrix, ROC curve, and written report.

How Success Is Judged

Criteria: (1) Correct data preprocessing with missing value handling, (2) Appropriate model chosen with justification, (3) Model accuracy reported with validation, (4) ROC/AUC visualizations included, (5) Written interpretation addresses key churn drivers. Agent alone: 64% completion. With expert feedback on methodology: 93%.
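
For a sense of what criteria 1-4 are checking, here is a minimal sketch of that kind of deliverable, assuming a flat, pandas-readable CSV with a binary "churned" label; the file name, column names, and logistic-regression baseline are placeholders, not part of the actual task:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

df = pd.read_csv("customers.csv")
df = df.fillna(df.median(numeric_only=True))        # criterion 1: handle missing values

X = df.drop(columns=["customer_id", "churned"])     # assumed feature/label layout
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)           # criterion 2: interpretable baseline, justified in the report
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
preds = (proba >= 0.5).astype(int)
print(confusion_matrix(y_test, preds))              # criterion 3: accuracy reported on held-out data
print("AUC:", roc_auc_score(y_test, proba))         # criterion 4: basis for the ROC/AUC visualization
```
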
#2

Engineering — Structural Load Calculation

$120
6-10 hours (original human completion)

Engineering & Architecture

Create a structural load calculation spreadsheet for a single-story residential addition based on given dimensions and local building codes.

What the Agent Receives

Residential addition floor plan dimensions, local building codes (IBC 2021 relevant sections), load requirements.

What It Must Produce

Excel spreadsheet with dead load, live load, wind load, and seismic load calculations per code section, with formulas showing work.

How Success Is Judged

Criteria: (1) Correct dead load calculation per material specifications, (2) Live load per IBC table values, (3) Wind load per ASCE 7, (4) All formulas traceable to code sections, (5) Totals correct. Agent alone: ~30%. With expert engineering review: ~50%.
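
To illustrate the shape of the calculation (not the code-compliant spreadsheet an engineer would deliver), here is a small sketch using placeholder unit weights, the common 40 psf residential live load, and the basic ASCE 7 LRFD combination 1.2D + 1.6L; every value below is an assumption, whereas the real deliverable must trace each number to a specific code section:

```python
# Illustrative only -- NOT a code-compliant structural calculation.
area_sqft = 320.0                      # floor area of the addition (assumed)

dead_psf = {                           # assumed material unit weights, psf
    "roofing": 10.0,
    "framing_and_sheathing": 7.0,
    "drywall_and_finishes": 8.0,
}
live_psf = 40.0                        # typical residential live load (IBC Table 1607.1)

D = sum(dead_psf.values()) * area_sqft   # total dead load, lb
L = live_psf * area_sqft                 # total live load, lb
factored = 1.2 * D + 1.6 * L             # basic LRFD combination (ASCE 7)

print(f"Dead: {D:,.0f} lb  Live: {L:,.0f} lb  Factored (1.2D + 1.6L): {factored:,.0f} lb")
```
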
#3

Writing — SEO Blog Posts

$75
5-8 hours (original human completion)

Writing

Write 5 SEO-optimized blog posts for a dental practice targeting specific keywords with a professional but approachable tone.

What the Agent Receives

Target keywords list, brand voice guide, topic briefs for each post, competitor blog examples.

What It Must Produce

5 blog posts (~800 words each) with proper header structure, keyword integration, internal link suggestions, and meta descriptions.

How Success Is Judged

Criteria: (1) Each post targets specified keyword naturally, (2) Professional yet approachable tone consistent across all 5, (3) H1/H2/H3 structure present, (4) No keyword stuffing, (5) Meta descriptions under 160 chars. With editorial feedback on tone: +17pp completion.
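
Some of these criteria are mechanically checkable, while tone and keyword stuffing still require human judgment; the sketch below is an illustrative checker for the mechanical parts (the function name, the Markdown-style header assumption, and the density heuristic are assumptions, not part of the benchmark's grading):

```python
import re

def check_post(markdown_text: str, meta_description: str, keyword: str) -> dict:
    # Collect Markdown headers (#, ##, ###) at line starts.
    headers = re.findall(r"^(#{1,3})\s", markdown_text, flags=re.MULTILINE)
    keyword_count = markdown_text.lower().count(keyword.lower())
    word_count = len(markdown_text.split())
    return {
        "meta_under_160_chars": len(meta_description) <= 160,     # criterion 5
        "has_h1_h2_h3_structure": {"#", "##", "###"} <= set(headers),  # criterion 3
        "keyword_present": keyword_count >= 1,                    # criterion 1 (partial)
        "keyword_density_pct": round(100 * keyword_count / max(word_count, 1), 2),  # rough stuffing signal
    }

print(check_post(
    "# Dental Implants\n## Costs\n### Financing\nDental implants ...",
    "Learn what dental implants cost and how financing works.",
    "dental implants",
))
```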

Results

Model performance on this evaluation.

# | Model | Score
1 | Sonnet 4 | 93% (with human)
2 | Sonnet 4 | 64% (alone)
3 | GPT-5 | ~50% (with human)
4 | Gemini 2.5 Pro | ~31% (with human)
5 | GPT-5 | ~30% (alone)
6 | Gemini 2.5 Pro | 17% (alone)

Key Findings

  • Human+agent collaboration increased project completion by up to 70% vs. agent alone — the single largest documented amplification of AI performance through human guidance.

  • The 20-minute human review insight: even a brief, focused expert review cycle (~20 minutes) dramatically elevated completion rates — validating the economic viability of human-in-the-loop workflows.

  • The collaboration benefit is uneven across categories: Data Science benefited most (+29pp), suggesting that domain-specific judgment (choosing the right model, validating statistical assumptions) is where human expertise amplifies AI most.

  • Design & Creative tasks showed the most variance — some creative jobs benefited enormously from human aesthetic feedback while others showed less improvement.

  • The benchmark validates Upwork's business thesis: human+agent pairs outperform agents alone on the very tasks that AI is expected to take over from humans.

What Makes It Unique

  • The only benchmark that measures human+agent collaboration as its primary signal, not agent performance in isolation.

  • Dynamic benchmark: will be refreshed regularly with new real Upwork jobs to prevent overfitting — addressing a key weakness of static benchmarks.

  • Two-stage evaluation: grades before AND after human feedback, directly quantifying the 'value of human oversight' per feedback iteration.

  • Real economics: jobs were previously paid for by real clients at real market rates.

  • NeurIPS 2025 Workshop acceptance gives academic credibility alongside commercial platform scale.

Controversies & Caveats

Jobs were explicitly selected for being simple (the bottom 6% of Upwork's gross services volume by complexity). Results don't generalize to typical Upwork project complexity.

Upwork has a financial incentive to show human-in-the-loop superiority — it preserves its human talent marketplace against pure-AI alternatives.

Only 3 models were tested in the initial release, limiting model comparisons relative to benchmarks like APEX or METR.

Rubrics are created by the same experts who grade them — a potential source of unconscious bias in choosing criteria that AI can or cannot meet.

Evaluators with 100% JSS are extreme outliers among Upwork freelancers — they may not represent typical client quality expectations.

Links