SWE-Lancer
Real Upwork freelance software tasks mapped to $1M in economic value.
What It Tests
SWE-Lancer, published by OpenAI at ICML 2025, maps AI coding performance directly to economic value. Every task corresponds to a real payment made on Upwork: from $50 bug fixes to $32,000 feature implementations, for a total of $1,000,000 across 1,488 tasks.
The benchmark uses the Expensify open-source repository — a real commercial full-stack application — rather than the popular Python OSS repos used by SWE-bench. This means tasks involve the full stack (React Native, web, API, database) and are evaluated using end-to-end Playwright browser automation tests that simulate real user workflows, not narrow unit tests.
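To make the grading style concrete, below is a minimal sketch of what an end-to-end check in this vein could look like using Playwright's Python API. The URL, selectors, and login flow are invented placeholders for illustration, not the benchmark's actual Expensify test code.

```python
# Hypothetical sketch of an end-to-end grading test in the style SWE-Lancer
# describes: drive the full app in a real browser and assert on user-visible
# behavior. All selectors, URLs, and credentials here are illustrative
# placeholders, not taken from the real Expensify test suite.
from playwright.sync_api import sync_playwright, expect

def test_expense_submission_flow():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Log in to an assumed local dev build of the app.
        page.goto("http://localhost:8082")
        page.fill("[data-testid=login-email]", "tester@example.com")
        page.click("[data-testid=login-submit]")

        # Exercise the user workflow the patch is supposed to fix.
        page.click("[data-testid=new-expense]")
        page.fill("[data-testid=amount-input]", "42.00")
        page.click("[data-testid=submit-expense]")

        # The task only counts as solved if the user-visible outcome is correct.
        expect(page.locator("[data-testid=confirmation-banner]")).to_be_visible()
        browser.close()
```

Because the assertion targets user-visible behavior in a running browser, a patch that satisfies a narrow unit test but breaks the actual workflow still fails.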
Two task types exist: IC SWE tasks (764 tasks, $414K value) where the model must produce a code patch, and SWE Manager tasks (724 tasks, $585K value) where the model selects the best implementation proposal from 3-4 options submitted by real freelancers — mimicking engineering management decisions.
The economic framing makes SWE-Lancer uniquely interpretable: a model that 'earns' $208,050 of the $500,800 Diamond subset has autonomously completed about 41% of the economic value of those tasks. Even the best models failed the majority of tasks, with full-stack E2E tests catching bugs that narrower unit tests would miss.
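In other words, the headline number is a dollar-weighted pass rate rather than a count of solved tasks. A minimal sketch of that computation, assuming per-task payouts and pass/fail outcomes (the helper below is illustrative, not the official harness):

```python
# Dollar-weighted scoring sketch: a model "earns" the payout of every task it
# solves, and the headline score is earned dollars divided by total dollars
# at stake. The function and its inputs are illustrative, not the benchmark's
# actual evaluation code.
def earned_value(tasks: list[tuple[float, bool]]) -> float:
    """tasks: (price_usd, solved) pairs. Returns the fraction of value earned."""
    total = sum(price for price, _ in tasks)
    earned = sum(price for price, solved in tasks if solved)
    return earned / total

# Reproducing the Diamond-subset figure quoted above:
print(f"{208_050 / 500_800:.1%}")  # -> 41.5%
```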
Task Anatomy
How a single task is structured.
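The dataset's exact schema is not reproduced here, but from the description above a single task plausibly bundles a real payout, a task type, a problem statement, and either end-to-end tests (IC SWE) or candidate proposals (SWE Manager). A hypothetical sketch, with all field names invented for illustration:

```python
# Hypothetical sketch of a single SWE-Lancer task record, inferred from the
# prose above; field names are illustrative, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class SWELancerTask:
    task_id: str
    title: str                          # e.g. the original Upwork issue title
    description: str                    # problem statement given to the model
    price_usd: float                    # real payout made on Upwork
    task_type: Literal["ic_swe", "swe_manager"]

    # IC SWE tasks: the model must produce a patch that passes these
    # end-to-end Playwright tests against the Expensify app.
    e2e_test_paths: list[str] = field(default_factory=list)

    # SWE Manager tasks: the model must pick the best of 3-4 freelancer
    # proposals; correctness is judged against the hiring manager's choice.
    proposals: list[str] = field(default_factory=list)
    correct_proposal_index: Optional[int] = None
```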
Example Tasks
3 real examples from the benchmark.
Small Bug Fix — iOS submission button
Represents the 88% of SWE-Lancer tasks that are bug fixes rather than features.
Manager Decision — PDF receipt attachment
SWE Manager tasks test judgment about software design tradeoffs — an underexplored capability distinct from code generation.
Feature Implementation — Recurring subscriptions
The highest-value tasks ($8K–$32K) typically require architectural changes across multiple subsystems.
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 66.3%* |
| 2 | o3 | 37.3%* |
| 3 | Sonnet 4 | 26.2% |
| 4 | GPT-4o | ~9% |
* Self-reported by the model's creator, not independently verified
Score Over Time
Performance progression across model generations.
Key Findings
The first benchmark to directly map AI performance to dollar-denominated economic value — every task has a real Upwork price tag.
E2E browser automation testing (Playwright) catches significantly more incorrect solutions than narrow unit tests, explaining why SWE-Lancer scores are lower than SWE-bench scores on equivalent task difficulty.
SWE Manager tasks (selecting best implementation proposal) proved nearly as challenging as IC SWE coding tasks — suggesting AI judgment on software design is a distinct, underexplored capability.
Even at the $8,000 level, tasks demanded multi-subsystem architectural changes from the model; these tasks saw near-0% success rates in the original paper, in sharp contrast with the high scores the same models post on simpler coding evals.
Performance roughly tripled from GPT-4o (~9%) to Claude 3.5 Sonnet (~26%) in the original paper, demonstrating rapid progress on real full-stack engineering.
Controversies & Caveats
Known limitations and criticisms.
Single codebase limitation: all 1,488 tasks come from the Expensify repository. Performance may not generalize to other commercial full-stack codebases.
The original dataset required live internet access during execution, introducing environment variability. The July 2025 update removed this dependency.
Limited leaderboard: only a handful of models (roughly 4-6) have been formally evaluated, all by OpenAI. Anthropic and Google have not published their own results, even though Claude 3.5 Sonnet was the best-performing model in the original paper.
Management task validity: SWE Manager decisions are graded against a single engineering manager's choices. Although expert reviewers showed 99% inter-rater agreement, business decisions of this kind still leave room for legitimate disagreement.