SWE-Lancer
Real Upwork freelance software tasks mapped to $1M in economic value.
What It Tests
SWE-Lancer, published by OpenAI at ICML 2025, maps AI coding performance directly to economic value. Every task corresponds to a real payment made on Upwork: from $50 bug fixes to $32,000 feature implementations, for a total of $1,000,000 across 1,488 tasks.
The benchmark uses the Expensify open-source repository — a real commercial full-stack application — rather than the popular Python OSS repos used by SWE-bench. This means tasks involve the full stack (React Native, web, API, database) and are evaluated using end-to-end Playwright browser automation tests that simulate real user workflows, not narrow unit tests.
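To make the grading style concrete, below is a minimal sketch of what an end-to-end check in this vein could look like using Playwright's Python API. The URL, selectors, and login flow are invented placeholders for illustration, not the benchmark's actual Expensify test code.

```python
# Hypothetical sketch of an end-to-end grading test in the style SWE-Lancer
# describes: drive the full app in a real browser and assert on user-visible
# behavior. All selectors, URLs, and credentials here are illustrative
# placeholders, not taken from the real Expensify test suite.
from playwright.sync_api import sync_playwright, expect

def test_expense_submission_flow():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Log in to an assumed local dev build of the app.
        page.goto("http://localhost:8082")
        page.fill("[data-testid=login-email]", "tester@example.com")
        page.click("[data-testid=login-submit]")

        # Exercise the user workflow the patch is supposed to fix.
        page.click("[data-testid=new-expense]")
        page.fill("[data-testid=amount-input]", "42.00")
        page.click("[data-testid=submit-expense]")

        # The task only counts as solved if the user-visible outcome is correct.
        expect(page.locator("[data-testid=confirmation-banner]")).to_be_visible()
        browser.close()
```

Because the assertion targets user-visible behavior in a running browser, a patch that satisfies a narrow unit test but breaks the actual workflow still fails.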
Two task types exist: IC SWE tasks (764 tasks, $414K value) where the model must produce a code patch, and SWE Manager tasks (724 tasks, $585K value) where the model selects the best implementation proposal from 3-4 options submitted by real freelancers — mimicking engineering management decisions.
The economic framing makes SWE-Lancer uniquely interpretable: a model that 'earns' $208,050 of the $500,800 Diamond subset has autonomously completed about 41% of the economic value of those tasks. Even the best models failed the majority of tasks, with full-stack E2E tests catching bugs that narrower unit tests would miss.
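In other words, the headline number is a dollar-weighted pass rate rather than a count of solved tasks. A minimal sketch of that computation, assuming per-task payouts and pass/fail outcomes (the helper below is illustrative, not the official harness):

```python
# Dollar-weighted scoring sketch: a model "earns" the payout of every task it
# solves, and the headline score is earned dollars divided by total dollars
# at stake. The function and its inputs are illustrative, not the benchmark's
# actual evaluation code.
def earned_value(tasks: list[tuple[float, bool]]) -> float:
    """tasks: (price_usd, solved) pairs. Returns the fraction of value earned."""
    total = sum(price for price, _ in tasks)
    earned = sum(price for price, solved in tasks if solved)
    return earned / total

# Reproducing the Diamond-subset figure quoted above:
print(f"{208_050 / 500_800:.1%}")  # -> 41.5%
```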
Task Anatomy
How a single task is structured.
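The dataset's exact schema is not reproduced here, but from the description above a single task plausibly bundles a real payout, a task type, a problem statement, and either end-to-end tests (IC SWE) or candidate proposals (SWE Manager). A hypothetical sketch, with all field names invented for illustration:

```python
# Hypothetical sketch of a single SWE-Lancer task record, inferred from the
# prose above; field names are illustrative, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class SWELancerTask:
    task_id: str
    title: str                          # e.g. the original Upwork issue title
    description: str                    # problem statement given to the model
    price_usd: float                    # real payout made on Upwork
    task_type: Literal["ic_swe", "swe_manager"]

    # IC SWE tasks: the model must produce a patch that passes these
    # end-to-end Playwright tests against the Expensify app.
    e2e_test_paths: list[str] = field(default_factory=list)

    # SWE Manager tasks: the model must pick the best of 3-4 freelancer
    # proposals; correctness is judged against the hiring manager's choice.
    proposals: list[str] = field(default_factory=list)
    correct_proposal_index: Optional[int] = None
```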
Example Tasks
3 real examples from the benchmark.
Small Bug Fix — iOS submission button
Represents the 88% of SWE-Lancer tasks that are bug fixes rather than features.
Manager Decision — PDF receipt attachment
SWE Manager tasks test judgment about software design tradeoffs — an underexplored capability distinct from code generation.
Feature Implementation — Recurring subscriptions
The highest-value tasks ($8K–$32K) typically require architectural changes across multiple subsystems.
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 66.3%* |
| 2 | o3 | 37.3%* |
| 3 | Sonnet 4 | 26.2% |
| 4 | GPT-4o | ~9% |
* Self-reported by the model's creator, not independently verified
Score Over Time
Performance progression across model generations.
Key Findings
The first benchmark to directly map AI performance to dollar-denominated economic value — every task has a real Upwork price tag.
E2E browser automation testing (Playwright) catches significantly more incorrect solutions than narrow unit tests, explaining why SWE-Lancer scores are lower than SWE-bench scores on equivalent task difficulty.
SWE Manager tasks (selecting best implementation proposal) proved nearly as challenging as IC SWE coding tasks — suggesting AI judgment on software design is a distinct, underexplored capability.
Even at the $8,000 level, tasks demanded multi-subsystem architectural changes from the model; these tasks saw near-0% success rates in the original paper, in sharp contrast with the high scores the same models post on simpler coding evals.
Performance roughly tripled from GPT-4o (~9%) to Claude 3.5 Sonnet (~26%) in the original paper, demonstrating rapid progress on real full-stack engineering.
Controversies & Caveats
Known limitations and criticisms.
Single codebase limitation: all 1,488 tasks come from the Expensify repository. Performance may not generalize to other commercial full-stack codebases.
The original dataset required live internet access during execution, introducing environment variability. The July 2025 update removed this dependency.
Limited leaderboard: only a handful of models (roughly 4-6) have been formally evaluated, all by OpenAI. Anthropic and Google have not published their own results, even though Claude 3.5 Sonnet was the best-performing model in the original paper.
Management task validity: SWE Manager decisions are graded against a single engineering manager's choices. Although expert reviewers showed 99% inter-rater agreement, business decisions of this kind still leave room for legitimate disagreement.