Coding · Contaminated · ⚠ Deprecated: training data leakage confirmed

SWE-bench

Can AI resolve real GitHub issues on production codebases?

Tasks: 2,294
Year: 2023
Creator: Carlos Jimenez, John Yang, et al.
Metric: % Resolved
Random Chance: 0%

What It Tests

SWE-bench tests whether AI systems can act as autonomous software engineers on real production codebases. Given a GitHub issue description and the full repository at a specific commit, the model must produce a code patch that makes failing tests pass without breaking passing ones. Tasks are sourced from merged pull requests across 12 popular Python repositories (Django, Flask, SymPy, scikit-learn, and others), ensuring every task has a verifiable ground truth.
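Below is a minimal sketch of inspecting a single task record from the public dataset. It assumes the Hugging Face release (princeton-nlp/SWE-bench) and the field names used in the paper's data files (problem_statement, base_commit, FAIL_TO_PASS, PASS_TO_PASS); check the dataset card for the exact schema.

    # Inspect one SWE-bench task record (assumed Hugging Face dataset name and fields).
    from datasets import load_dataset

    ds = load_dataset("princeton-nlp/SWE-bench", split="test")
    task = ds[0]
    print(task["instance_id"])              # e.g. "astropy__astropy-11693"
    print(task["repo"])                     # source GitHub repository
    print(task["base_commit"])              # commit to check out before patching
    print(task["problem_statement"][:300])  # the GitHub issue title + body
    print(task["FAIL_TO_PASS"])             # tests the patch must make pass
    print(task["PASS_TO_PASS"])             # tests that must not regress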

Created by Carlos Jimenez, John Yang, and colleagues at Princeton NLP and published at ICLR 2024, SWE-bench marked a landmark shift in AI evaluation: from abstract reasoning questions to real engineering work. The benchmark revealed a huge gap between models that score well on coding trivia and models that can actually fix software; early results showed GPT-4 solving under 2% of tasks.

The benchmark family now has six variants: Full (2,294 tasks), Lite (300 tasks), Verified (500 quality-filtered tasks — now deprecated due to contamination), Pro (1,865 harder multi-file tasks, current gold standard), Multilingual (300 tasks across 9 languages), and Multimodal (517 tasks with visual elements). In February 2026, OpenAI formally announced that SWE-bench Verified was contaminated — frontier models were reproducing verbatim gold patches from memory — and endorsed SWE-bench Pro as the new standard.

Task Anatomy

How a single task is structured.

Input: A GitHub issue title + body describing a bug or feature request, plus the full repository at the commit before the fix was merged.
Output: A unified diff (code patch) that modifies one or more files in the repository.
Evaluation: A Docker container runs the repository at the base commit, applies the model's patch, then executes two test sets: FAIL_TO_PASS tests (which must now pass) and PASS_TO_PASS tests (which must not regress). A task is resolved only if both conditions hold (see the sketch after this block).
Metric: % Resolved
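The decision rule above reduces to a small predicate. The sketch below is schematic, not the official harness (which builds per-repository Docker images and parses test logs); the results dict is a hypothetical stand-in for parsed test outcomes.

    # Schematic resolution check: a task counts as resolved only if every
    # FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes.
    def is_resolved(fail_to_pass, pass_to_pass, results):
        """results: dict mapping test id -> True (passed) / False (failed)."""
        fixed = all(results.get(t, False) for t in fail_to_pass)
        no_regressions = all(results.get(t, False) for t in pass_to_pass)
        return fixed and no_regressions

    # The headline metric is simply the fraction of resolved tasks.
    def percent_resolved(task_outcomes):
        return 100.0 * sum(task_outcomes) / len(task_outcomes)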

Example Tasks

3 real examples from the benchmark.

#1

Astropy: WCS convergence warning

Medium

Problem / Input

Instance: astropy__astropy-11693
Repo: astropy/astropy
Issue: WCS.all_world2pix failed to converge when plotting WCS with non-linear distortion. The method raises ConvergenceWarning, but there is no test for this specific behavior, and the failure should surface as a warning rather than a hard raise.
FAIL_TO_PASS: ["astropy/wcs/wcsapi/tests/test_fitswcs.py::test_non_convergence_warning"]
Answer: The patch changes the raise to warnings.warn in the WCS convergence code path, then adds a test asserting the warning is emitted rather than raised.

Illustrates a common SWE-bench pattern: the issue describes missing test coverage alongside a behavioral bug, requiring both a fix and a new test.
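A hypothetical sketch of the raise-to-warn pattern this task calls for; the names below are invented for illustration and this is not the gold patch or the actual astropy code.

    import warnings

    class ConvergenceWarning(UserWarning):
        pass

    def solve_pixels(converged: bool):
        if not converged:
            # Before the fix: raise ConvergenceWarning("did not converge")
            # After the fix: emit it as a warning so plotting can continue.
            warnings.warn("all_world2pix did not converge", ConvergenceWarning)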

#2

SymPy: Contains.as_set() raises NotImplementedError

Medium

Problem / Input

Instance: sympy__sympy-23950
Repo: sympy/sympy
Issue: Contains(x, FiniteSet(y)).as_set() raises NotImplementedError instead of returning the appropriate set expression.
Expected: Contains(x, FiniteSet(y)).as_set() should return FiniteSet(y).
FAIL_TO_PASS: ["sympy/sets/tests/test_contains.py::test_as_set"]
Answer: A new as_set method is added to the Contains class in sympy/sets/sets.py. The test verifies Contains(x, FiniteSet(y)).as_set() == FiniteSet(y) and Contains(x, S.Integers).as_set() == S.Integers.

A common class of SWE-bench tasks: a method that should be implemented simply raises NotImplementedError as a placeholder.
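A hypothetical sketch of this placeholder-replacement pattern; the class body below is invented for illustration and is not the actual SymPy implementation.

    class Contains:
        def __init__(self, element, container):
            self.args = (element, container)

        # Before the fix: the method was a stub.
        # def as_set(self):
        #     raise NotImplementedError()

        # After the fix: return the set the membership expression refers to,
        # matching the behaviour the issue expects.
        def as_set(self):
            return self.args[1]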

#3

Sphinx: Napoleon extension ignores Attributes section

Hard

Problem / Input

Instance: sphinx-doc__sphinx-8713
Repo: sphinx-doc/sphinx
Issue: The Napoleon extension for Google-style docstrings does not render attributes listed under an 'Attributes' section in the output HTML. The attributes section is parsed but the output is empty.
FAIL_TO_PASS: ["tests/test_ext_napoleon.py::test_attributes"]
Answer: Bug in the _parse_attributes_section method: attributes were collected but never appended to the output list. A one-line fix plus updated test assertions.

Harder because the bug is in a multi-stage text transformation pipeline where intermediate states are not easily inspected.
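A hypothetical sketch of this bug class (results built but never attached to the output that later pipeline stages consume); the names are invented for illustration, not the actual Napoleon code.

    def _parse_attributes_section(attributes):
        lines = []
        for name, doc in attributes:
            entry = [f".. attribute:: {name}", "", f"   {doc}", ""]
            # The bug: `entry` was constructed but never added to `lines`,
            # so the rendered section came out empty.
            lines.extend(entry)  # the one-line fix
        return lines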

Leaderboard Results

Model scores sorted by performance.

7 results

#  Model            Score
1  Opus 4.5         80.9%
2  Opus 4.6         80.8%
3  M2.5             80.2% (V)
4  Kimi K2.6        80.2% (V)
5  Gemini 2.5 Pro   78.0%
6  GPT-5            74.9% (V)
7  DeepSeek R1      73.3%

(V) = Self-reported by the model's creator, not independently verified.

Score Over Time

Performance progression across model generations.

Key Findings

  • In 2023, GPT-4 solved under 2% of tasks. By 2025, frontier models exceeded 80% on the Verified split — a 40x improvement in two years.

  • The gap between SWE-bench Verified scores (~80%) and SWE-bench Pro scores (~58%) directly quantifies the contamination effect — models 'know' the test answers on Verified.

  • The benchmark revealed that models excellent at abstract code generation (HumanEval ~97%) fail catastrophically at real engineering: reading issue descriptions, navigating large codebases, and making targeted multi-file changes.

  • Multi-language performance is substantially lower than Python: Claude Opus 4.6 scores 80.8% on Python (Verified) but only 77.8% on the multilingual mix — with C/C++ being hardest.

  • SWE-bench Pro's commercial (private) subset shows a further drop: Claude Opus 4.1 falls from 23.1% (public GPL) to 17.8% (proprietary codebases), confirming that open-source familiarity contributes to performance.

Variants & Related

SWE-bench Lite

300 tasks · Nearing Saturation

300 self-contained, single-file tasks. Designed for cost-effective evaluation. Top score: Claude Opus 4.6 at 62.7%.

SWE-bench Verified

500 tasks · Contaminated

500 tasks human-verified by 93 professional developers. Deprecated February 2026 after OpenAI found frontier models reproducing verbatim gold patches from training memory.

SWE-bench Pro

1,865 tasks · Active

1,865 multi-file, multi-hour tasks using GPL-licensed repos. Current gold standard endorsed by OpenAI. Top score: GPT-5.4 at 57.7%.

SWE-bench Multilingual

300 tasks · Active

300 tasks across C, C++, Go, Java, JavaScript, PHP, Ruby, Rust. Top score: Claude Opus 4.6 at 77.8%.

SWE-bench Multimodal

517 tasks · Active

517 tasks where issues include screenshots, UI mockups, or design diagrams. The leaderboard is sparsely populated; frontier models have not been widely evaluated here yet.

Controversies & Caveats

Known limitations and criticisms.

February 23, 2026: OpenAI announced SWE-bench Verified is contaminated. Using GPT-5 as a red-teaming agent, they found that GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash could reproduce verbatim gold patches given only minimal task hints — including exact variable names and inline comments from human-written diffs they had never been shown in the prompt.
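One simple way to quantify "verbatim reproduction" is to measure the longest run of identical consecutive lines shared by a model's patch and the gold patch; the sketch below is an illustration only, not OpenAI's red-teaming methodology.

    from difflib import SequenceMatcher

    def longest_verbatim_run(model_patch: str, gold_patch: str) -> int:
        """Longest common contiguous run of lines between the two patches."""
        a, b = model_patch.splitlines(), gold_patch.splitlines()
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return m.size  # many identical consecutive lines suggests memorization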

A February 2025 paper (LessLeak-Bench) independently confirmed 10.6% direct data leakage in SWE-bench Verified against StarCoder's training corpus — predating OpenAI's announcement.

Some researchers noted a potential conflict of interest: OpenAI endorsed SWE-bench Pro (Scale AI) while its own model leads that leaderboard.

SWE-bench Pro uses GPL-licensed repositories as a legal deterrent against training inclusion. Critics predict the contamination cycle will repeat within 6–12 months as Pro repos are crawled.

Links