SWE-bench
Can AI resolve real GitHub issues on production codebases?
What It Tests
SWE-bench tests whether AI systems can act as autonomous software engineers on real production codebases. Given a GitHub issue description and the full repository at a specific commit, the model must produce a code patch that makes failing tests pass without breaking passing ones. Tasks are sourced from merged pull requests across 12 popular Python repositories (Django, Flask, SymPy, scikit-learn, and others), ensuring every task has a verifiable ground truth.
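A minimal sketch of that resolution criterion (the function and argument names here are illustrative, not the official evaluation harness API): a patch counts as a resolution only if it flips the issue's failing tests to passing while leaving the previously passing tests intact.

```python
# Illustrative sketch of the SWE-bench resolution criterion, not the official harness API.
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """test_results maps each test identifier to True (passed) or False (failed)."""
    fixed = all(test_results.get(t, False) for t in fail_to_pass)        # failing tests now pass
    not_broken = all(test_results.get(t, False) for t in pass_to_pass)   # passing tests still pass
    return fixed and not_broken
```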
Created by Carlos Jimenez, John Yang, and colleagues at Princeton NLP and published at ICLR 2024, SWE-bench marked a shift in AI evaluation from abstract reasoning questions to real engineering work. It exposed a wide gap between models that score well on coding trivia and models that can actually fix software: early results showed GPT-4 resolving under 2% of tasks.
The benchmark family now has six variants: Full (2,294 tasks), Lite (300 tasks), Verified (500 quality-filtered tasks — now deprecated due to contamination), Pro (1,865 harder multi-file tasks, current gold standard), Multilingual (300 tasks across 9 languages), and Multimodal (517 tasks with visual elements). In February 2026, OpenAI formally announced that SWE-bench Verified was contaminated — frontier models were reproducing verbatim gold patches from memory — and endorsed SWE-bench Pro as the new standard.
Task Anatomy
How a single task is structured.
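A sketch of how one instance is laid out. The field names follow the published SWE-bench dataset schema; the values are abbreviated placeholders rather than a real task.

```python
# Abbreviated, illustrative instance; real entries carry full diffs and issue text.
example_instance = {
    "instance_id": "astropy__astropy-NNNN",         # repo + PR number (placeholder)
    "repo": "astropy/astropy",                       # source repository
    "base_commit": "abc123...",                      # commit the patch must apply to
    "problem_statement": "<GitHub issue text>",      # what the model is given
    "patch": "<gold patch diff>",                    # reference fix, hidden from the model
    "test_patch": "<tests added by the original PR>",
    "FAIL_TO_PASS": ["test_new_behaviour"],          # tests the fix must make pass
    "PASS_TO_PASS": ["test_existing_behaviour"],     # tests that must keep passing
}
```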
Example Tasks
3 real examples from the benchmark.
Astropy: WCS convergence warning
Illustrates a common SWE-bench pattern: the issue describes missing test coverage alongside a behavioral bug, requiring both a fix and a new test.
SymPy: Contains.as_set() raises NotImplementedError
A common class of SWE-bench tasks: a method that should be implemented simply raises NotImplementedError as a placeholder (the sketch after these examples shows the general shape of such a fix).
Sphinx: Napoleon extension ignores Attributes section
Harder because the bug is in a multi-stage text transformation pipeline where intermediate states are not easily inspected.
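To make the stub-method pattern from the SymPy example concrete, here is an illustrative before/after sketch. It is not the actual SymPy code or the gold patch; the class and method bodies are stand-ins.

```python
# Illustrative only: the "placeholder method" task pattern, not real SymPy code.
class Contains:
    """Stand-in for a symbolic membership expression."""
    def __init__(self, element, container):
        self.element = element
        self.container = container

    def as_set(self):
        # The stub the issue complains about: callers expecting a set get an exception.
        raise NotImplementedError


class PatchedContains(Contains):
    def as_set(self):
        # A fix replaces the stub with a real conversion; returning the container
        # is a simplification standing in for building a proper set object.
        return self.container
```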
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | Opus 4.5 | 80.9% |
| 2 | Opus 4.6 | 80.8% |
| 3 | M2.5 | 80.2% (V) |
| 4 | Kimi K2.6 | 80.2% (V) |
| 5 | Gemini 2.5 Pro | 78.0% |
| 6 | GPT-5 | 74.9% (V) |
| 7 | DeepSeek R1 | 73.3% |
(V) Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
In 2023, GPT-4 solved under 2% of tasks. By 2025, frontier models exceeded 80% on the Verified split — a 40x improvement in two years.
The gap between SWE-bench Verified scores (~80%) and SWE-bench Pro scores (~58%) largely reflects the contamination effect: models partly 'know' the Verified answers, though Pro's harder multi-file tasks also contribute.
The benchmark revealed that models excellent at abstract code generation (HumanEval ~97%) fail catastrophically at real engineering: reading issue descriptions, navigating large codebases, and making targeted multi-file changes.
Multilingual performance lags Python: Claude Opus 4.6 scores 80.8% on Python (Verified) but 77.8% on the multilingual mix, with C and C++ the hardest languages.
SWE-bench Pro's commercial (private) subset shows a further drop: Claude Opus 4.1 falls from 23.1% (public GPL) to 17.8% (proprietary codebases), confirming that open-source familiarity contributes to performance.
Variants & Related
SWE-bench Lite
300 self-contained, single-file tasks. Designed for cost-effective evaluation. Top score: Claude Opus 4.6 at 62.7%.
SWE-bench Verified
500 tasks human-verified by 93 professional developers. Deprecated February 2026 after OpenAI found frontier models reproducing verbatim gold patches from training memory.
SWE-bench Pro
1,865 multi-file, multi-hour tasks using GPL-licensed repos. Current gold standard endorsed by OpenAI. Top score: GPT-5.4 at 57.7%.
SWE-bench Multilingual
300 tasks across C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. Top score: Claude Opus 4.6 at 77.8%.
SWE-bench Multimodal
517 tasks where issues include screenshots, UI mockups, or design diagrams. The leaderboard is sparsely populated; frontier models have not yet been widely evaluated on it.
Controversies & Caveats
Known limitations and criticisms.
February 23, 2026: OpenAI announced that SWE-bench Verified is contaminated. Using GPT-5 as a red-teaming agent, they found that GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash could reproduce verbatim gold patches given only minimal task hints, including exact variable names and inline comments from human-written diffs they had never been shown in the prompt (a simplified verbatim-reproduction check is sketched after this list).
A February 2025 paper (LessLeak-Bench) independently confirmed 10.6% direct data leakage in SWE-bench Verified against StarCoder's training corpus — predating OpenAI's announcement.
Some researchers noted a potential conflict of interest in OpenAI endorsing SWE-bench Pro (a Scale AI benchmark) while leading its leaderboard.
SWE-bench Pro uses GPL-licensed repositories as a legal deterrent against training inclusion. Critics predict the contamination cycle will repeat within 6–12 months as Pro repos are crawled.
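As a rough illustration of the detection method described above, the sketch below flags a model-generated patch that is a near-verbatim copy of the gold patch. It is a simplified stand-in, not OpenAI's actual red-teaming tooling; the function names and the 0.95 threshold are assumptions.

```python
import difflib

def normalize(patch: str) -> str:
    """Collapse whitespace so formatting noise does not hide a verbatim match."""
    return "\n".join(" ".join(line.split()) for line in patch.strip().splitlines())

def contamination_score(model_patch: str, gold_patch: str) -> float:
    """Similarity in [0, 1]; values near 1.0 suggest the gold patch was memorized."""
    return difflib.SequenceMatcher(
        None, normalize(model_patch), normalize(gold_patch)
    ).ratio()

# Toy example: identical patches score 1.0, a strong signal of memorization
# when the model was never shown the gold patch.
gold = "-    raise NotImplementedError\n+    return self.container.as_set()"
generated = "-    raise NotImplementedError\n+    return self.container.as_set()"
if contamination_score(generated, gold) > 0.95:  # threshold is an assumption
    print("possible verbatim reproduction of the gold patch")
```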