HumanEval / HumanEval+
Python function completion from docstrings, evaluated by test execution.
What It Tests
HumanEval was created by OpenAI in 2021 alongside the Codex model. It marked a landmark shift in code evaluation: instead of measuring textual similarity to a reference solution (e.g., BLEU), it runs the model's code against hidden unit tests, so the code either works or it doesn't. This functional-correctness approach became the standard for subsequent coding benchmarks.
The benchmark consists of 164 hand-crafted Python programming problems. Each problem shows the model only a function signature and a docstring with worked examples; the model must complete the function body, which is then executed against an average of 7.7 hidden test cases.
HumanEval+ (EvalPlus), published at NeurIPS 2023, exposed a critical weakness: an average of 7.7 tests per problem is too sparse, allowing many incorrect solutions to pass by luck. EvalPlus added roughly 80x more test cases per problem using automated generation, revealing that 15–30% of 'passing' solutions on vanilla HumanEval were actually wrong. Even state-of-the-art models drop 8–15 percentage points on HumanEval+.
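The extended test suites ship as a pip package. The sketch below follows the pattern in the evalplus README for loading the plus problems and writing samples for its evaluation CLI; `generate_solution` is a placeholder standing in for an actual model call.

```python
# Generating samples for the EvalPlus harness, following the pattern in
# the evalplus README (pip install evalplus).
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Placeholder for a model call; the "solution" field expects the
    # full function (prompt plus completed body).
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
# Scored afterwards with the package's CLI:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```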
HumanEval is now effectively saturated — frontier models score 91–97%, within statistical noise of each other on only 164 problems. It remains useful as a minimum baseline check but no longer discriminates between top models. LiveCodeBench, SWE-bench, and BigCodeBench have replaced it for frontier evaluation.
Task Anatomy
How a single task is structured.
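Each of the 164 records in the released HumanEval.jsonl carries five fields. The sketch below shows the record layout and the pass check; the field names match the released dataset, while the execution logic is a simplification of the official sandboxed harness.

```python
# One HumanEval record (field names from the released HumanEval.jsonl;
# values abbreviated here) and a simplified version of the pass check.
task = {
    "task_id": "HumanEval/0",             # stable problem identifier
    "prompt": "from typing import List\n\ndef has_close_elements(...",     # what the model sees
    "entry_point": "has_close_elements",  # function the hidden tests call
    "canonical_solution": "    for idx, elem in enumerate(numbers): ...",  # reference body
    "test": "def check(candidate): ...",  # hidden unit tests
}

def passes(task: dict, completion: str) -> bool:
    # The harness concatenates prompt + completion + tests into one
    # program and executes it; any assertion failure or crash fails it.
    program = (
        task["prompt"] + completion + "\n"
        + task["test"] + "\n"
        + f"check({task['entry_point']})"
    )
    try:
        exec(program, {})  # the real harness sandboxes this with a timeout
        return True
    except Exception:
        return False
```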
Example Tasks
3 real examples from the benchmark.
HumanEval/0 — has_close_elements
Problem / Input
```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each
    other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```

The canonical first problem. Simple nested loop; tests basic iteration logic.
HumanEval/45 — triangle_area
Problem / Input
```python
def triangle_area(a, h):
    """Given length of a side and high return area for a triangle.
    >>> triangle_area(5, 3)
    7.5
    """
```

The simplest possible problem: a one-line solution. Tests that models don't overthink trivial tasks.
HumanEval/32 — find_zero
Problem / Input
```python
import math


def poly(xs: list, x: float):
    """
    Evaluates polynomial with coefficients xs at point x.
    return xs[0] + xs[1] * x + xs[1] * x^2 + .... xs[n] * x^n
    """
    return sum([coeff * math.pow(x, i) for i, coeff in enumerate(xs)])


def find_zero(xs: list):
    """ xs are coefficients of a polynomial.
    find_zero finds x such that poly(x) = 0.
    find_zero returns only one zero point, even if there are many.
    Moreover, find_zero only takes list xs having even number of coefficients
    and largest non zero coefficient as it guarantees a solution.
    >>> round(find_zero([1, 2]), 2) # f(x) = 1 + 2x
    -0.5
    >>> round(find_zero([-6, 11, -6, 1]), 2) # (x-1)(x-2)(x-3) = -6 + 11x - 6x^2 + x^3
    1.0
    """
```

Requires implementing numerical root-finding (bisection method). One of the harder problems in the set.
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | o4-mini | 97.6%* |
| 2 | GPT-5 | 96.9%* |
| 3 | Opus 4.5 | 95.7%* |
| 4 | Gemini 2.5 Pro | 94.2%* |
| 5 | Qwen 3 72B | 92.7%* |
| 6 | DeepSeek V3 | 90.2%* |
| 7 | Llama 4 Maverick | 88.5%* |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
HumanEval scores have gone from 28.8% (Codex, 2021) to 97%+ (frontier models, 2025) — a nearly complete saturation arc in 4 years.
The benchmark that launched the functional-correctness paradigm in code evaluation is now most useful as a minimum floor check, not a frontier discriminator.
HumanEval+ exposed that models were 'getting lucky' on sparse tests: Claude 3.5 Sonnet drops from 92.1% to 81.7% on HumanEval+ — a 10-point gap that reveals how much sparse testing inflates scores.
The pass@k metric introduced by the paper became the standard for all subsequent code generation evaluation, influencing SWE-bench and LiveCodeBench design.
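Concretely, pass@k estimates the probability that at least one of k drawn samples is correct, given n generated samples of which c pass: 1 - C(n-c, k) / C(n, k). The paper computes this with the numerically stable running product below, which this sketch follows.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), from n samples with c correct.

    Uses the stable product form given in the HumanEval paper.
    """
    if n - c < k:
        return 1.0  # too few failures: every size-k draw contains a pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 140 passing.
print(pass_at_k(200, 140, 1))   # 0.70
print(pass_at_k(200, 140, 10))  # ~1.0 (an all-failure draw is vanishingly rare)
```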
Variants & Related
HumanEval+
Same 164 problems with ~80x more test cases per problem (EvalPlus). Catches solutions that pass the sparse original tests but fail edge cases. Models drop 8–15 percentage points vs vanilla HumanEval.
HumanEval-XL
Extends the benchmark to 23 natural languages and 12 programming languages. Tests multilingual instruction following for code generation.
Controversies & Caveats
Known limitations and criticisms.
Severe contamination: HumanEval problems and solutions circulate widely on the internet and appear in most training datasets. HumanEval-T (template variants with same logic, different surface form) shows all tested models drop 5–14 percentage points — direct evidence of memorization.
With only 164 problems, a score near 90% carries a standard error of about 2.3 percentage points, so the 95% confidence interval spans roughly ±4.6 points (see the sanity check below). Two models at 87% and 90% are statistically indistinguishable on HumanEval alone.
Narrow scope: Python only, isolated single-function generation, no multi-file context, no imports to manage, no interaction with external state. Does not reflect real software engineering.
HumanEval+ revealed that 15–30% of 'passing' solutions under vanilla HumanEval are actually incorrect when tested more rigorously. High HumanEval scores are artificially inflated.
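The arithmetic behind the small-sample caveat above, as a quick sanity check (normal approximation to a binomial proportion):

```python
import math

n = 164   # number of HumanEval problems
p = 0.90  # observed pass rate

se = math.sqrt(p * (1 - p) / n)  # standard error ~= 0.023
ci95 = 1.96 * se                 # ~= 0.046, i.e. +/- 4.6 points

# 90% on HumanEval is really 90 +/- 4.6 at 95% confidence, so an
# 87% model and a 90% model overlap comfortably.
print(f"SE = {se:.3f}, 95% CI = +/-{ci95:.3f}")
```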