Coding · Saturated

HumanEval / HumanEval+

Python function completion from docstrings, evaluated by test execution.

Tasks: 164
Year: 2021
Creator: Mark Chen et al.
Metric: pass@1 (single-shot success rate)
Random Chance: 0%

What It Tests

HumanEval was created by OpenAI in 2021 alongside the Codex model, and it marked a landmark shift in code evaluation: instead of measuring textual similarity (BLEU score), it runs the model's code against hidden unit tests — the code either works or it doesn't. This functional-correctness approach became the standard for all subsequent coding benchmarks.

The benchmark consists of 164 hand-crafted Python programming problems. Each problem gives the model only a function signature and a docstring with embedded input/output examples. The model must complete the function body, which is then executed against an average of 7.7 hidden test cases.
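In outline, evaluation is string concatenation plus execution: the prompt and the model's completion are joined into a full function, the hidden tests are appended, and the whole program is run. A minimal sketch, assuming the prompt/test/entry_point task fields of the released dataset; the official harness (github.com/openai/human-eval) additionally isolates each run in a subprocess with a timeout:

from typing import Dict

def passes_tests(task: Dict[str, str], completion: str) -> bool:
    # task["prompt"]: signature + docstring; task["test"]: hidden tests defining
    # check(candidate); task["entry_point"]: name of the function under test.
    program = (
        task["prompt"] + completion + "\n"
        + task["test"] + "\n"
        + f'check({task["entry_point"]})\n'
    )
    try:
        exec(program, {})  # real harness: sandboxed subprocess with a timeout
        return True
    except BaseException:
        return False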

HumanEval+ (EvalPlus), released at NeurIPS 2023, exposed a critical weakness: the original 7.7 tests per problem were too few, allowing many incorrect solutions to accidentally pass. EvalPlus added ~80x more test cases per problem using automated generation, revealing that 15–30% of 'passing' solutions on vanilla HumanEval were actually wrong. Even state-of-the-art models drop 8–15 percentage points on HumanEval+.
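An invented illustration of that failure mode (not an actual EvalPlus finding): the completion below satisfies both docstring examples of HumanEval/0 (shown under Example Tasks) yet compares only adjacent elements, so a denser test suite rejects it immediately.

def has_close_elements(numbers, threshold):
    # Buggy: checks neighbouring pairs only, not every pair.
    return any(abs(a - b) < threshold for a, b in zip(numbers, numbers[1:]))

print(has_close_elements([1.0, 2.0, 3.0], 0.5))                 # False (matches docstring)
print(has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3))  # True (matches docstring)
print(has_close_elements([1.0, 5.0, 1.05], 0.1))                # False, but True is correct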

HumanEval is now effectively saturated — frontier models score 91–97%, within statistical noise of each other on only 164 problems. It remains useful as a minimum baseline check but no longer discriminates between top models. LiveCodeBench, SWE-bench, and BigCodeBench have replaced it for frontier evaluation.

Task Anatomy

How a single task is structured.

Input: A Python function signature plus a docstring describing the function's behavior, with 1–3 input/output examples embedded in the docstring.
Output: The model generates the complete function body (everything after the docstring).
Evaluation: The generated code is executed in a sandbox against hidden unit tests. Pass/fail is binary per problem. Score = pass@k, the probability that at least one of k generated samples passes all tests.
Metric: pass@1 (single-shot success rate)

Example Tasks

3 real examples from the benchmark.

#1

HumanEval/0 — has_close_elements

Easy

Problem / Input

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each
    other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
Answer: True or False depending on whether any two numbers are closer than the threshold.

The canonical first problem. Simple nested loop, tests basic iteration logic.
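A passing completion needs nothing beyond an exhaustive pairwise comparison, for example:

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # Check every unordered pair; a single close pair is enough.
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False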

#2

HumanEval/2 — triangle_area

Easy

Problem / Input

def triangle_area(a, h):
    """Given length of a side and high return area for a triangle.
    >>> triangle_area(5, 3)
    7.5
    """
Answer: 7.5 for triangle_area(5, 3)

The simplest possible problem — a one-line solution. Tests that models don't overthink trivial tasks.
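The expected completion really is a single line:

def triangle_area(a, h):
    # Area from base and height.
    return a * h / 2.0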

#3

HumanEval/32 — find_zero

Hard

Problem / Input

import math

def poly(xs: list, x: float):
    """
    Evaluates polynomial with coefficients xs at point x.
    return xs[0] + xs[1] * x + xs[1] * x^2 + .... xs[n] * x^n
    """
    return sum([coeff * math.pow(x, i) for i, coeff in enumerate(xs)])

def find_zero(xs: list):
    """ xs are coefficients of a polynomial.
    find_zero finds x such that poly(x) = 0.
    find_zero returns only one zero point, even if there are many.
    Moreover, find_zero only takes list xs having even number of coefficients
    and largest non zero coefficient as it guarantees a solution.
    >>> round(find_zero([1, 2]), 2) # f(x) = 1 + 2x
    -0.5
    >>> round(find_zero([-6, 11, -6, 1]), 2) # (x-1)(x-2)(x-3) = -6 + 11x - 6x^2 + x^3
    1.0
    """
Answer: -0.5 for [1, 2] (solving 1 + 2x = 0)

Requires implementing numerical root-finding (bisection method). One of the harder problems in the set.
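One passing approach (a sketch, with arbitrary bracket and tolerance choices; it reuses the poly helper supplied in the prompt): widen an interval until the polynomial changes sign, which the docstring's guarantees make inevitable for an odd-degree polynomial, then bisect.

def find_zero(xs: list):
    lo, hi = -1.0, 1.0
    # Widen the bracket until poly changes sign across [lo, hi].
    while poly(xs, lo) * poly(xs, hi) > 0:
        lo, hi = lo * 2.0, hi * 2.0
    # Bisection: keep whichever half still brackets a root.
    while hi - lo > 1e-10:
        mid = (lo + hi) / 2.0
        if poly(xs, lo) * poly(xs, mid) <= 0:
            hi = mid
        else:
            lo = mid
    return lo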

Leaderboard Results

Model scores sorted by performance.

#   Model              Score
1   o4-mini            97.6% (V)
2   GPT-5              96.9% (V)
3   Opus 4.5           95.7% (V)
4   Gemini 2.5 Pro     94.2% (V)
5   Qwen 3 72B         92.7% (V)
6   DeepSeek V3        90.2% (V)
7   Llama 4 Maverick   88.5% (V)

(V) = Self-reported by the model's creator, not independently verified

Score Over Time

[Chart: performance progression across model generations.]

Key Findings

  • HumanEval scores have gone from 28.8% (Codex, 2021) to 97%+ (frontier models, 2025) — a nearly complete saturation arc in 4 years.

  • The benchmark that launched the functional-correctness paradigm in code evaluation is now most useful as a minimum floor check, not a frontier discriminator.

  • HumanEval+ exposed that models were 'getting lucky' on sparse tests: Claude 3.5 Sonnet drops from 92.1% to 81.7% on HumanEval+ — a 10-point gap that reveals how much sparse testing inflates scores.

  • The pass@k metric introduced by the paper became the standard for all subsequent code generation evaluation, influencing SWE-bench and LiveCodeBench design.
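For reference, the estimator the paper introduced: generate n >= k samples per problem, count the c that pass, and average 1 - C(n-c, k)/C(n, k) across problems. The product form below mirrors the paper's reference implementation:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k (Chen et al., 2021): probability that at least one of
    # k samples drawn from the n generated (c of them passing) is correct.
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(n=200, c=20, k=1))  # 0.1 (pass@1 reduces to c/n)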

Variants & Related

HumanEval+

164 tasks · Nearing Saturation

Same 164 problems with ~80x more test cases per problem (EvalPlus). Catches solutions that pass the sparse original tests but fail edge cases. Models drop 8–15 percentage points vs vanilla HumanEval.

HumanEval-XL

3,741 tasks · Active

Extended to 23 natural languages (NL) and multiple programming languages. Tests multilingual instruction following for code generation.

Controversies & Caveats

Known limitations and criticisms.

Severe contamination: HumanEval problems and solutions circulate widely on the internet and appear in most training datasets. HumanEval-T (template variants with same logic, different surface form) shows all tested models drop 5–14 percentage points — direct evidence of memorization.

With only 164 problems, a score of 90% carries a standard error of roughly ±2.3 percentage points, which widens to about ±4.6 points at 95% confidence. Two models at 87% and 90% are statistically indistinguishable on HumanEval alone.
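The arithmetic, as a quick normal-approximation check (a rough bound; a real comparison between two models should use a paired test over the shared 164 problems):

import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    # Normal-approximation 95% confidence half-width for pass rate p on n tasks.
    return z * math.sqrt(p * (1.0 - p) / n)

print(ci_half_width(0.90, 164))  # ~0.046, i.e. roughly +/-4.6 percentage points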

Narrow scope: Python only, isolated single-function generation, no multi-file context, no imports to manage, no interaction with external state. Does not reflect real software engineering.

HumanEval+ revealed that 15–30% of 'passing' solutions under vanilla HumanEval are actually incorrect when tested more rigorously. High HumanEval scores are artificially inflated.
