Coding · Active

LiveCodeBench

Contamination-resistant coding benchmark using freshly released competition problems.

Tasks: 1,055
Year: 2024
Creator: Naman Jain, King Han, Alex Gu, et al.
Metric: pass@1 (code generation scenario)
Random Chance: 0%

What It Tests

LiveCodeBench solves the core problem plaguing static coding benchmarks: training data contamination. By continuously collecting new problems from competitive programming platforms (LeetCode, AtCoder, Codeforces) and tagging each with its release date, evaluators can restrict evaluation to problems released after a model's training cutoff — ensuring models genuinely have not seen the problems before.
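To make the windowing idea concrete, here is a minimal sketch of the release-date filter, assuming a toy problem list; the IDs and field names are illustrative and not the benchmark's actual schema.

from datetime import date

# Toy records; the real benchmark ships its own problem files. The key idea is
# the per-problem release date that enables contamination-free evaluation.
problems = [
    {"id": "p1", "platform": "leetcode",   "release_date": date(2024, 5, 5)},
    {"id": "p2", "platform": "atcoder",    "release_date": date(2024, 5, 4)},
    {"id": "p3", "platform": "codeforces", "release_date": date(2024, 2, 20)},
]

def post_cutoff(problems, training_cutoff):
    # Keep only problems released after the model's training cutoff.
    return [p for p in problems if p["release_date"] > training_cutoff]

fresh = post_cutoff(problems, training_cutoff=date(2024, 4, 30))
print([p["id"] for p in fresh])  # ['p1', 'p2']: only post-cutoff problems remain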

Created at UC Berkeley, MIT, and Cornell, and published at ICLR 2025, LiveCodeBench has grown from 400 problems in March 2024 (v1) to 1,055 problems in April 2025 (v6). This continuous refresh means the benchmark resists saturation by design: new hard problems are always entering the pool.

A key innovation is testing four distinct scenarios rather than just code generation: (1) Code Generation — solve competitive programming problems; (2) Self-Repair — fix a failed solution given error feedback; (3) Code Execution — predict what a given code snippet outputs for a specific input; (4) Test Output Prediction — predict the correct output for a problem without writing code. These four scenarios together give a more complete picture of coding ability than generation alone.
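A rough sketch of how the four scenarios differ in what the model is asked to produce is shown below; the field names (statement, failed_code, error, snippet, test_input) are hypothetical and not the benchmark's actual schema.

def build_prompt(scenario: str, task: dict) -> str:
    # Hypothetical prompt builder: one task record, four ways of querying the model.
    if scenario == "code_generation":
        return f"Solve this problem:\n{task['statement']}"
    if scenario == "self_repair":
        return (f"{task['statement']}\n\nYour earlier solution:\n{task['failed_code']}\n"
                f"It failed with:\n{task['error']}\nReturn a corrected solution.")
    if scenario == "code_execution":
        return f"What does this code output?\n{task['snippet']}\nInput: {task['test_input']}"
    if scenario == "test_output_prediction":
        return (f"{task['statement']}\nWithout writing code, give the expected output "
                f"for this input:\n{task['test_input']}")
    raise ValueError(f"unknown scenario: {scenario}")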

The benchmark has exposed dramatic contamination effects in practice: DeepSeek models showed a sharp performance drop on problems released after their training cutoff while performing well on pre-cutoff problems, clear evidence of memorization rather than genuine reasoning.

Task Anatomy

How a single task is structured.

Input: A competitive programming problem statement with example input/output pairs. LeetCode problems additionally include starter code; AtCoder and Codeforces problems use a standard-input format.
Output: A complete program that reads the input and produces the correct output within the time and memory limits.
Evaluation: The generated code is executed against hidden test cases; the score is pass@1. Four scenarios are tested: code generation, self-repair (fix failed code given error feedback), code execution (predict output), and test output prediction.
Metric: pass@1 (code generation scenario)
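The following is a minimal sketch of the evaluation loop, assuming plain stdin/stdout programs; the official harness additionally sandboxes execution, enforces memory limits, and handles LeetCode's function-call style, so treat this as illustrative only.

import subprocess

def solves(program: str, tests: list[tuple[str, str]], timeout_s: float = 2.0) -> bool:
    # Run a candidate program against (stdin, expected stdout) pairs.
    for stdin_data, expected in tests:
        try:
            run = subprocess.run(
                ["python3", "-c", program],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if run.returncode != 0 or run.stdout.strip() != expected.strip():
            return False
    return True

def pass_at_1(per_problem_results: list[bool]) -> float:
    # With one sample per problem, pass@1 reduces to the fraction of problems solved.
    return sum(per_problem_results) / len(per_problem_results)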

Example Tasks

3 real examples from the benchmark.

#1

Code Generation — Array Duplicate Count

Easy

Problem / Input

Problem: Given an array of integers, return the count of elements that appear more than once.
Example Input: [1, 2, 2, 3, 3, 3]
Example Output: 2
Constraints: 1 ≤ n ≤ 10^5, -10^9 ≤ nums[i] ≤ 10^9

Answer: 2 for [1, 2, 2, 3, 3, 3] (elements 2 and 3 appear more than once)

An easy-tier problem testing basic frequency counting; top models typically solve roughly 95% of problems at this difficulty.
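One plausible solution (not an official reference), using linear-time frequency counting:

from collections import Counter

def count_duplicates(nums: list[int]) -> int:
    # Count distinct values whose frequency exceeds one.
    return sum(1 for freq in Counter(nums).values() if freq > 1)

assert count_duplicates([1, 2, 2, 3, 3, 3]) == 2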

#2

Code Execution — Predict Output

Medium

Problem / Input

What does this function return for the given input?

from collections import Counter

def count(nums):
    freq = Counter(nums)
    max_freq = max(freq.values())
    return sum(k for k, v in freq.items() if v == max_freq)

Input: count([1, 2, 2, 3])
Answer: 2

Code execution tasks test whether models understand program semantics, not just generate code. Models must trace execution mentally.

#3

Test Output Prediction — Inversion Count

Hard

Problem / Input

Problem: Count inversions in an array (pairs (i, j) where i < j but arr[i] > arr[j]).
Test input: [3, 1, 2]
What is the correct output?

Answer: 2

Output prediction tests deep understanding of the problem specification — models must reason about program behavior without executing code.
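To check the stated answer, here is a brute-force sketch (the task itself only asks for the output; for large inputs a merge-sort-based O(n log n) count would be used instead):

def count_inversions(arr: list[int]) -> int:
    # O(n^2) brute force: count pairs (i, j) with i < j and arr[i] > arr[j].
    n = len(arr)
    return sum(1 for i in range(n) for j in range(i + 1, n) if arr[i] > arr[j])

assert count_inversions([3, 1, 2]) == 2  # inverted pairs: (3, 1) and (3, 2)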

Leaderboard Results

Model scores sorted by performance.

7 results

#  Model          Score
1  Gemini 3 Pro   91.7%
2  Kimi K2.6      89.6% (V)
3  DeepSeek R1    87.1% (V)
4  Opus 4.5       87.1%
5  o4-mini        85.9% (V)
6  Qwen 3 72B     83.6% (V)
7  GPT-5          ~79% (V)

(V) = self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • Contamination detection: DeepSeek models showed a sharp performance cliff on problems released after their training cutoff while performing well on pre-cutoff problems — the first clear empirical demonstration of competitive programming contamination.

  • Four-scenario design reveals that models optimized for code generation often struggle at code execution and test output prediction — distinct capabilities not captured by pass@1 alone.

  • LiveCodeBench v6 (1,055 problems) shows top models at 91%+ on the full corpus but substantially lower on the fresh post-cutoff window — the benchmark's self-refreshing design maintains discriminating power.

  • The benchmark revealed a systematic rank divergence from SWE-bench: models strong at algorithmic reasoning (LiveCodeBench) are not necessarily strong at practical engineering (SWE-bench), and vice versa.

Controversies & Caveats

Known limitations and criticisms.

Score inflation via window selection: scores on the full corpus (including pre-training-cutoff problems) are significantly higher than post-cutoff-only scores. Publishers often cite the more favorable number without disclosing which window was used.

Competition-programming focus: tasks are algorithmic (data structures, mathematics, graph theory) and don't cover real software engineering (no multi-file codebases, no APIs, no debugging existing systems).

LeetCode platform concentration: many problems come from LeetCode, which has a distinct style that some models are specifically fine-tuned on.

Rank divergence from SWE-bench is informative but confusing: Gemini 3 Pro scores 91.7% on LiveCodeBench but only ~78% on SWE-bench Verified — the skills are related but different enough to produce opposite rankings for some model pairs.

Links