Contamination-Resistant · Active

LiveBench

A contamination-resistant benchmark, refreshed monthly from recent sources and scored without an LLM judge.

Tasks: 1,000
Year: 2024
Creator: Colin White, Samuel Dooley, Manley Roberts, et al.
Metric: Global Average score across all 7 categories

What It Tests

LiveBench solves two core problems with standard benchmarks simultaneously: training data contamination and LLM-as-judge bias. Created by Colin White, Samuel Dooley, and colleagues and published at ICLR 2025 (Spotlight), it draws questions from recently published arXiv papers, news articles, math competition releases, and new datasets — ensuring questions couldn't have been in any model's training data.

All answers are objectively verifiable, eliminating the need for an LLM judge (which introduces judge-model bias and can be gamed by models whose outputs resemble the judge's own). Questions are updated monthly, with older questions retired as their risk of contamination grows.
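To make this concrete, here is a minimal sketch of what judge-free grading can look like, assuming a simple normalize-then-compare rule; the function names and normalization details are illustrative, not LiveBench's actual harness:

    # Illustrative judge-free grading: answers are normalized and compared
    # deterministically, so no judge model (and no judge bias) is involved.
    def normalize(answer: str) -> str:
        """Trim, lowercase, and collapse whitespace before comparison."""
        return " ".join(answer.strip().lower().split())

    def grade_exact_match(model_output: str, ground_truth: str) -> float:
        """1.0 if the normalized answers match exactly, else 0.0."""
        return 1.0 if normalize(model_output) == normalize(ground_truth) else 0.0

    print(grade_exact_match("  42 ", "42"))        # 1.0
    print(grade_exact_match("roughly 42", "42"))   # 0.0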

The benchmark covers 7 categories and 23 tasks: Reasoning (spatial, temporal, counting, web of lies, logical deduction), Coding (code generation, code completion), Mathematics (AMPS hard, AIME-style, Olympiad), Language (connections, plot unscrambling, word sorting, typos), Data Analysis (tabular reasoning, forecasting), Instruction Following (length constraints, format constraints), and Agentic Coding (multi-file programming tasks from recent papers).
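Assuming the Global Average is the unweighted mean of the seven category scores, each of which is the mean of its task scores (the numbers below are placeholders, not real results), the aggregation can be sketched as:

    # Placeholder task scores for illustration only -- not actual model results.
    from statistics import mean

    task_scores = {
        "Reasoning":             {"web_of_lies": 0.72, "spatial": 0.65},
        "Coding":                {"code_generation": 0.58, "code_completion": 0.61},
        "Mathematics":           {"AMPS_hard": 0.70, "olympiad": 0.44},
        "Language":              {"connections": 0.55, "typos": 0.81},
        "Data Analysis":         {"tabular_reasoning": 0.66, "forecasting": 0.49},
        "Instruction Following": {"format_constraints": 0.77},
        "Agentic Coding":        {"multi_file_tasks": 0.31},
    }

    # Average within each category first, then average the category averages.
    category_avg = {cat: mean(t.values()) for cat, t in task_scores.items()}
    global_average = mean(category_avg.values())
    print(f"Global Average: {global_average:.1%}")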

LiveBench is designed to remain challenging indefinitely by always including the hardest problems from new math competitions and the most demanding tasks from new research papers. As models improve, harder problems are added from the same freshly-released sources.
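The refresh mechanic itself is simple to picture. The sketch below assumes each question carries its source's release date and stays live only while it postdates every known training cutoff; the field names and retirement rule are assumptions, not LiveBench's code:

    # Sketch of monthly rotation: keep only questions whose source appeared
    # after the most recent known training cutoff; retire the rest.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Question:
        text: str
        source_release: date   # when the arXiv paper / competition / article appeared

    def refresh(pool: list[Question], latest_training_cutoff: date) -> list[Question]:
        """Return the questions still considered safe from contamination."""
        return [q for q in pool if q.source_release > latest_training_cutoff]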

Task Anatomy

How a single task is structured.

Input: A question drawn from a recently published source (arXiv paper, math competition, news article). Each question has an objectively verifiable answer. Questions rotate monthly.
Output: The model generates an answer in the format specified for the task type.
Evaluation: Exact match or structured comparison against ground truth. No LLM judge — all answers are deterministically correct or incorrect (see the sketch after this block).
Metric: Global Average score across all 7 categories
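A hypothetical task record tying these fields together might look like this; the schema and names are illustrative, not LiveBench's internal format:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class LiveBenchTask:
        category: str                        # one of the 7 categories above
        question: str                        # drawn from a post-cutoff source
        ground_truth: str                    # objectively verifiable answer
        grade: Callable[[str, str], float]   # deterministic scorer, e.g. exact match

    def score(task: LiveBenchTask, model_output: str) -> float:
        """Grade one model answer against the task's ground truth."""
        return task.grade(model_output, task.ground_truth)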

Example Tasks

Three representative examples from the benchmark.

#1

Mathematics — Recent competition problem

Hard

Problem / Input

From a 2024-2025 math competition (released after model training cutoff): A sequence satisfies a₁ = 1, a₂ = 2, and aₙ = 3aₙ₋₁ - aₙ₋₂ for n ≥ 3. Find the sum of all positive integers k such that aₖ divides a₂₀₂₅.
Answer: 2026

Because this problem is drawn from a competition released after training cutoffs, models cannot have memorized the answer. This is the core anti-contamination property.
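The stated answer can be double-checked by brute force; the snippet below simply generates the sequence and tests divisibility directly (a verification sketch, not the competition's intended closed-form solution):

    # Brute-force check: build a_1..a_2025 from the recurrence, then sum the
    # indices k for which a_k divides a_2025.
    def build_sequence(n: int) -> list[int]:
        a = [0, 1, 2]                       # 1-indexed: a[1] = 1, a[2] = 2
        for i in range(3, n + 1):
            a.append(3 * a[i - 1] - a[i - 2])
        return a

    a = build_sequence(2025)
    divisor_indices = [k for k in range(1, 2026) if a[2025] % a[k] == 0]
    print(divisor_indices, sum(divisor_indices))   # [1, 2025] -> 2026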

#2

Language — Connections puzzle

Medium

Problem / Input

Group these 16 words into 4 categories of 4 related words: APPLE, ORANGE, PITCH, PLUM, CHERRY, BASS, OLIVE, PINE, TREBLE, CLEF, NOTE, KEY, FLAT, SHARP, SCALE, MAJOR (From a recent NYT Connections puzzle released after training cutoffs)
Answer: The specific groupings from the puzzle release

Connections puzzles require careful disambiguation of words with multiple meanings. The puzzle is drawn from a specific NYT release date after training cutoffs.
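Grading such a puzzle still needs no judge: the model's grouping can be compared to the answer key as a set of sets, insensitive to ordering. The groupings in the sketch below are made up for illustration; the real answer key comes from the specific NYT release:

    # Order-insensitive structured comparison for a Connections-style answer:
    # correct iff the four 4-word groups match as sets, in any order.
    def groups_match(predicted: list[list[str]], answer_key: list[list[str]]) -> bool:
        canonical = lambda groups: {frozenset(w.upper() for w in g) for g in groups}
        return canonical(predicted) == canonical(answer_key)

    # Hypothetical groupings, not the actual puzzle solution.
    print(groups_match(
        [["BASS", "TREBLE", "CLEF", "NOTE"], ["APPLE", "ORANGE", "PLUM", "CHERRY"],
         ["PITCH", "KEY", "SCALE", "MAJOR"], ["OLIVE", "PINE", "FLAT", "SHARP"]],
        [["APPLE", "CHERRY", "ORANGE", "PLUM"], ["BASS", "CLEF", "NOTE", "TREBLE"],
         ["KEY", "MAJOR", "PITCH", "SCALE"], ["FLAT", "OLIVE", "PINE", "SHARP"]],
    ))   # True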

#3

Data Analysis — Forecasting from recent data

Hard

Problem / Input

Based on the provided table of monthly renewable energy installation data (Jan 2024 - Oct 2024), predict the November 2024 solar capacity addition (GW) using the trend you observe. [Table with actual 2024 data released after training cutoffs]
Answer: Specific numeric value matching the ground truth from the actual November 2024 data

Data analysis tasks use real recent datasets — the answers are objectively correct based on actual measured values, not opinion.
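One plausible way to grade such a prediction deterministically is a numeric comparison against the measured value within a tolerance; the tolerance and numbers below are assumptions for illustration, not LiveBench's actual scoring rule:

    # Assumed numeric grading: full credit if the prediction falls within a
    # relative tolerance of the measured ground-truth value.
    def grade_numeric(predicted: float, actual: float, rel_tol: float = 0.05) -> float:
        return 1.0 if abs(predicted - actual) <= rel_tol * abs(actual) else 0.0

    print(grade_numeric(18.7, 19.2))   # 1.0 -- within 5% of the measured value
    print(grade_numeric(25.0, 19.2))   # 0.0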

Leaderboard Results

Model scores sorted by performance.

8 results

#   Model               Score
1   o4-mini             87.3%
2   GPT-5               ~82%
3   Opus 4.5            ~79%
4   Gemini 2.5 Pro      ~76%
5   Qwen 3 72B          ~73%
6   DeepSeek R1         ~72%
7   Llama 4 Maverick    ~67%
8   GPT-4o              ~53%


Score Over Time

Performance progression across model generations.

Key Findings

  • LiveBench provided the clearest empirical demonstration that static benchmarks are contaminated: models that scored high on MATH-500 showed significantly lower performance on equivalent LiveBench math questions from the same difficulty tier.

  • No LLM judge is a key design win: removing judge model bias makes LiveBench scores more reliable than benchmarks evaluated by GPT-4-class judges, which can favor outputs similar to GPT-4.

  • The monthly refresh design is the benchmark's core structural advantage — as AI capabilities improve, the benchmark automatically includes harder problems from the latest competitions and papers.

  • Score spread across the model population (~35% for GPT-4o-mini to ~87% for o4-mini) makes LiveBench one of the most discriminating active benchmarks for frontier model comparison.

Controversies & Caveats

Known limitations and criticisms.

Task rotation means that scores from different months are not directly comparable — a model's score in January vs. July reflects different sets of questions.

Although the benchmark is contamination-resistant by design, its sources (arXiv, NYT, competition sites) may be scraped into training data faster than LiveBench can refresh, potentially narrowing the contamination gap over time.

The math category's difficulty level varies significantly month-to-month depending on which competitions released problems — making longitudinal comparison within the math category noisy.

Links