LiveBench
A contamination-resistant benchmark, refreshed monthly from recently released sources and scored without an LLM judge.
What It Tests
LiveBench tackles two core problems with standard benchmarks at once: training-data contamination and LLM-as-judge bias. Created by Colin White, Samuel Dooley, and colleagues and published at ICLR 2025 (Spotlight), it draws questions from recently published arXiv papers, news articles, math competition releases, and new datasets, so the questions could not have appeared in any model's training data.
All answers are objectively verifiable, eliminating the need for an LLM judge, which introduces judge-model bias and tends to favor outputs similar to the judge's own. Questions are updated monthly, and older questions are retired as the risk of contamination grows.
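Judge-free scoring reduces to a deterministic comparison against ground truth. A minimal sketch of the idea in Python (the function and the "Answer:" parsing convention are illustrative, not LiveBench's actual code):

```python
import re

def score_objective(response: str, ground_truth: str) -> float:
    """Deterministically score a response against a verifiable answer.

    Sketch only: real harnesses use per-task parsers, but the key property
    holds either way -- scoring is mechanical, with no LLM judge in the loop.
    """
    # Take the text after an "Answer:" marker if one is present.
    match = re.search(r"[Aa]nswer:\s*(.+)", response)
    candidate = (match.group(1) if match else response).strip().rstrip(".")
    # Normalized exact match: returns 1.0 or 0.0, never a judge's opinion.
    return 1.0 if candidate.lower() == ground_truth.strip().lower() else 0.0
```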
The benchmark covers 7 categories and 23 tasks. Representative tasks include: Reasoning (spatial, temporal, counting, Web of Lies, logical deduction), Coding (code generation, code completion), Mathematics (AMPS-hard, AIME-style, Olympiad), Language (Connections, plot unscrambling, word sorting, typos), Data Analysis (tabular reasoning, forecasting), Instruction Following (length constraints, format constraints), and Agentic Coding (multi-file programming tasks from recent papers).
LiveBench is designed to remain challenging indefinitely: as models improve, harder problems are added from the same freshly released sources, always including the hardest problems from new math competitions and the most demanding tasks from new research papers.
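The rotation can be pictured as a simple age filter over the question pool, as in the sketch below; the 180-day window is an assumed parameter, not LiveBench's actual policy:

```python
from datetime import date, timedelta

def active_questions(pool: list[dict], today: date, max_age_days: int = 180) -> list[dict]:
    """Keep only questions whose source is recent enough to resist contamination.

    Sketch of the rotation idea: a question ages out once its source has been
    public long enough that it may have entered training corpora.
    """
    cutoff = today - timedelta(days=max_age_days)
    # Each question records the publication date of its underlying source.
    return [q for q in pool if q["released"] >= cutoff]
```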
Task Anatomy
How a single task is structured.
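A hypothetical sketch of what a single task record might contain; the field names are assumptions for illustration, not LiveBench's actual schema:

```python
# Hypothetical task record; field names are illustrative, not LiveBench's schema.
task = {
    "category": "math",                    # one of the 7 categories
    "task": "olympiad",                    # task within the category
    "question": "Find the number of ...",  # prompt shown to the model
    "ground_truth": "204",                 # objectively verifiable answer
    "release_date": "2025-03-01",          # source publication date, drives rotation
    "livebench_release": "2025-04",        # monthly refresh this question joined
}
```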
Example Tasks
3 real examples from the benchmark.
Mathematics — Recent competition problem
Because this problem is drawn from a competition released after training cutoffs, models cannot have memorized the answer. This is the core anti-contamination property.
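Because the answer is an exact value, grading reduces to a deterministic comparison. A minimal sketch, assuming the extracted final answer is an integer or simple fraction (not LiveBench's actual parser):

```python
from fractions import Fraction

def check_math_answer(response: str, expected: str) -> bool:
    """Compare a final math answer exactly, as a rational number.

    Competition answers are typically integers or simple fractions, so exact
    rational comparison avoids floating-point round-off entirely.
    """
    try:
        return Fraction(response.strip()) == Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        return False
```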
Language — Connections puzzle
Connections puzzles require careful disambiguation of words with multiple meanings. The puzzle is drawn from a specific NYT release date after training cutoffs.
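Scoring such a puzzle is mechanical: compare the predicted groups to the solution as unordered sets of words. An illustrative scorer (all names are hypothetical):

```python
def score_connections(predicted: list[list[str]], solution: list[list[str]]) -> float:
    """Return the fraction of groups recovered exactly, order-insensitive.

    Each group is treated as a set of words, so credit depends only on which
    words are grouped together, not on group or word ordering.
    """
    pred_sets = {frozenset(w.lower() for w in g) for g in predicted}
    sol_sets = {frozenset(w.lower() for w in g) for g in solution}
    return len(pred_sets & sol_sets) / len(sol_sets)
```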
Data Analysis — Forecasting from recent data
Data analysis tasks use real recent datasets — the answers are objectively correct based on actual measured values, not opinion.
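One hedged sketch of how a numeric forecast could be graded against the measured value; the relative-tolerance rule and the 5% figure are assumptions, not LiveBench's actual grading logic:

```python
def score_forecast(predicted: float, actual: float, rel_tol: float = 0.05) -> bool:
    """Mark a numeric forecast correct if it is within a relative tolerance.

    The measured value is the ground truth; no judge model is involved.
    """
    if actual == 0.0:
        return abs(predicted) <= rel_tol  # fall back to an absolute band at zero
    return abs(predicted - actual) / abs(actual) <= rel_tol
```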
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | o4-mini | 87.3% |
| 2 | GPT-5 | ~82% |
| 3 | Opus 4.5 | ~79% |
| 4 | Gemini 2.5 Pro | ~76% |
| 5 | Qwen 3 72B | ~73% |
| 6 | DeepSeek R1 | ~72% |
| 7 | Llama 4 Maverick | ~67% |
| 8 | GPT-4o | ~53% |
Some scores are self-reported by the model's creator and have not been independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
LiveBench provided one of the clearest empirical demonstrations that static benchmarks are contaminated: models that scored highly on MATH-500 performed significantly worse on LiveBench math questions from the same difficulty tier.
The absence of an LLM judge is a key design win: removing judge-model bias makes LiveBench scores more reliable than those from benchmarks evaluated by GPT-4-class judges, which tend to favor outputs similar to their own.
The monthly refresh is the benchmark's core structural advantage: as AI capabilities improve, the question pool picks up harder problems from the latest competitions and papers by design.
The score spread across the model population (~35% for GPT-4o-mini to ~87% for o4-mini) makes LiveBench one of the most discriminating active benchmarks for frontier model comparison.
Controversies & Caveats
Known limitations and criticisms.
Task rotation means that scores from different months are not directly comparable — a model's score in January vs. July reflects different sets of questions.
Although contamination-resistant by design, its sources (arXiv, NYT, competition sites) may be scraped into training data faster than LiveBench can rotate questions, potentially narrowing the contamination gap over time.
The math category's difficulty varies significantly month to month, depending on which competitions have released problems, which makes longitudinal comparison within the math category noisy.