MATH · Nearing Saturation

MATH Benchmark

Competition-level math problems across 7 subjects, from AMC to AIME difficulty.

Tasks: 12,500
Year: 2021
Creator: Dan Hendrycks, Collin Burns, Saurav Kadavath, et al.
Metric: Accuracy (% of problems with correct final answer)
Random Chance: 0%

What It Tests

The MATH benchmark, introduced at NeurIPS 2021 by Dan Hendrycks and colleagues at UC Berkeley, was designed to be a lasting challenge for AI systems. The authors sourced 12,500 problems from real high-school math competitions (AMC 8, AMC 10, AMC 12, AIME) to ensure human-validated difficulty and rich step-by-step solutions.

The benchmark spans 7 subject areas — Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus — with 5 difficulty levels (Level 1 = trivial, Level 5 = AIME-difficulty). Problems are written in LaTeX and models must produce answers in \boxed{} notation, evaluated by symbolic equivalence checking.

MATH was transformative: when introduced, GPT-3 scored only 6.9% and even Minerva (Google's specialized math model) reached 50.3% in 2022. The benchmark validated that math competition problems required genuine multi-step reasoning, not just pattern matching. However, frontier reasoning models (o3, GPT-5, Gemini 3 Pro) now score 93-97% on the widely-used MATH-500 subset, rendering it effectively saturated for frontier model comparison.

Most modern evaluations use MATH-500, a representative 500-problem subset, rather than the full test set. The vals.ai leaderboard stopped running new frontier models on MATH-500 in 2025, citing saturation and contamination risk.

Task Anatomy

How a single task is structured.

Input: A natural-language math problem written in LaTeX, often from a real competition. Problems range from single-step arithmetic to multi-step competition problems requiring creative insight.
Output: The model generates a full step-by-step solution ending with the final answer inside \boxed{} LaTeX notation.
Evaluation: Symbolic equivalence checking, so ½, 0.5, and \frac{1}{2} all count as the same answer. Run with 4-shot examples (the Minerva convention) or 0-shot.
Metric: Accuracy (% of problems with correct final answer)
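
To make the evaluation concrete, here is a minimal sketch of \boxed{} extraction plus SymPy-based equivalence checking. It is illustrative, not the benchmark's official grader: harnesses differ in their normalizers, extract_boxed and is_equivalent are hypothetical helper names, and parse_latex requires the antlr4-python3-runtime package.

```python
# Illustrative sketch of MATH-style answer grading (not the official harness).
from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def extract_boxed(solution: str) -> str:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return ""
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(solution) and depth > 0:
        ch = solution[i]
        depth += (ch == "{") - (ch == "}")
        if depth > 0:
            out.append(ch)
        i += 1
    return "".join(out)

def is_equivalent(pred: str, gold: str) -> bool:
    """True if two LaTeX answers simplify to the same value."""
    try:
        return simplify(parse_latex(pred) - parse_latex(gold)) == 0
    except Exception:  # unparseable LaTeX: fall back to exact string match
        return pred.strip() == gold.strip()

answer = extract_boxed(r"... so the answer is $\boxed{\frac{1}{2}}$.")
print(is_equivalent(answer, "0.5"))  # True: \frac{1}{2} equals 0.5
```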

Example Tasks

3 real examples from the benchmark.

#1

Prealgebra — Level 1

Problem / Input

What is the value of (6² + 6) ÷ 6?
Answer: \boxed{7}

The simplest MATH problems. Level 1 comprises ~7% of the test set. Modern models get near 100% on these.

#2

Algebra — Level 3

Problem / Input

If x + y = 7 and x² + y² = 29, what is xy?
Answer: \boxed{10}

Requires recognizing the algebraic identity (x+y)² = x² + 2xy + y² and substituting the given values — a non-obvious two-step approach.
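
Worked out: (x + y)² = x² + 2xy + y² gives 7² = 29 + 2xy, so 2xy = 49 - 29 = 20 and xy = 10.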

#3

Number Theory — Level 5

Problem / Input

How many positive integers less than 1000 are divisible by at least one of 3, 5, or 7?
Answer: \boxed{542}

Level 5 problems require creative combinatorial insight. The inclusion-exclusion principle must be applied correctly with 7 terms. These problems still discriminate between reasoning models.
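
As a sanity check on that count, the inclusion-exclusion arithmetic can be verified in a few lines of Python (the helper below is ad hoc, not part of the benchmark's tooling):

```python
# Verify the Level 5 example: positive integers < 1000 divisible by 3, 5, or 7.
def multiples_below(n: int, d: int) -> int:
    """Count positive integers strictly less than n divisible by d."""
    return (n - 1) // d

N = 1000
# Inclusion-exclusion: 3 single terms - 3 pairwise terms + 1 triple term = 7 terms.
by_inclusion_exclusion = (
    multiples_below(N, 3) + multiples_below(N, 5) + multiples_below(N, 7)
    - multiples_below(N, 15) - multiples_below(N, 21) - multiples_below(N, 35)
    + multiples_below(N, 105)
)
brute_force = sum(1 for k in range(1, N) if k % 3 == 0 or k % 5 == 0 or k % 7 == 0)
assert by_inclusion_exclusion == brute_force == 542
```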

Leaderboard Results

Model scores sorted by performance.

10 results

 #   Model              Score
 1   DeepSeek V3        97.3% *
 2   DeepSeek R1        97.3% *
 3   Gemini 3 Pro       96.4% *
 4   Qwen 3 72B         96.4% *
 5   GPT-5              ~94% *
 6   o4-mini            ~93% *
 7   o3                 ~91.5% *
 8   Opus 4.5           ~90% *
 9   Llama 4 Maverick   ~85% *
10   Mistral Large 3    ~80% *

* Self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • Score progression from 6.9% (GPT-3, 2021) to 97% (DeepSeek R1 / Qwen3, 2025) is one of AI's most dramatic benchmark arcs — a nearly 14x improvement in 4 years.

  • Minerva (Google, 2022) was the first model to break 50%, establishing that specialized math training was necessary before general scale caught up.

  • The benchmark proved that chain-of-thought prompting was essential: early few-shot prompting with step-by-step examples provided a ~20-30% boost over direct answer prompting (see the prompt sketch after this list).

  • Level 5 problems remain harder for most models (~65-70% for non-reasoning models), but frontier reasoning models now exceed 90% even there, making fresh AIME problems the better discriminator.
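
For reference, here is a minimal sketch of the Minerva-style few-shot format mentioned above. The exemplars, delimiters, and the build_prompt helper are illustrative assumptions, not the exact published prompt (Minerva uses its own fixed four worked examples).

```python
# Hedged sketch of a Minerva-style 4-shot chain-of-thought prompt.
FEW_SHOT = [
    (r"What is the value of $(6^2 + 6) \div 6$?",
     r"$6^2 + 6 = 42$, and $42 \div 6 = 7$. Final answer: $\boxed{7}$."),
    # ... three more (problem, step-by-step solution) pairs ...
]

def build_prompt(problem: str) -> str:
    """Prepend worked examples so the model imitates step-by-step reasoning."""
    shots = "\n\n".join(f"Problem: {q}\nSolution: {a}" for q, a in FEW_SHOT)
    return f"{shots}\n\nProblem: {problem}\nSolution:"
```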

Variants & Related

MATH-500

500 tasks · Saturated

Representative 500-problem subset introduced by Lightman et al. (Let's Verify Step by Step). Used for most modern evaluations due to cost.

AIME

30 tasks · Active

The actual American Invitational Mathematics Examination used as a live AI benchmark. New problems each February prevent contamination. See separate AIME entry.

Controversies & Caveats

Known limitations and criticisms.

Severe contamination: the full dataset has been publicly available on GitHub since 2021 and has appeared in most model training corpora. High scores on MATH-500 almost certainly reflect partial memorization.

Evaluation ambiguity: different symbolic normalizers give different scores. Simple string matching, SymPy symbolic checking, and regex-based normalization produce different pass rates for the same model outputs.
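
A toy illustration of that gap (tied to no particular harness; sympify is SymPy's generic expression parser): exact string matching rejects an answer that symbolic checking accepts.

```python
# Same model output, two normalizers, two verdicts.
from sympy import simplify, sympify

gold, pred = "1/2", "0.5"
string_match = pred.strip() == gold.strip()                    # False
symbolic_match = simplify(sympify(pred) - sympify(gold)) == 0  # True
print(string_match, symbolic_match)  # False True
```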

MATH Level 5 ≠ AIME: the benchmark describes Level 5 as 'AIME difficulty' but fresh AIME competition problems (released annually) clearly discriminate frontier models better, since they cannot be in training data.

vals.ai stopped running new frontier models on MATH-500 in 2025 due to saturation and contamination, recommending AIME 2025 for any meaningful frontier model comparison.

Links