MATH Benchmark
Competition-level math problems across 7 subjects, from AMC to AIME difficulty.
What It Tests
The MATH benchmark, introduced at NeurIPS 2021 by Dan Hendrycks and colleagues at UC Berkeley, was designed to be a lasting challenge for AI systems. The authors sourced 12,500 problems from real high-school math competitions (AMC 8, AMC 10, AMC 12, AIME) to ensure human-validated difficulty and rich step-by-step solutions.
The benchmark spans 7 subject areas — Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, and Precalculus — with 5 difficulty levels (Level 1 = easiest, Level 5 = roughly AIME difficulty). Problems are written in LaTeX, and models must produce final answers in \boxed{} notation, which are evaluated by equivalence checking.
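As a concrete illustration, here is a minimal Python sketch of boxed-answer extraction. This is one workable approach, not the official grading script; it balances braces because answers like \boxed{\frac{1}{2}} contain nested groups.

```python
def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a solution string."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    # Walk forward, balancing braces, since answers such as
    # \boxed{\frac{1}{2}} contain nested groups.
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(solution):
        c = solution[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(c)
        i += 1
    return None  # unbalanced braces


print(extract_boxed(r"So the sum is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```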
MATH was transformative: when it was introduced, GPT-3 scored only 6.9%, and even Minerva (Google's specialized math model) reached just 50.3% in 2022. The benchmark validated that math competition problems require genuine multi-step reasoning, not just pattern matching. However, frontier reasoning models (o3, GPT-5, Gemini 3 Pro) now score 93-97% on the widely used MATH-500 subset, rendering it effectively saturated for frontier model comparison.
Most modern evaluations use MATH-500, a representative 500-problem subset. The vals.ai leaderboard stopped running new frontier models on MATH-500 in 2025, citing saturation and contamination risk.
Task Anatomy
How a single task is structured.
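Each task in the released dataset is a small JSON record. It is shown here as a Python dict with the field names from the public GitHub release; the problem and solution text below are illustrative, not an actual dataset entry.

```python
# Field names match the public MATH release; the content is illustrative.
example_task = {
    "problem": "What is $1+2+\\cdots+10$?",      # LaTeX problem statement
    "level": "Level 1",                          # difficulty, "Level 1".."Level 5"
    "type": "Prealgebra",                        # one of the 7 subjects
    "solution": "By the formula, $\\frac{10 \\cdot 11}{2} = \\boxed{55}$.",
    # step-by-step solution ending in \boxed{} with the final answer
}
```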
Example Tasks
3 real examples from the benchmark (problem text omitted here).
Prealgebra — Level 1
The simplest MATH problems. Level 1 comprises ~7% of the test set; modern models score near 100% on these.
Algebra — Level 3
Requires recognizing the algebraic identity (x+y)² = x² + 2xy + y² and substituting the given values — a non-obvious two-step approach.
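A hypothetical problem of this shape (not the actual benchmark item, which is omitted here): given x + y = 6 and xy = 5, find x² + y² without solving for x and y.

```latex
x^2 + y^2 = (x + y)^2 - 2xy = 6^2 - 2 \cdot 5 = 36 - 10 = 26
```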
Number Theory — Level 5
Level 5 problems require creative combinatorial insight. The inclusion-exclusion principle must be applied correctly with 7 terms. These problems still discriminate between reasoning models.
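The seven terms presumably come from inclusion-exclusion over three sets, which expands as:

```latex
|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|
```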
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | DeepSeek V3 | 97.3%* |
| 2 | DeepSeek R1 | 97.3%* |
| 3 | Gemini 3 Pro | 96.4%* |
| 4 | Qwen 3 72B | 96.4%* |
| 5 | GPT-5 | ~94%* |
| 6 | o4-mini | ~93%* |
| 7 | o3 | ~91.5%* |
| 8 | Opus 4.5 | ~90%* |
| 9 | Llama 4 Maverick | ~85%* |
| 10 | Mistral Large 3 | ~80%* |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
[Chart: performance progression across model generations; key milestones summarized under Key Findings below.]
Key Findings
Score progression from 6.9% (GPT-3, 2021) to 97% (DeepSeek R1 / Qwen3, 2025) is one of AI's most dramatic benchmark arcs — a nearly 14x improvement in 4 years.
Minerva (Google, 2022) was the first model to break 50%, establishing that specialized math training was necessary before general scale caught up.
The benchmark showed that chain-of-thought prompting was essential: early few-shot prompting with step-by-step worked examples provided a ~20-30% boost over direct answer prompting (a minimal prompt sketch follows this list).
Level 5 problems remain harder for most models (~65-70% for non-reasoning models), but frontier reasoning models now exceed 90% even there, leaving fresh AIME problems as the better discriminator.
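A minimal sketch of the few-shot chain-of-thought prompt format those early evaluations used (the exemplar here is illustrative, not one of the paper's actual few-shot examples):

```python
# Illustrative few-shot chain-of-thought prompt builder.
EXEMPLARS = [
    ("What is $2 + 3 \\cdot 4$?",
     "Multiplication first: $3 \\cdot 4 = 12$. Then $2 + 12 = 14$. "
     "Final answer: $\\boxed{14}$."),
]

def build_cot_prompt(problem: str) -> str:
    # Prepend worked examples so the model imitates step-by-step reasoning.
    blocks = [f"Problem: {q}\nSolution: {a}" for q, a in EXEMPLARS]
    blocks.append(f"Problem: {problem}\nSolution:")
    return "\n\n".join(blocks)
```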
Variants & Related
MATH-500
Representative 500-problem subset introduced by Lightman et al. (Let's Verify Step by Step). Used for most modern evaluations due to cost.
AIME
The actual American Invitational Mathematics Examination used as a live AI benchmark. New problems each February prevent contamination. See separate AIME entry.
Controversies & Caveats
Known limitations and criticisms.
Severe contamination: the full dataset has been publicly available on GitHub since 2021 and has appeared in most model training corpora. High scores on MATH-500 almost certainly reflect partial memorization.
Evaluation ambiguity: different answer-equivalence checkers give different scores. Simple string matching, SymPy symbolic checking, and regex-based normalization produce different pass rates for the same model outputs (see the sketch after this list).
MATH Level 5 ≠ AIME: the benchmark describes Level 5 as 'AIME difficulty' but fresh AIME competition problems (released annually) clearly discriminate frontier models better, since they cannot be in training data.
vals.ai stopped running new frontier models on MATH-500 in 2025 due to saturation and contamination, recommending AIME 2025 for any meaningful frontier model comparison.
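To make the evaluation ambiguity concrete, here is a minimal sketch (not any leaderboard's actual grader) contrasting naive string matching with SymPy-based symbolic checking. Answers that are mathematically equal can disagree under string comparison:

```python
import sympy


def string_match(pred: str, gold: str) -> bool:
    # Naive grading: exact string equality after trimming whitespace.
    return pred.strip() == gold.strip()


def symbolic_match(pred: str, gold: str) -> bool:
    # Symbolic grading: parse both answers and check that their
    # difference simplifies to zero.
    try:
        diff = sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        return False


# "1/2" and "0.5" are the same number, but only the symbolic check agrees.
print(string_match("1/2", "0.5"))              # False
print(symbolic_match("1/2", "0.5"))            # True
print(symbolic_match("sqrt(8)", "2*sqrt(2)"))  # True
```

A grader built on string matching would mark the first pair wrong and the model's score would drop accordingly, which is exactly how different normalizers produce different pass rates for identical outputs.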