GSM8K
Grade-school math word problems requiring 2-8 step arithmetic reasoning.
What It Tests
GSM8K (Grade School Math 8K) was created by OpenAI in 2021 to diagnose failures in multi-step mathematical reasoning. The motivating observation was that even GPT-3 (175B parameters), despite its scale, couldn't reliably solve problems a bright 12-year-old could. The benchmark requires only basic arithmetic (+, -, ×, ÷) but chains 2–8 steps together in natural language, which was surprisingly difficult for 2021-era models.
The paper that introduced GSM8K also introduced verifier models — separate models trained to score candidate solutions — as a key technique for improving math performance. This idea later influenced RLHF development and process reward models.
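The verifier idea can be sketched as best-of-N reranking: sample many candidate solutions, score each with the verifier, and keep the highest-scoring one. This is a minimal illustration, not the paper's implementation; `sample_solution` and `verifier_score` below are hypothetical stand-ins for the generator and verifier models.

```python
from itertools import cycle

def best_of_n(problem, sample_solution, verifier_score, n=100):
    """Return the candidate solution the verifier rates most likely correct."""
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verifier_score(problem, sol))

# Toy demo with deterministic stubs standing in for the two models:
_canned = cycle(["reasoning A #### 18", "reasoning B #### 20", "reasoning C #### 18"])
stub_sample = lambda problem: next(_canned)
stub_score = lambda problem, sol: 0.9 if sol.endswith("18") else 0.2

best = best_of_n("example word problem", stub_sample, stub_score, n=5)
print(best)  # -> reasoning A #### 18
```

The key design point is that verification is often easier than generation: the verifier only has to rank complete solutions, not produce them.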
GSM8K became the canonical benchmark for chain-of-thought (CoT) prompting research. The landmark Wei et al. (2022) paper demonstrated that CoT prompting roughly tripled PaLM's GSM8K accuracy, establishing chain-of-thought as a fundamental technique. Self-consistency (majority voting over multiple CoT samples) pushed PaLM's score further, to 74.4%.
GSM8K is completely saturated for frontier models: GPT-5 scores 99.7%, and multiple models sit above 98%. The benchmark now serves only as a baseline floor check; models scoring below 90% are clearly not competitive on math tasks. For frontier discrimination, researchers have moved to benchmarks like AIME, whose problems are newly written each year, limiting training-data contamination.
Task Anatomy
How a single task is structured.
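Each GSM8K item pairs a question with a worked natural-language solution. Intermediate steps carry calculator annotations like `<<48/2=24>>`, and the solution's last line is always `#### <answer>`; grading extracts that number and compares it to the model's final answer. A minimal extraction sketch (the solution string below is illustrative, not a real dataset entry):

```python
import re

def extract_answer(solution: str) -> str:
    """Extract the final numeric answer after the '####' delimiter."""
    match = re.search(r"####\s*([-\d.,]+)", solution)
    if match is None:
        raise ValueError("no '####' answer delimiter found")
    # Strip thousands separators so "1,234" and "1234" compare equal.
    return match.group(1).replace(",", "")

gold = "Half of 48 is 48 / 2 = <<48/2=24>>24. In total, 48 + 24 = <<48+24=72>>72.\n#### 72"
print(extract_answer(gold))  # -> 72
```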
Example Tasks
3 real examples from the benchmark.
Simple — Duck eggs
Problem / Input
This is the first problem in the dataset. Represents the easiest tier — simple sequential arithmetic.
Simple — Robe fiber
Problem / Input
Tests fraction understanding ('half that much') combined with addition.
Multi-step — House flip
Problem / Input
Multi-step with percentage increase: requires recognizing that '150% increase' means multiplying the original price by 1.5, then adding to the original. A common stumbling block for weaker models.
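The "150% increase" arithmetic can be made concrete with illustrative numbers (these are not the actual benchmark problem's values):

```python
# A 150% increase ADDS 1.5x the original value, i.e. multiplies it by 2.5.
original = 80_000
new_value = original + 1.50 * original   # 80_000 + 120_000
print(new_value)  # -> 200000.0

# The common mistake is reading "increased by 150%" as "new value = 1.5x":
mistaken = 1.50 * original               # 120000.0, understates the result
```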
Leaderboard Results
Model scores sorted by performance.
9 results
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 99.7%* |
| 2 | o3 | 99.2%* |
| 3 | Opus 4.5 | ~98.5%* |
| 4 | Gemini 2.5 Pro | ~98%* |
| 5 | DeepSeek R1 | 97.7%* |
| 6 | Qwen 3 72B | ~97%* |
| 7 | GPT-4o | 96.1%* |
| 8 | Llama 4 Maverick | ~95%* |
| 9 | Mistral Large 3 | ~93%* |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
Score progression: GPT-3 (34.7%) → PaLM with CoT (56.5%) → GPT-4 (92%) → GPT-5 (99.7%). Complete saturation arc in 4 years.
This benchmark proved chain-of-thought prompting works: PaLM's GSM8K score jumped ~30 points when models were prompted to show their work step-by-step, establishing CoT as a fundamental technique.
Self-consistency (majority voting over multiple CoT samples) further boosted PaLM from 56.5% to 74.4% — demonstrating that sampling diversity compensates for reasoning unreliability.
Despite saturation, GSM8K remains useful as a baseline filter: models below 85% clearly have weak arithmetic reasoning, while models above 95% are not meaningfully ranked by this benchmark.
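The self-consistency trick described above reduces to a few lines: sample several independent chain-of-thought completions, extract each final answer, and take a majority vote. The answer strings below are stand-ins for extracted model outputs.

```python
from collections import Counter

def self_consistency(final_answers):
    """Return the most common final answer across CoT samples."""
    (winner, _count), = Counter(final_answers).most_common(1)
    return winner

# Five sampled completions whose extracted final answers disagree:
votes = ["72", "72", "68", "72", "80"]
print(self_consistency(votes))  # -> 72
```

Because wrong reasoning paths tend to scatter across different wrong answers while correct paths converge, the vote recovers the right answer even when most individual samples are unreliable.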
Controversies & Caveats
Known limitations and criticisms.
GSM8K is completely saturated: a 1-point score difference at the 97–99% level corresponds to only ~13 problems on the 1,319-problem test set, which is statistically meaningless for model ranking.
Apple's GSM-Symbolic paper (2024) showed that models scoring 95%+ on GSM8K drop to 65–70% when numbers are replaced with variables/symbols requiring the same reasoning — demonstrating that high scores partly reflect memorized solution patterns rather than genuine reasoning.
MR-GSM8K (meta-reasoning variant) asks models to identify errors in others' solutions; 95% GSM8K scorers often fail here — a different capability not captured by standard evaluation.
Training data contamination: GSM8K problems and chain-of-thought solutions are widely distributed on the internet and confirmed to appear in training corpora.
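The statistical point about saturation can be made concrete with a back-of-the-envelope estimate, assuming each of the 1,319 test problems is an independent Bernoulli trial:

```python
import math

n = 1319   # GSM8K test set size
p = 0.98   # a score near the top of the leaderboard

# Standard error of one model's score, in percentage points:
se_points = 100 * math.sqrt(p * (1 - p) / n)   # ~0.39 points
# For the GAP between two such models, the noise is ~sqrt(2) larger:
se_diff = se_points * math.sqrt(2)             # ~0.55 points
print(round(se_points, 2), round(se_diff, 2))
```

Under these assumptions, a ~1-point gap between two models at the 98% level is within sampling noise, so leaderboard orderings at the top are not meaningful.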