Math · Saturated

GSM8K

Grade-school math word problems requiring 2-8 step arithmetic reasoning.

Tasks: 8,500
Year: 2021
Creator: Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al.
Metric: Accuracy (% of problems with correct final numeric answer)
Random Chance: 0%

What It Tests

GSM8K (Grade School Math 8K) was created by OpenAI in 2021 to diagnose failures in multi-step mathematical reasoning. The insight was that even GPT-3 (175B parameters) — despite its scale — couldn't reliably solve problems that a bright 12-year-old could. The benchmark requires only basic arithmetic (+, -, ×, ÷) but chains 2–8 steps together in natural language, which was surprisingly difficult for 2021-era models.

The paper that introduced GSM8K also introduced verifier models — separate models trained to score candidate solutions — as a key technique for improving math performance. This idea later influenced RLHF development and process reward models.

GSM8K became the canonical benchmark for chain-of-thought (CoT) prompting research. The landmark Wei et al. (2022) paper demonstrated that CoT prompting roughly tripled PaLM's GSM8K accuracy (from ~18% with standard prompting to ~57%), establishing chain-of-thought as a fundamental technique. Self-consistency (majority voting over multiple CoT samples) further boosted PaLM's score to 74.4%.

GSM8K is completely saturated for frontier models. GPT-5 scores 99.7%, with multiple models above 98%. The benchmark is now used only as a baseline floor check — models scoring below 90% are clearly not competitive on math tasks. For frontier discrimination, researchers use AIME (annual competition problems whose recency makes training-data contamination less likely).

Task Anatomy

How a single task is structured.

Input: A natural-language word problem requiring 2–8 arithmetic steps to solve. No algebra, no calculus — just addition, subtraction, multiplication, and division applied to real-world scenarios.
Output: Model generates a free-form solution with reasoning steps, ending with the final numeric answer after '####'.
Evaluation: Exact match on the final numeric answer (extracted by splitting on '####'). Solutions may include calculator annotations like <<3+4=7>> to show intermediate calculations.
Metric: Accuracy (% of problems with correct final numeric answer)

Example Tasks

3 real examples from the benchmark.

#1

Simple — Duck eggs

Easy (2 steps)

Problem / Input

Janet's ducks lay 16 eggs per day. She eats 3 for breakfast every morning and bakes muffins for her friends every day with 4 eggs. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer: #### 18

This is the first problem in the dataset. Represents the easiest tier — simple sequential arithmetic.

#2

Simple — Robe fiber

Easy (2 steps)

Problem / Input

A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?
Answer: #### 3

Tests fraction understanding ('half that much') combined with addition.

#3

Multi-step — House flip

Medium (5 steps)

Problem / Input

Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?
Answer: #### 70000

Multi-step with percentage increase: requires recognizing that '150% increase' means multiplying the original price by 1.5, then adding to the original. A common stumbling block for weaker models.
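The reasoning chain for this problem can be written out as plain arithmetic, making the percentage-increase step explicit:

```python
# Worked arithmetic for the house-flip problem above.
purchase = 80_000
repairs = 50_000
total_cost = purchase + repairs        # 130,000 spent in total

# "Increased the value by 150%" means the *increase* is 1.5x the
# purchase price, not that the new value is 1.5x the purchase price.
increase = purchase * 150 // 100       # 120,000
new_value = purchase + increase        # 200,000

profit = new_value - total_cost
print(profit)                          # 200,000 - 130,000 = 70,000
```

Weaker models typically fail at the third step, computing the new value as 80,000 × 1.5 = 120,000 instead of 200,000.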

Leaderboard Results

Model scores sorted by performance.

#   Model               Score
1   GPT-5               99.7% *
2   o3                  99.2% *
3   Opus 4.5            ~98.5% *
4   Gemini 2.5 Pro      ~98% *
5   DeepSeek R1         97.7% *
6   Qwen 3 72B          ~97% *
7   GPT-4o              96.1% *
8   Llama 4 Maverick    ~95% *
9   Mistral Large 3     ~93% *

* = Self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • Score progression: GPT-3 (34.7%) → PaLM with CoT (56.5%) → GPT-4 (92%) → GPT-5 (99.7%). Complete saturation arc in 4 years.

  • This benchmark proved chain-of-thought prompting works: PaLM's GSM8K score jumped roughly 40 points when models were prompted to show their work step-by-step, establishing CoT as a fundamental technique.

  • Self-consistency (majority voting over multiple CoT samples) further boosted PaLM from 56.5% to 74.4% — demonstrating that sampling diversity compensates for reasoning unreliability.

  • Despite saturation, GSM8K remains useful as a baseline filter: models below 85% clearly have weak arithmetic reasoning, while models above 95% are not meaningfully ranked by this benchmark.
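The self-consistency technique described above amounts to sampling several chain-of-thought solutions and taking a majority vote on the final answer. A minimal sketch, where `sample_solution` stands in for a hypothetical call that runs the model once at nonzero temperature and returns the extracted final answer:

```python
from collections import Counter

def self_consistency(sample_solution, problem: str, n: int = 20) -> str:
    """Majority vote over n sampled chain-of-thought solutions.

    `sample_solution(problem)` is a hypothetical callable: one model run
    with temperature > 0, returning the final answer as a string.
    """
    answers = [sample_solution(problem) for _ in range(n)]
    # The most common final answer wins, regardless of which reasoning
    # path produced it -- diverse samples tend to converge on the truth.
    return Counter(answers).most_common(1)[0][0]
```

The method trades inference cost (n forward passes instead of one) for accuracy, which is why it boosted PaLM well beyond greedy CoT decoding.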

Controversies & Caveats

Known limitations and criticisms.

GSM8K is completely saturated: a 1% score difference at the 97–99% level corresponds to only ~13 problems on the 1,319-problem test set — statistically meaningless for model ranking.

Apple's GSM-Symbolic paper (2024) showed that models' scores drop when names and numbers in GSM8K problems are varied while the underlying reasoning stays the same — and drop far more when irrelevant clauses are added — suggesting that high scores partly reflect memorized solution patterns rather than genuine reasoning.

MR-GSM8K (meta-reasoning variant) asks models to identify errors in others' solutions; 95% GSM8K scorers often fail here — a different capability not captured by standard evaluation.

Training data contamination: GSM8K problems and chain-of-thought solutions are widely distributed on the internet and confirmed to appear in training corpora.
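One common heuristic for detecting the contamination described above is n-gram overlap between benchmark items and training documents. A naive sketch (real decontamination pipelines normalize text more aggressively and scan entire corpora):

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams; 13-gram overlap is a common heuristic."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_doc: str, n: int = 13) -> bool:
    """Flag a benchmark problem if any n-gram also appears in a training doc."""
    return bool(ngrams(benchmark_item, n) & ngrams(corpus_doc, n))
```

Overlap checks catch verbatim copies but miss paraphrases, which is one reason contamination estimates for GSM8K vary widely.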

Links