Math · Saturated

GSM8K

Grade-school math word problems requiring 2-8 step arithmetic reasoning.

Tasks: 8,500
Year: 2021
Creator: Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al.
Metric: Accuracy (% of problems with correct final numeric answer)
Random Chance: 0%

What It Tests

GSM8K (Grade School Math 8K) was created by OpenAI in 2021 to diagnose failures in multi-step mathematical reasoning. The insight was that even GPT-3 (175B parameters) — despite its scale — couldn't reliably solve problems that a bright 12-year-old could. The benchmark requires only basic arithmetic (+, -, ×, ÷) but chains 2–8 steps together in natural language, which was surprisingly difficult for 2021-era models.

The paper that introduced GSM8K also introduced verifier models — separate models trained to score candidate solutions — as a key technique for improving math performance. This idea later influenced RLHF development and process reward models.

GSM8K became the canonical benchmark for chain-of-thought (CoT) prompting research. The landmark Wei et al. (2022) paper demonstrated that CoT prompting roughly tripled PaLM's GSM8K accuracy (from ~18% with standard prompting to ~57%), establishing chain-of-thought as a fundamental technique. Self-consistency (majority voting over multiple CoT samples) further boosted PaLM's score to 74.4%.

GSM8K is completely saturated for frontier models. GPT-5 scores 99.7%, with multiple models above 98%. The benchmark is now used only as a baseline floor check — models scoring below 90% are clearly not competitive on math tasks. For frontier discrimination, researchers use AIME (annual competition problems whose recency makes training-data contamination less likely).

Task Anatomy

How a single task is structured.

Input: A natural-language word problem requiring 2–8 arithmetic steps to solve. No algebra, no calculus — just addition, subtraction, multiplication, and division applied to real-world scenarios.
Output: Model generates a free-form solution with reasoning steps, ending with the final numeric answer after '####'.
Evaluation: Exact match on the final numeric answer (extracted by splitting on '####'). Solutions may include calculator annotations like <<3+4=7>> to show intermediate calculations.
Metric: Accuracy (% of problems with correct final numeric answer)

Example Tasks

3 real examples from the benchmark.

#1

Simple — Duck eggs

Easy (2 steps)

Problem / Input

Janet's ducks lay 16 eggs per day. She eats 3 for breakfast every morning and bakes muffins for her friends every day with 4 eggs. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer: #### 18

This is the first problem in the dataset. Represents the easiest tier — simple sequential arithmetic.

#2

Simple — Robe fiber

Easy (2 steps)

Problem / Input

A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?
Answer: #### 3

Tests fraction understanding ('half that much') combined with addition.

#3

Multi-step — House flip

Medium (5 steps)

Problem / Input

Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?
Answer: #### 70000

Multi-step with percentage increase: requires recognizing that '150% increase' means multiplying the original price by 1.5, then adding to the original. A common stumbling block for weaker models.
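The reasoning chain for this problem can be written out as plain arithmetic, making the percentage-increase step explicit:

```python
# Worked arithmetic for the house-flip problem above.
purchase = 80_000
repairs = 50_000
total_cost = purchase + repairs        # 130,000 spent in total

# "Increased the value by 150%" means the *increase* is 1.5x the
# purchase price, not that the new value is 1.5x the purchase price.
increase = purchase * 150 // 100       # 120,000
new_value = purchase + increase        # 200,000

profit = new_value - total_cost
print(profit)                          # 200,000 - 130,000 = 70,000
```

Weaker models typically fail at the third step, computing the new value as 80,000 × 1.5 = 120,000 instead of 200,000.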

Leaderboard Results

Model scores sorted by performance.

#   Model               Score
1   GPT-5               99.7% *
2   o3                  99.2% *
3   Opus 4.5            ~98.5% *
4   Gemini 2.5 Pro      ~98% *
5   DeepSeek R1         97.7% *
6   Qwen 3 72B          ~97% *
7   GPT-4o              96.1% *
8   Llama 4 Maverick    ~95% *
9   Mistral Large 3     ~93% *

* = Self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • Score progression: GPT-3 (34.7%) → PaLM with CoT (56.5%) → GPT-4 (92%) → GPT-5 (99.7%). Complete saturation arc in 4 years.

  • This benchmark proved chain-of-thought prompting works: PaLM's GSM8K score jumped roughly 40 points when models were prompted to show their work step-by-step, establishing CoT as a fundamental technique.

  • Self-consistency (majority voting over multiple CoT samples) further boosted PaLM from 56.5% to 74.4% — demonstrating that sampling diversity compensates for reasoning unreliability.

  • Despite saturation, GSM8K remains useful as a baseline filter: models below 85% clearly have weak arithmetic reasoning, while models above 95% are not meaningfully ranked by this benchmark.
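The self-consistency technique described above amounts to sampling several chain-of-thought solutions and taking a majority vote on the final answer. A minimal sketch, where `sample_solution` stands in for a hypothetical call that runs the model once at nonzero temperature and returns the extracted final answer:

```python
from collections import Counter

def self_consistency(sample_solution, problem: str, n: int = 20) -> str:
    """Majority vote over n sampled chain-of-thought solutions.

    `sample_solution(problem)` is a hypothetical callable: one model run
    with temperature > 0, returning the final answer as a string.
    """
    answers = [sample_solution(problem) for _ in range(n)]
    # The most common final answer wins, regardless of which reasoning
    # path produced it -- diverse samples tend to converge on the truth.
    return Counter(answers).most_common(1)[0][0]
```

The method trades inference cost (n forward passes instead of one) for accuracy, which is why it boosted PaLM well beyond greedy CoT decoding.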

Controversies & Caveats

Known limitations and criticisms.

GSM8K is completely saturated: a 1% score difference at the 97–99% level corresponds to only ~13 problems on the 1,319-problem test set — statistically meaningless for model ranking.

Apple's GSM-Symbolic paper (2024) showed that models' scores drop when names and numbers in GSM8K problems are varied while the underlying reasoning stays the same — and drop far more when irrelevant clauses are added — suggesting that high scores partly reflect memorized solution patterns rather than genuine reasoning.

MR-GSM8K (meta-reasoning variant) asks models to identify errors in others' solutions; 95% GSM8K scorers often fail here — a different capability not captured by standard evaluation.

Training data contamination: GSM8K problems and chain-of-thought solutions are widely distributed on the internet and confirmed to appear in training corpora.
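One common heuristic for detecting the contamination described above is n-gram overlap between benchmark items and training documents. A naive sketch (real decontamination pipelines normalize text more aggressively and scan entire corpora):

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams; 13-gram overlap is a common heuristic."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_doc: str, n: int = 13) -> bool:
    """Flag a benchmark problem if any n-gram also appears in a training doc."""
    return bool(ngrams(benchmark_item, n) & ngrams(corpus_doc, n))
```

Overlap checks catch verbatim copies but miss paraphrases, which is one reason contamination estimates for GSM8K vary widely.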

Links