
AIME

Annual olympiad-level math competition used as a fresh, contamination-proof AI benchmark.

Tasks: 30
Year: 2024
Creator: Mathematical Association of America (competition); adopted as an AI benchmark 2023–2024
Metric: % correct (out of 15 problems per exam, typically reported as % of 30 for combined AIME I + II)
Human Baseline: 40%
Random Chance: ~0.1% (uniform guessing over 1,000 possible answers)

What It Tests

AIME (American Invitational Mathematics Examination) is a real annual competition administered by the Mathematical Association of America since 1983. AI researchers adopted it as a benchmark because new problems are released each February — making each year's AIME impossible to include in training data before the competition runs. This gives AIME a unique property: it is perpetually fresh.

Each AIME consists of 15 problems with integer answers in the range 000–999. Problems require combining multiple mathematical concepts — number theory, combinatorics, geometry, algebra — in non-obvious ways. Qualifying requires scoring in roughly the top 5% on the AMC 10/12, and even these successful competitors typically solve only 4–6 of the 15 problems.

The performance leap on AIME has been extraordinary: GPT-3-era models scored ~3% (near-random guessing). o1 (December 2024) scored 74.3% on AIME 2024. o3 reached 96.7% on AIME 2024. By 2025, frontier models routinely exceed the performance of virtually all human competitors.

Critically, scores drop 10-20 points when comparing AIME 2024 vs AIME 2025 for the same model — confirming some training contamination on older exams. This year-over-year delta is used to estimate contamination levels. AIME 2025 is the current clean reference, and fresh AIME problems each February will continue to provide contamination-proof evaluation.

Task Anatomy

How a single task is structured.

Input: A competition math problem requiring multi-step reasoning, typically combining 2–4 mathematical areas. Problems are self-contained with no external references.
Output: A single integer between 000 and 999.
Evaluation: Exact match on the integer answer. No partial credit. Typically pass@1 (single attempt); some models report consensus@k (majority vote over k samples).
Metric: % correct (out of 15 problems per exam, typically reported as % of 30 for combined AIME I + II)
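The grading protocol above (exact match on a 3-digit integer, scored either pass@1 or by majority vote) can be sketched in a few lines. This is an illustrative sketch, not code from any official harness; all function names here are invented.

```python
from collections import Counter

def exact_match(pred: str, answer: str) -> bool:
    """AIME grading: exact match on the zero-padded 3-digit integer."""
    return pred.strip().zfill(3) == answer.strip().zfill(3)

def pass_at_1(predictions: list[str], answers: list[str]) -> float:
    """Fraction correct with a single attempt per problem (pass@1)."""
    correct = sum(exact_match(p, a) for p, a in zip(predictions, answers))
    return correct / len(answers)

def consensus_at_k(samples: list[list[str]], answers: list[str]) -> float:
    """Majority vote over k sampled answers per problem (consensus@k)."""
    correct = 0
    for problem_samples, answer in zip(samples, answers):
        majority, _count = Counter(problem_samples).most_common(1)[0]
        correct += exact_match(majority, answer)
    return correct / len(answers)
```

Because consensus@k only needs the majority of samples to land on the right integer, it can score well above pass@1 for the same model, which is why the two protocols should never be compared directly.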

Example Tasks

3 real examples from the benchmark.

#1

AIME 2024 I — Problem 1 (Easy tier)

Easy for competition

Problem / Input

Every morning Aya goes for a 9-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of s kilometers per hour, the walk takes her 4 hours, including t minutes spent in the coffee shop. When she walks s+2 kilometers per hour, the walk takes her 2 hours and 24 minutes, including t minutes spent in the coffee shop. Suppose Aya walks at s+1/2 kilometers per hour. Find the number of minutes the walk takes her, including the t minutes spent in the coffee shop.
Answer: 204

AIME answers are always integers in the range 000–999. Even problems that seem straightforward require careful setup to avoid algebra errors.
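A worked check of this problem, assuming the standard published statement of AIME 2024 I Problem 1 (9 km walked in 4 hours at s km/h and in 2 h 24 min at s+2 km/h, both including the same t-minute stop; asked for the total time at s + 1/2 km/h):

```python
import math

# Setup: 9/s + t/60 = 4  and  9/(s+2) + t/60 = 2.4
# Subtracting eliminates t:  9/s - 9/(s+2) = 1.6  =>  s^2 + 2s - 11.25 = 0
s = (-2 + math.sqrt(4 + 4 * 11.25)) / 2   # positive root: s = 2.5 km/h
t = (4 - 9 / s) * 60                       # coffee-shop stop: t = 24 minutes

# Asked: total minutes at s + 1/2 = 3 km/h, including the stop
answer = 9 / (s + 0.5) * 60 + t
print(round(answer))  # 204
```

The quadratic step is exactly where the "careful setup" warning bites: mixing hours and minutes, or forgetting that t appears in both timings, produces a plausible-looking wrong integer.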

#2

AIME 2024 — Combinatorics problem (Mid-tier)

Medium (competitive)

Problem / Input

Find the number of ways to place 8 non-attacking rooks on an 8×8 chessboard such that exactly 4 of them are in the main diagonal cells (cells where row = column).
Answer: 630

Representative of AIME combinatorics: requires combining multiple counting techniques (combinations, derangements) correctly.
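The counting argument (choose which 4 rooks sit on the main diagonal, then derange the remaining 4) can be verified by brute force, since non-attacking rook placements correspond to permutations. A sketch; variable names are illustrative:

```python
from itertools import permutations
from math import comb

# Non-attacking rooks on 8x8 <=> a permutation p of {0..7} (rook at (i, p[i]));
# a rook on the main diagonal <=> a fixed point p[i] == i.
brute = sum(
    1
    for p in permutations(range(8))
    if sum(p[i] == i for i in range(8)) == 4
)

# Closed form: choose the 4 fixed points, derange the other 4 (D_4 = 9).
formula = comb(8, 4) * 9

print(brute, formula)  # 630 630
```

C(8,4) = 70 choices of diagonal rooks times 9 derangements of the rest gives 630, matching the exhaustive count over all 8! = 40,320 permutations.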

#3

AIME 2025 — Number Theory (Hard tier)

Hard (top competitors)

Problem / Input

Find the sum of all positive integers n such that n divides the sum of all positive divisors of n².
Answer: Specific integer (depends on exact problem statement)

AIME number theory problems often require recognizing the structure by prime factorization and applying number-theoretic functions systematically.
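Since the exact statement is not reproduced here, the sketch below only illustrates the technique named above: computing the divisor-sum function sigma via prime factorization and scanning small n for the divisibility condition. It makes no claim about the actual competition answer.

```python
def sigma(n: int) -> int:
    """Sum of positive divisors of n, via prime factorization:
    sigma is multiplicative, and sigma(p^k) = (p^(k+1) - 1) / (p - 1)."""
    total, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            k = 0
            while n % p == 0:
                n //= p
                k += 1
            total *= (p ** (k + 1) - 1) // (p - 1)
        p += 1
    if n > 1:
        total *= n + 1   # one leftover prime factor remains
    return total

# Scan small n for the divisibility condition n | sigma(n^2)
hits = [n for n in range(1, 1000) if sigma(n * n) % n == 0]
```

Note that no prime n can qualify, since sigma(p^2) = 1 + p + p^2 is congruent to 1 mod p; observations like this, read off the prime factorization, are the usual entry point for this problem type.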

Leaderboard Results

Model scores sorted by performance.

10 results

#   Model            Score
1   o3               96.7% *
2   GPT-5            94.6% *
3   o4-mini          93.4% *
4   Gemini 3 Pro     ~93% *
5   Grok 4           ~88% *
6   Gemini 2.5 Pro   86.7% *
7   Opus 4.5         80.0% *
8   DeepSeek R1      79.8% *
9   Qwen 3 72B       ~72% *
10  GPT-4o           ~20% *

* Self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • The most dramatic capability demonstration in AI: GPT-3-era models at ~3% → o3 at 96.7% on AIME 2024. Extended reasoning (chain-of-thought within the model) is the key innovation that unlocked this.

  • AIME is uniquely contamination-resistant because new problems are created annually by competition writers who do not work with AI labs. The February release date gives a clean cutoff for evaluating post-cutoff performance.

  • The 10-20 point gap between AIME 2024 and AIME 2025 scores for the same model quantifies approximately how much 'contamination inflation' exists on older benchmarks.

  • AI models now score far above typical human competitor performance (top 5% of AMC scorers solve 4-6 of 15, or ~27-40%). Frontier reasoning models solve 13-15 of 15 routinely.

Controversies & Caveats

Known limitations and criticisms.

Year dependency: AIME 2024 scores are systematically ~10-20 points higher than AIME 2025 scores for the same model, confirming training contamination on older problems. 'AIME performance' is meaningless without specifying the year.

Tool-use inflation: o4-mini scores 99.5% on AIME 2025 with a Python interpreter. This is not comparable to closed-book reasoning and is often cited without the tool-use caveat.

Sample size: 30 problems per year means a 2-problem difference is a 6.7% swing — noisy for precise model comparisons. Some models report consensus@k (majority vote over 8+ samples) rather than pass@1.
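A quick sanity check on that noise claim, as a sketch: with only 30 binary-scored problems, the binomial standard error alone spans several points, on top of the 3.3-point per-problem granularity.

```python
import math

# Binomial standard error on a 30-problem exam: even a model with a
# stable "true" solve rate shows noticeable run-to-run score noise.
n = 30
for p in (0.70, 0.80, 0.90):
    se = 100 * math.sqrt(p * (1 - p) / n)
    print(f"true solve rate {p:.0%}: 1-sigma measurement noise ~{se:.1f} points")

# Granularity: each problem is worth 1/30 of the score.
print(f"one problem = {100 / n:.1f} points; two problems = {200 / n:.1f} points")
```

At a 90% true solve rate the one-sigma noise is about 5.5 points, so single-run differences of a few points between frontier models are well within measurement error.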

Company-reported scores often use majority voting over many samples; independent evaluations use pass@1. These can differ by 10+ percentage points for the same model.

Links