AIME
Annual invitational math competition whose freshly written problems each year make it a uniquely contamination-resistant AI benchmark.
What It Tests
AIME (American Invitational Mathematics Examination) is a real annual competition administered by the Mathematical Association of America since 1983. AI researchers adopted it as a benchmark because new problems are released each February, so a given year's exam cannot appear in the training data of any model whose cutoff precedes it. This gives AIME a rare property among benchmarks: it is refreshed every year with guaranteed-unseen problems.
Each AIME consists of 15 problems with integer answers in the range 000–999. Problems require combining multiple mathematical concepts (number theory, combinatorics, geometry, algebra) in non-obvious ways. Qualification is limited to top AMC scorers (roughly the top 2.5% on the AMC 10 and top 5% on the AMC 12), and even these successful competitors typically solve only 4–6 of the 15 problems.
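Because every answer is an integer from 000 to 999, automated grading reduces to extracting one integer from the model's output and checking exact match. Below is a minimal sketch of such a grader; the extraction heuristic and function names are illustrative assumptions, not any official harness.

```python
import re

def extract_answer(completion: str) -> int | None:
    """Heuristic: take the last 1-3 digit integer in the completion.

    Real harnesses usually prompt for a fixed format such as
    'Answer: NNN' so parsing is unambiguous; this fallback regex
    is an illustrative assumption.
    """
    matches = re.findall(r"\b\d{1,3}\b", completion)
    return int(matches[-1]) if matches else None

def grade(completion: str, gold: int) -> bool:
    """Exact match: each AIME problem has a unique integer answer."""
    predicted = extract_answer(completion)
    return predicted is not None and predicted == gold

# A completion ending in "...the answer is 073." is graded correct for 73.
assert grade("After simplifying, the answer is 073.", 73)
```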
The performance leap on AIME has been extraordinary: GPT-3-era models scored ~3% (near zero; random guessing over 1,000 possible answers would score only ~0.1%). o1 (December 2024) scored 74.3% on AIME 2024. o3 reached 96.7% on AIME 2024. By 2025, frontier models routinely exceed the performance of virtually all human competitors.
Critically, the same model typically scores 10-20 points lower on AIME 2025 than on AIME 2024, confirming some training contamination on older exams. This year-over-year delta is used to estimate contamination levels. AIME 2025 is the current clean reference, and fresh AIME problems each February will continue to provide contamination-resistant evaluation.
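The delta itself is simple to compute once a model has been run on both exam years. A toy sketch (the scores below are made-up placeholders, not reported results):

```python
# Hypothetical scores (percent correct) on a pre-cutoff vs. a fresh exam.
scores = {
    "model-a": {"aime_2024": 90.0, "aime_2025": 78.0},
    "model-b": {"aime_2024": 85.0, "aime_2025": 74.0},
}

def contamination_delta(old_score: float, new_score: float) -> float:
    """Points dropped from the older exam (possibly seen in training)
    to the fresh exam; a large positive delta suggests contamination."""
    return old_score - new_score

for name, s in scores.items():
    delta = contamination_delta(s["aime_2024"], s["aime_2025"])
    print(f"{name}: {delta:+.1f} points of possible contamination inflation")
```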
Task Anatomy
How a single task is structured.
Example Tasks
3 real examples from the benchmark.
AIME 2024 I — Problem 1 (Easy tier)
AIME answers are always 3-digit integers (000-999). Even problems that seem straightforward require careful setup to avoid algebra errors.
AIME 2024 — Combinatorics problem (Mid-tier)
Representative of AIME combinatorics: requires combining multiple counting techniques (combinations, derangements) correctly.
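To make that counting machinery concrete, here is a short sketch of the two techniques named above, combinations and derangements, applied to a made-up count (not the actual AIME problem):

```python
from math import comb, factorial

def derangements(n: int) -> int:
    """D(n): permutations of n items with no fixed point, via
    inclusion-exclusion: D(n) = n! * sum_{k=0}^{n} (-1)^k / k!."""
    return sum((-1) ** k * (factorial(n) // factorial(k)) for k in range(n + 1))

# Hypothetical count: permutations of 7 items with exactly 3 fixed points.
# Choose which 3 stay fixed, then derange the remaining 4.
count = comb(7, 3) * derangements(4)
print(count)  # 35 * 9 = 315
```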
AIME 2025 — Number Theory (Hard tier)
AIME number theory problems often require recognizing the structure by prime factorization and applying number-theoretic functions systematically.
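As a generic illustration of those tools (not a solution to any specific problem), the sketch below factors an integer by trial division and applies the divisor-count function tau(n) = (e1+1)...(ek+1) for n = p1^e1 * ... * pk^ek:

```python
def prime_factorization(n: int) -> dict[int, int]:
    """Map each prime factor of n to its exponent, by trial division."""
    factors: dict[int, int] = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:  # whatever remains is itself prime
        factors[n] = factors.get(n, 0) + 1
    return factors

def num_divisors(n: int) -> int:
    """tau(n) = (e1 + 1) * ... * (ek + 1) over the prime factorization."""
    result = 1
    for exponent in prime_factorization(n).values():
        result *= exponent + 1
    return result

print(prime_factorization(360))  # {2: 3, 3: 2, 5: 1}
print(num_divisors(360))         # (3+1)(2+1)(1+1) = 24
```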
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | o3 | 96.7%* |
| 2 | GPT-5 | 94.6%* |
| 3 | o4-mini | 93.4%* |
| 4 | Gemini 3 Pro | ~93%* |
| 5 | Grok 4 | ~88%* |
| 6 | Gemini 2.5 Pro | 86.7%* |
| 7 | Opus 4.5 | 80.0%* |
| 8 | DeepSeek R1 | 79.8%* |
| 9 | Qwen 3 72B | ~72%* |
| 10 | GPT-4o | ~20%* |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
One of the most dramatic capability demonstrations in AI: GPT-3-era models at ~3% → o3 at 96.7% on AIME 2024. Extended reasoning (long chain-of-thought generated at inference time) is the key innovation that unlocked this.
AIME is uniquely contamination-resistant because new problems are created annually by competition writers who do not work with AI labs. The February release date gives a clean cutoff for evaluating post-cutoff performance.
The 10-20 point gap between a model's AIME 2024 and AIME 2025 scores gives a rough quantitative estimate of the 'contamination inflation' baked into older benchmarks.
AI models now score far above typical human competitor performance: AIME qualifiers (already the top few percent of AMC scorers) typically solve 4-6 of 15 problems (~27-40%), while frontier reasoning models routinely solve 13-15 of 15.
Controversies & Caveats
Known limitations and criticisms.
Year dependency: a model's AIME 2024 score is systematically higher (often by 10-20 points) than its AIME 2025 score, confirming training contamination on older problems. 'AIME performance' is meaningless without specifying the year.
Tool-use inflation: o4-mini scores 99.5% on AIME 2025 with a Python interpreter. This is not comparable to closed-book reasoning and is often cited without the tool-use caveat.
Sample size: each year offers only 30 problems (15 each on AIME I and AIME II), so a 2-problem difference is a 6.7-point swing, which makes precise model comparisons noisy. Some models report consensus@k (majority vote over 8+ samples) rather than pass@1.
Company-reported scores often use majority voting over many samples, while independent evaluations use pass@1; these can differ by 10+ percentage points for the same model (see the sketch below).
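The sketch below shows why the two metrics diverge, scoring a single problem from k sampled answers (toy data, not real model outputs):

```python
from collections import Counter

def pass_at_1(samples: list[int], gold: int) -> float:
    """Expected single-attempt accuracy: the fraction of samples correct."""
    return sum(s == gold for s in samples) / len(samples)

def consensus_at_k(samples: list[int], gold: int) -> bool:
    """Majority vote over k samples: correct iff the modal answer is gold."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return majority_answer == gold

# Toy example: 5 of 8 samples are right, so pass@1 is 0.625,
# but the majority vote still lands on the correct answer.
samples, gold = [73, 73, 73, 12, 73, 40, 73, 12], 73
print(pass_at_1(samples, gold))       # 0.625
print(consensus_at_k(samples, gold))  # True
```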