MMLU / MMLU-Pro
Broad academic knowledge across 57 subjects — the standard knowledge benchmark.
What It Tests
MMLU (Massive Multitask Language Understanding) was created by Dan Hendrycks and colleagues at UC Berkeley and published at ICLR 2021. It tests the breadth of factual and reasoning knowledge across 57 academic subjects spanning STEM, the humanities, social sciences, law, medicine, and professional fields. Questions were sourced from real-world materials: academic textbooks, practice tests, licensing exams (such as the USMLE and bar exam), and online educational platforms.
With 14,042 multiple-choice questions across 57 subjects, MMLU was the first benchmark to systematically evaluate AI across the full breadth of human academic knowledge. When introduced, GPT-3 scored only 43.9% — barely above random chance (25%). The benchmark became the standard 'broad capability' evaluation for LLMs throughout 2021-2024.
However, MMLU is now effectively saturated at the frontier. Top models score 90-92%, exceeding estimated human expert performance (~89.8%). At the top of the leaderboard, a gap of 0.2 percentage points corresponds to fewer than 30 questions on the 14,042-question test. The benchmark also has known quality issues: roughly 6.5% of questions contain errors or ambiguities.
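As a quick sanity check on how little separates the leaders, the question-count equivalent of a score gap, and the binomial noise floor of a 14,042-question test, can be computed directly (a back-of-the-envelope sketch, not part of the official evaluation):

```python
import math

TOTAL_QUESTIONS = 14_042  # size of the MMLU test set

def gap_in_questions(gap_pp: float) -> int:
    """Number of questions that a score gap (in percentage points) represents."""
    return round(gap_pp / 100 * TOTAL_QUESTIONS)

def binomial_stderr_pp(accuracy: float, n: int = TOTAL_QUESTIONS) -> float:
    """Standard error of an accuracy estimate, in percentage points."""
    return 100 * math.sqrt(accuracy * (1 - accuracy) / n)

print(gap_in_questions(0.2))               # a 0.2 pp gap is about 28 questions
print(round(binomial_stderr_pp(0.91), 2))  # sampling noise near 91% accuracy
```

The standard error near 91% accuracy is roughly 0.24 percentage points, so the smallest gaps between frontier models sit close to the statistical noise of the test itself.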
MMLU-Pro, released at NeurIPS 2024, extends MMLU with 10 answer choices instead of 4 (reducing random chance from 25% to 10%), more reasoning-focused questions, and greater prompt-variation stability. Models drop 16-26 percentage points moving from MMLU to MMLU-Pro, providing renewed discrimination. However, MMLU-Pro is itself approaching saturation for frontier models, with top scores reaching 87-90%.
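One way to see why shrinking random chance from 25% to 10% matters: under a toy knowledge-plus-guessing model (an assumption for illustration, not how either benchmark is actually scored), a system that truly knows a fraction k of answers and guesses uniformly on the rest scores k + (1 - k)/n on an n-choice test:

```python
def observed_accuracy(k: float, n_choices: int) -> float:
    """Expected accuracy if a fraction k of answers are known and the rest are uniform guesses."""
    return k + (1 - k) / n_choices

def true_knowledge(observed: float, n_choices: int) -> float:
    """Invert the toy model: knowledge fraction implied by an observed accuracy."""
    return (observed - 1 / n_choices) * n_choices / (n_choices - 1)

k = true_knowledge(0.88, 4)   # 88% on 4-choice MMLU implies k = 0.84
print(round(k, 3))
print(round(observed_accuracy(k, 10), 3))  # same knowledge, 10 choices
```

Under this toy model, the format change alone would predict only a ~2-point drop (88% on four choices to about 85.6% on ten). The observed 16-26-point drops therefore mostly reflect MMLU-Pro's harder, more reasoning-focused questions, not just the reduced guessing bonus.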
Task Anatomy
How a single task is structured.
Example Tasks
3 real examples from the benchmark.
High School Mathematics
Problem / Input
Representative STEM question. Tests basic algebraic function evaluation.
Professional Medicine
Problem / Input
A 45-year-old man presents with progressive weakness of proximal muscles, difficulty rising from a chair, and a heliotrope rash around the eyes. The most likely diagnosis is:
A) Systemic lupus erythematosus
B) Dermatomyositis
C) Polymyalgia rheumatica
D) Rheumatoid arthritis
Medical knowledge requiring pattern recognition of a specific clinical presentation. These questions are sourced from USMLE-style practice exams.
International Law
Problem / Input
Legal knowledge requiring memorization of a specific treaty provision and its correct interpretation.
Leaderboard Results
Model scores sorted by performance.
12 results
| # | Model | Score* |
|---|---|---|
| 1 | Grok 4 | 92.7% |
| 2 | Opus 4.5 | 91.8% |
| 3 | GPT-5 | 91.4% |
| 4 | DeepSeek R1 | 90.8% |
| 5 | Gemini 2.5 Pro | ~90% |
| 6 | Qwen 3 72B | 88.7% |
| 7 | GPT-4o | 88.7% |
| 8 | DeepSeek V3 | 88.2% |
| 9 | Llama 4 Maverick | 87.8% |
| 10 | Mistral Large 3 | 87.1% |
| 11 | M2.5 | ~86.5% |
| 12 | Kimi K2.6 | ~86% |

\* Self-reported by each model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
The score progression from GPT-3 (43.9%) to GPT-5 (91.4%) is one of AI's clearest demonstrations of scaling: from barely above random chance to beyond estimated human expert performance in roughly five years.
MMLU-Pro's 10-answer format revealed that part of the apparent MMLU progress was inflated: models that drop 16-26 percentage points when moving to 10 choices were not as knowledgeable as their MMLU scores suggested.
The benchmark established that breadth of academic knowledge scales with model size, a finding that was counterintuitive in 2020, when knowledge acquisition through large-scale pretraining was not taken seriously.
Human expert performance estimate (~89.8%) is now exceeded by multiple frontier models, making human comparison no longer meaningful for this benchmark.
Variants & Related
MMLU-Pro
10 answer choices instead of 4; more reasoning-focused questions; greater stability under prompt variation. Models drop 16-26 percentage points from MMLU to MMLU-Pro. Top score: Gemini 3 Pro at ~89.8%.
Global-MMLU
Extended to 42 languages. Tests whether multilingual models maintain performance across non-English languages on the same academic knowledge tasks.
Controversies & Caveats
Known limitations and criticisms.
~6.5% of MMLU questions contain identifiable errors or ambiguities. The virology subject alone has 57% of questions with identified problems, setting a realistic ceiling around 93-94% rather than 100%.
The standard 5-shot evaluation is sensitive to which five in-context examples are selected; different implementations report score differences of ±2% for the same model.
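The 5-shot sensitivity comes down to how the prompt is assembled. Below is a minimal sketch of a prompt builder in the style of the original MMLU evaluation format (subject header, five solved exemplars from the dev set, then the unanswered test question); the example questions are placeholders, not real benchmark items, and details like the header wording vary across implementations:

```python
CHOICE_LETTERS = "ABCD"

def format_question(question, choices, answer=None):
    """Render one question with lettered choices; omit the answer for the test item."""
    lines = [question]
    lines += ["{}. {}".format(letter, text) for letter, text in zip(CHOICE_LETTERS, choices)]
    lines.append("Answer: {}".format(answer) if answer else "Answer:")
    return "\n".join(lines)

def build_5shot_prompt(subject, dev_examples, test_q):
    """Prepend five solved exemplars; which five are chosen shifts scores by roughly +/-2%."""
    header = "The following are multiple choice questions (with answers) about {}.\n".format(subject)
    shots = [format_question(ex["question"], ex["choices"], ex["answer"]) for ex in dev_examples[:5]]
    return header + "\n\n".join(shots + [format_question(test_q["question"], test_q["choices"])])

# Placeholder data, for illustration only.
dev = [{"question": "Placeholder question {}?".format(i),
        "choices": ["w", "x", "y", "z"], "answer": "A"} for i in range(5)]
test = {"question": "Placeholder test question?", "choices": ["p", "q", "r", "s"]}
print(build_5shot_prompt("high school mathematics", dev, test))
```

Because the model's answer is conditioned on all five exemplars, swapping them (or reordering the choices) changes the measured score, which is why published numbers for the same checkpoint rarely match exactly.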
Some subjects (global facts, miscellaneous) have contested 'correct' answers that reflect US-centric or culturally specific norms, disadvantaging models with different training distributions.
Severe contamination: with 14,042 questions publicly available since 2020, most training corpora contain MMLU questions and answers. High scores may reflect memorization.