Knowledge · Saturated

MMLU / MMLU-Pro

Broad academic knowledge across 57 subjects — the standard knowledge benchmark.

Tasks: 14,042
Year: 2020
Creators: Dan Hendrycks, Collin Burns, Steven Basart, et al.
Metric: Accuracy
Human Baseline: 89.8%
Random Chance: 25%

What It Tests

MMLU (Massive Multitask Language Understanding) was created by Dan Hendrycks and colleagues at UC Berkeley and published at ICLR 2021. It tests the breadth of factual and reasoning knowledge across 57 academic subjects spanning STEM, humanities, social sciences, law, medicine, and professional fields. Questions were drawn from real-world sources: academic textbooks, practice tests, licensing exams (the USMLE, the bar exam), and online educational platforms.

With 14,042 multiple-choice questions across 57 subjects, MMLU was the first benchmark to systematically evaluate AI across the full breadth of human academic knowledge. When introduced, GPT-3 scored only 43.9% — barely above random chance (25%). The benchmark became the standard 'broad capability' evaluation for LLMs throughout 2021-2024.

However, MMLU is now effectively saturated at the frontier. Top models score 90-92%, exceeding estimated human expert performance (~89.8%). Gaps of a few tenths of a point between adjacent models correspond to only a few dozen questions on the 14,042-question test (0.1 points ≈ 14 questions). The benchmark also has known quality issues: ~6.5% of questions contain errors or ambiguities.

MMLU-Pro, released at NeurIPS 2024, extends MMLU with 10 answer choices instead of 4 (reducing random chance from 25% to 10%), more reasoning-focused questions, and greater stability under prompt variation. Models drop 16-26 percentage points moving from MMLU to MMLU-Pro, which restored the benchmark's ability to discriminate between models. However, MMLU-Pro is itself approaching saturation at the frontier, with top scores reaching 87-90%.

Task Anatomy

How a single task is structured.

Input: A multiple-choice question with 4 options (MMLU) or 10 options (MMLU-Pro) from one of 57 academic subjects, evaluated with 5-shot prompting: 5 solved examples from the same subject are prepended.
Output: The model selects one answer choice.
Evaluation: Accuracy (% correct) under the standard 5-shot protocol. Random chance = 25% (MMLU) or 10% (MMLU-Pro).
Metric: Accuracy
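
To make the protocol concrete, here is a minimal Python sketch of how a 5-shot prompt is typically assembled and scored. The record layout and the `ask_model` callable are illustrative assumptions rather than any official harness; real implementations such as EleutherAI's lm-evaluation-harness differ in detail.

```python
# Minimal sketch of 5-shot MMLU-style prompting and accuracy scoring.
# The item layout ({"question", "choices", "answer"}) and `ask_model`
# are illustrative assumptions, not an official harness.

CHOICE_LABELS = "ABCDEFGHIJ"  # MMLU uses the first 4 labels, MMLU-Pro up to 10

def format_item(item: dict, with_answer: bool) -> str:
    """Render one question in the canonical 'question, labeled choices, Answer:' form."""
    lines = [item["question"]]
    for label, choice in zip(CHOICE_LABELS, item["choices"]):
        lines.append(f"{label}) {choice}")
    lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(shots: list[dict], test_item: dict) -> str:
    """Prepend 5 solved same-subject examples, then the unsolved test item."""
    blocks = [format_item(s, with_answer=True) for s in shots[:5]]
    blocks.append(format_item(test_item, with_answer=False))
    return "\n\n".join(blocks)

def accuracy(test_items: list[dict], shots: list[dict], ask_model) -> float:
    """Score = fraction of items where the model's letter matches the answer key."""
    correct = sum(
        ask_model(build_prompt(shots, item)).strip().upper().startswith(item["answer"])
        for item in test_items
    )
    return correct / len(test_items)
```

Under this protocol, uniform random guessing scores 25% on 4-choice MMLU and 10% on 10-choice MMLU-Pro, which is the baseline quoted above.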

Example Tasks

3 real examples from the benchmark.

#1

High School Mathematics

Medium

Problem / Input

If f(x) = x² - 3x + 2, what is f(5)?

A) 8
B) 10
C) 12
D) 16
Answer: C) 12

Representative STEM question testing basic algebraic function evaluation: f(5) = 5² - 3·5 + 2 = 25 - 15 + 2 = 12.

#2

Professional Medicine

Hard

Problem / Input

A 45-year-old man presents with progressive weakness of proximal muscles, difficulty rising from a chair, and a heliotrope rash around the eyes. The most likely diagnosis is:

A) Systemic lupus erythematosus
B) Dermatomyositis
C) Polymyalgia rheumatica
D) Rheumatoid arthritis
Answer: B) Dermatomyositis

Medical knowledge requiring pattern recognition of a specific clinical presentation. These questions are sourced from USMLE-style practice exams.

#3

International Law

Hard

Problem / Input

Under Article 38 of the Statute of the International Court of Justice, which of the following is NOT a primary source of international law?

A) International conventions
B) International custom
C) Judicial decisions
D) General principles of law
Answer: C) Judicial decisions

Legal knowledge requiring memorization of a specific treaty provision and its correct interpretation.

Leaderboard Results

Model scores sorted by performance.

#    Model               Score
1    Grok 4              92.7% *
2    Opus 4.5            91.8% *
3    GPT-5               91.4% *
4    DeepSeek R1         90.8% *
5    Gemini 2.5 Pro      ~90% *
6    Qwen 3 72B          88.7% *
7    GPT-4o              88.7% *
8    DeepSeek V3         88.2% *
9    Llama 4 Maverick    87.8% *
10   Mistral Large 3     87.1% *
11   M2.5                ~86.5% *
12   Kimi K2.6           ~86% *

* Self-reported by the model's creator, not independently verified.

Score Over Time

Performance progression across model generations.

Key Findings

  • Score progression from GPT-3 (43.9%) to GPT-5 (91.4%) over five years is one of AI's clearest demonstrations of scaling: from barely above random chance to beyond estimated human expert performance.

  • MMLU-Pro's 10-answer format revealed that headline MMLU gains were partly inflated: models that drop 16-26 percentage points when moving to 10 choices were not as knowledgeable as their MMLU scores suggested.

  • The benchmark established that breadth of academic knowledge scales with model size, a finding that was counterintuitive in 2020, when memorization at scale was not taken seriously.

  • The estimated human expert baseline (~89.8%) is now exceeded by multiple frontier models, so the human comparison is no longer meaningful for this benchmark.

Variants & Related

MMLU-Pro

12,000 tasks · Nearing Saturation

10 answer choices instead of 4; more reasoning-focused questions; greater stability under prompt variation. Models drop 16-26 percentage points from MMLU to MMLU-Pro. Top score: Gemini 3 Pro at ~89.8%.

Global-MMLU

420,000 tasks · Active

Extended to 42 languages. Tests whether multilingual models maintain performance across non-English languages on the same academic knowledge tasks.

Controversies & Caveats

Known limitations and criticisms.

~6.5% of MMLU questions contain identifiable errors or ambiguities. The virology subject alone has 57% of questions with identified problems, setting a realistic ceiling around 93-94% rather than 100%.

The standard 5-shot evaluation is sensitive to which 5 examples are selected; different implementations report scores that differ by ±2 percentage points for the same model.
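
This sensitivity is straightforward to measure: hold the model and test set fixed, resample the 5-shot examples, and observe the spread. A minimal sketch, assuming a `dev_pool` of candidate examples and an `evaluate(test_items, shots)` callable such as the accuracy routine sketched under Task Anatomy:

```python
import random
import statistics

def shot_sensitivity(dev_pool: list[dict], test_items: list[dict],
                     evaluate, n_trials: int = 10) -> tuple[float, float]:
    """Re-run the same evaluation with differently sampled 5-shot sets.

    `evaluate(test_items, shots)` is an assumed callable returning accuracy.
    The mean/stdev pair quantifies how much the score moves with prompt
    selection alone, the effect behind the ±2-point spread noted above.
    """
    scores = []
    for seed in range(n_trials):
        shots = random.Random(seed).sample(dev_pool, 5)  # fresh 5-shot set per trial
        scores.append(evaluate(test_items, shots))
    return statistics.mean(scores), statistics.stdev(scores)
```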

Some subjects (global facts, miscellaneous) have contested 'correct' answers that reflect US-centric or culturally specific norms, disadvantaging models with different training distributions.

Severe contamination: with 14,042 questions publicly available since 2020, most training corpora contain MMLU questions and answers. High scores may reflect memorization.
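
There is no official contamination test for MMLU, but a common heuristic in model reports is checking for verbatim n-gram overlap between benchmark items and the training corpus. A minimal sketch, where the 13-gram window and the precomputed corpus index are illustrative assumptions, not MMLU's procedure:

```python
def ngrams(tokens: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}

def looks_contaminated(question_text: str,
                       corpus_index: set[tuple[str, ...]],
                       n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams appears verbatim in a
    precomputed index of training-corpus n-grams (a heuristic, not proof)."""
    return bool(ngrams(question_text.lower().split(), n) & corpus_index)
```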
