GPQA Diamond
PhD-level science questions so hard that even experts with Google still struggle.
What It Tests
GPQA (Graduate-Level Google-Proof Q&A) was designed for a specific research problem: scalable oversight. If AI systems eventually exceed human expertise, how can human overseers verify their outputs? GPQA creates exactly the experimental condition needed — questions that domain experts can answer but non-experts cannot verify even with 30+ minutes of unrestricted internet access.
Created by David Rein and colleagues at NYU and Anthropic and published at COLM 2024, GPQA comprises 546 expert-written multiple-choice questions in biology, physics, and chemistry, of which 448 survive the paper's validation filtering. The 'Google-proof' designation is empirically backed: non-expert validators (PhD holders or students in other fields) searching the web for 30+ minutes per question averaged only 34% accuracy (random chance is 25%).
The Diamond subset (198 questions) is the most carefully filtered: a question qualifies only if both independent expert validators answered it correctly AND a majority of the non-expert validators answered it incorrectly. These 198 questions represent the benchmark's gold standard.
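A minimal sketch of pulling the Diamond subset with the Hugging Face `datasets` library. The dataset ID and config name below refer to the commonly used community release and are assumptions to verify against the dataset card; access is gated.

```python
from datasets import load_dataset

# Assumed ID/config of the community release on Hugging Face (gated access);
# verify both against the dataset card before relying on them.
diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

print(len(diamond))  # expected: 198, the Diamond subset described above
```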
The performance jump on GPQA has been dramatic. GPT-4 scored 38.8% in 2023, barely above non-expert web-searching humans. Extended-reasoning models (o1, o3) broke through expert-level performance in 2024-2025, and Gemini 3 Pro now scores 94.3%, far above the original expert-validator accuracy of 65-81%. GPQA Diamond remains one of the best frontier-model discriminators, with current scores spanning roughly 35% for weaker models to roughly 94% at the frontier.
Task Anatomy
How a single task is structured.
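Since no rendered example appears under this heading, here is an illustrative sketch (hypothetical field names, not the dataset's actual schema) of how a single task is typically represented and turned into a four-option prompt for evaluation:

```python
from dataclasses import dataclass
import random

@dataclass
class GPQATask:
    # Hypothetical schema for one multiple-choice task; the released dataset
    # uses its own column names (inspect them before relying on any).
    domain: str             # "biology", "physics", or "chemistry"
    question: str           # expert-written question stem
    correct: str            # the keyed answer
    distractors: list[str]  # the three expert-written incorrect options

def to_prompt(task: GPQATask, rng: random.Random) -> tuple[str, str]:
    """Shuffle the four options into A-D and return (prompt text, correct letter)."""
    options = [task.correct] + list(task.distractors)
    rng.shuffle(options)
    letters = "ABCD"
    lines = [task.question] + [f"{letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines), letters[options.index(task.correct)]

# Toy usage with placeholder text (not a real GPQA item):
task = GPQATask("chemistry", "(question stem)", "right answer",
                ["distractor 1", "distractor 2", "distractor 3"])
prompt, key = to_prompt(task, random.Random(0))
```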
Example Tasks
3 real examples from the benchmark.
Organic Chemistry — Diels-Alder reaction
Problem / Input
Requires understanding Diels-Alder reactions, fluxional isomerism, and systematic counting of structural isomers across 4 diene isomers × 4 reaction geometries. This is why non-experts with web access average only 34%: the question requires integrating multiple advanced chemistry concepts.
General Chemistry — Industrial catalysis
Problem / Input
A reaction of a liquid organic compound (molecules consisting only of C and H) is performed at 80°C and 20 bar for 24 hours. In the proton NMR spectrum, the signals with the highest chemical shift of the reactant are replaced by a signal about 3-4 units downfield in the product.
Compounds from which position in the periodic table (also used in the corresponding large-scale industrial process) were most likely initially added in small amounts?
A) A metal compound from the fifth period.
B) A metal compound from the fifth period and a non-metal compound from the third period.
C) A metal compound from the fourth period.
D) A metal compound from the fourth period and a non-metal compound from the second period.
Requires recognizing the reaction type from the NMR data and reaction conditions, then mapping it to the industrial catalyst system (Co = period-4 metal, CO = period-2 non-metal, as in the oxo synthesis).
Quantum Mechanics — Energy uncertainty
Problem / Input
Requires applying the energy-time uncertainty principle quantitatively, not just qualitatively; the calculation only comes out right with the correct form of the Heisenberg relation (sketched below).
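For reference, the relation at issue, with an illustrative number rather than the question's actual values: a state with lifetime $\tau$ has an energy width set by

$$
\Delta E \,\Delta t \;\gtrsim\; \frac{\hbar}{2}
\qquad\Rightarrow\qquad
\Delta E \;\sim\; \frac{\hbar}{\tau},
$$

so, for example, $\tau = 10^{-9}\,\mathrm{s}$ gives $\Delta E \sim 1.05\times10^{-34}\,\mathrm{J\,s}/10^{-9}\,\mathrm{s} \approx 1.1\times10^{-25}\,\mathrm{J} \approx 6.6\times10^{-7}\,\mathrm{eV}$; resolving two such states requires an energy separation larger than this width.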
Leaderboard Results
Model scores sorted by performance.
11 results
| # | Model | Score |
|---|---|---|
| 1 | Gemini 3 Pro | 94.3% |
| 2 | Opus 4.6 | ~88%* |
| 3 | o3 | 87.7%* |
| 4 | Grok 4 | ~85%* |
| 5 | Gemini 2.5 Pro | ~83%* |
| 6 | o4-mini | 81.4%* |
| 7 | GPT-5 | ~75%* |
| 8 | DeepSeek R1 | ~71%* |
| 9 | Sonnet 4 | ~68%* |
| 10 | Llama 4 Maverick | ~60%* |
| 11 | GPT-4o | ~53%* |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
One of the most dramatic progress stories in AI evaluation: GPT-4 scored 38.8% in 2023 (barely above the 34% non-expert web-search baseline), while o3 scored 87.7% in 2025. Frontier models went from sub-expert to expert-surpassing in roughly two years.
GPQA Diamond is currently the best frontier-model discriminator among science benchmarks, with roughly 56 percentage points of spread between GPT-4-era models (~38%) and Gemini 3 Pro (~94%).
Extended reasoning is essential: o1-preview jumped roughly 24 percentage points over GPT-4o by using internal chain-of-thought, demonstrating that the bottleneck was the reasoning process, not knowledge.
The benchmark validates the scalable oversight concern it was designed to study: if AI systems score 94% on questions non-experts can't verify, human oversight of AI science work becomes fundamentally difficult.
Controversies & Caveats
Known limitations and criticisms.
The dataset is password-protected to limit training-data contamination (password: 'deserted-untie-orchid'), but it is published openly enough that contamination is possible for models trained after its public release in late 2023.
With only 198 questions, a 5-question difference corresponds to roughly 2.5 percentage points, so confidence intervals are wide and model rankings can shift with small measurement noise (see the rough estimate at the end of this section).
Chemistry is overrepresented (~36% organic chemistry) — this may favor models with disproportionate chemistry training data.
Expert accuracy figures (65-81%) are for PhD-level validators working in their own subfield with no time pressure. AI systems now clearly surpass this, but the comparison is imperfect since human experts had no computational assistance.
Anthropic contributed to the benchmark's creation, raising concerns about whether the benchmark might favor Claude models — though empirical results show Gemini leading.
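A back-of-envelope illustration of the sample-size caveat above (a normal-approximation estimate, not an official error analysis): with n = 198, a score around 85% carries a 95% confidence interval of roughly ±5 percentage points.

```python
import math

def ci_halfwidth(p: float, n: int = 198, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for accuracy p on n questions."""
    return z * math.sqrt(p * (1 - p) / n)

for p in (0.60, 0.75, 0.85, 0.94):
    print(f"accuracy {p:.0%}: +/- {ci_halfwidth(p):.1%}")
# accuracy 60%: +/- 6.8%   accuracy 75%: +/- 6.0%
# accuracy 85%: +/- 5.0%   accuracy 94%: +/- 3.3%
```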