GPQA Diamond
PhD-level science questions so hard that even experts with Google still struggle.
What It Tests
GPQA (Graduate-Level Google-Proof Q&A) was designed for a specific research problem: scalable oversight. If AI systems eventually exceed human expertise, how can human overseers verify their outputs? GPQA creates exactly the experimental condition needed — questions that domain experts can answer but non-experts cannot verify even with 30+ minutes of unrestricted internet access.
Created by David Rein and colleagues at NYU and Anthropic and published at COLM 2024, GPQA comprises 546 expert-written multiple-choice questions in biology, physics, and chemistry, of which 448 survive the paper's validation filtering. The 'Google-proof' designation is empirically backed: non-expert validators (PhD holders or students in other fields) searching the web for 30+ minutes per question averaged only 34% accuracy (random chance is 25%).
The Diamond subset (198 questions) is the most carefully filtered: a question qualifies only if both independent expert validators answered it correctly AND a majority of the non-expert validators answered it incorrectly. These 198 questions represent the benchmark's gold standard.
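A minimal sketch of pulling the Diamond subset with the Hugging Face `datasets` library. The dataset ID and config name below refer to the commonly used community release and are assumptions to verify against the dataset card; access is gated.

```python
from datasets import load_dataset

# Assumed ID/config of the community release on Hugging Face (gated access);
# verify both against the dataset card before relying on them.
diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

print(len(diamond))  # expected: 198, the Diamond subset described above
```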
The performance jump on GPQA has been dramatic. GPT-4 scored 38.8% in 2023, barely above non-expert web-searching humans. Extended-reasoning models (o1, o3) broke through expert-level performance in 2024-2025, and Gemini 3 Pro now scores 94.3%, far above the original expert-validator accuracy of 65-81%. GPQA Diamond remains one of the best frontier-model discriminators, with current scores spanning roughly 35% for weaker models to roughly 94% at the frontier.
Task Anatomy
How a single task is structured.
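Since no rendered example appears under this heading, here is an illustrative sketch (hypothetical field names, not the dataset's actual schema) of how a single task is typically represented and turned into a four-option prompt for evaluation:

```python
from dataclasses import dataclass
import random

@dataclass
class GPQATask:
    # Hypothetical schema for one multiple-choice task; the released dataset
    # uses its own column names (inspect them before relying on any).
    domain: str             # "biology", "physics", or "chemistry"
    question: str           # expert-written question stem
    correct: str            # the keyed answer
    distractors: list[str]  # the three expert-written incorrect options

def to_prompt(task: GPQATask, rng: random.Random) -> tuple[str, str]:
    """Shuffle the four options into A-D and return (prompt text, correct letter)."""
    options = [task.correct] + list(task.distractors)
    rng.shuffle(options)
    letters = "ABCD"
    lines = [task.question] + [f"{letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines), letters[options.index(task.correct)]

# Toy usage with placeholder text (not a real GPQA item):
task = GPQATask("chemistry", "(question stem)", "right answer",
                ["distractor 1", "distractor 2", "distractor 3"])
prompt, key = to_prompt(task, random.Random(0))
```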
Example Tasks
3 real examples from the benchmark.
Organic Chemistry — Diels-Alder reaction
Problem / Input
Requires understanding Diels-Alder reactions, fluxional isomerism, and systematic counting of structural isomers across 4 diene isomers × 4 reaction geometries. This is why non-experts with web access average only 34%: the question requires integrating multiple advanced chemistry concepts.
General Chemistry — Industrial catalysis
Problem / Input
A reaction of a liquid organic compound (molecules consisting only of C and H) is performed at 80°C and 20 bar for 24 hours. In the proton NMR spectrum, the signals with the highest chemical shift of the reactant are replaced by a signal about 3-4 units downfield in the product.
Compounds from which position in the periodic table (also used in the corresponding large-scale industrial process) were most likely initially added in small amounts?
A) A metal compound from the fifth period.
B) A metal compound from the fifth period and a non-metal compound from the third period.
C) A metal compound from the fourth period.
D) A metal compound from the fourth period and a non-metal compound from the second period.
Requires recognizing the reaction type from the NMR data and reaction conditions, then mapping it to the industrial catalyst system (Co = period-4 metal, CO = period-2 non-metal, as in the oxo synthesis).
Quantum Mechanics — Energy uncertainty
Problem / Input
Requires applying the energy-time uncertainty principle quantitatively, not just qualitatively; the calculation only comes out right with the correct form of the Heisenberg relation (sketched below).
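For reference, the relation at issue, with an illustrative number rather than the question's actual values: a state with lifetime $\tau$ has an energy width set by

$$
\Delta E \,\Delta t \;\gtrsim\; \frac{\hbar}{2}
\qquad\Rightarrow\qquad
\Delta E \;\sim\; \frac{\hbar}{\tau},
$$

so, for example, $\tau = 10^{-9}\,\mathrm{s}$ gives $\Delta E \sim 1.05\times10^{-34}\,\mathrm{J\,s}/10^{-9}\,\mathrm{s} \approx 1.1\times10^{-25}\,\mathrm{J} \approx 6.6\times10^{-7}\,\mathrm{eV}$; resolving two such states requires an energy separation larger than this width.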
Leaderboard Results
Model scores sorted by performance.
11 results
| # | Model | Score |
|---|---|---|
| 1 | Gemini 3 Pro | 94.3% |
| 2 | Opus 4.6 | ~88%* |
| 3 | o3 | 87.7%* |
| 4 | Grok 4 | ~85%* |
| 5 | Gemini 2.5 Pro | ~83%* |
| 6 | o4-mini | 81.4%* |
| 7 | GPT-5 | ~75%* |
| 8 | DeepSeek R1 | ~71%* |
| 9 | Sonnet 4 | ~68%* |
| 10 | Llama 4 Maverick | ~60%* |
| 11 | GPT-4o | ~53%* |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
One of the most dramatic progress stories in AI evaluation: GPT-4 scored 38.8% in 2023 (barely above the 34% non-expert web-search baseline), while o3 scored 87.7% in 2025. Frontier models went from sub-expert to expert-surpassing in roughly two years.
GPQA Diamond is currently the best frontier-model discriminator among science benchmarks, with roughly 56 percentage points of spread between GPT-4-era models (~38%) and Gemini 3 Pro (~94%).
Extended reasoning is essential: o1-preview jumped roughly 24 percentage points over GPT-4o by using internal chain-of-thought, demonstrating that the bottleneck was the reasoning process, not knowledge.
The benchmark validates the scalable oversight concern it was designed to study: if AI systems score 94% on questions non-experts can't verify, human oversight of AI science work becomes fundamentally difficult.
Controversies & Caveats
Known limitations and criticisms.
The dataset is password-protected to limit training-data contamination (password: 'deserted-untie-orchid'), but it is published openly enough that contamination is possible for models trained after its public release in late 2023.
With only 198 questions, a 5-question difference corresponds to roughly 2.5 percentage points, so confidence intervals are wide and model rankings can shift with small measurement noise (see the rough estimate at the end of this section).
Chemistry is overrepresented (~36% organic chemistry) — this may favor models with disproportionate chemistry training data.
Expert accuracy figures (65-81%) are for PhD-level validators working in their own subfield with no time pressure. AI systems now clearly surpass this, but the comparison is imperfect since human experts had no computational assistance.
Anthropic contributed to the benchmark's creation, raising concerns about whether the benchmark might favor Claude models — though empirical results show Gemini leading.
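A back-of-envelope illustration of the sample-size caveat above (a normal-approximation estimate, not an official error analysis): with n = 198, a score around 85% carries a 95% confidence interval of roughly ±5 percentage points.

```python
import math

def ci_halfwidth(p: float, n: int = 198, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for accuracy p on n questions."""
    return z * math.sqrt(p * (1 - p) / n)

for p in (0.60, 0.75, 0.85, 0.94):
    print(f"accuracy {p:.0%}: +/- {ci_halfwidth(p):.1%}")
# accuracy 60%: +/- 6.8%   accuracy 75%: +/- 6.0%
# accuracy 85%: +/- 5.0%   accuracy 94%: +/- 3.3%
```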