BIG-Bench Hard
23 hard reasoning tasks where chain-of-thought prompting is required to exceed average human-rater performance.
What It Tests
BIG-Bench Hard (BBH) is a curated subset of the larger BIG-Bench collaborative benchmark (200+ tasks from 100+ organizations). The BBH authors identified the 23 tasks from BIG-Bench where no language model exceeded average human performance even with few-shot prompting — selecting tasks that specifically required complex reasoning rather than knowledge retrieval.
Created by Mirac Suzgun and colleagues at Google and published in Findings of ACL 2023, BBH's central finding was that chain-of-thought (CoT) prompting specifically unlocked performance on these tasks. Without CoT, even large models scored near-randomly; with 3-shot CoT, Codex exceeded the average human-rater score on 17 of 23 tasks and PaLM on 10 of 23. This demonstrated empirically that CoT is not a minor prompting trick but a fundamental capability enabler.
BBH covers 23 diverse tasks (27 subtasks, since logical deduction and tracking shuffled objects each come in 3-, 5-, and 7-object variants): Boolean expressions, causal judgment, date understanding, disambiguation QA, Dyck language completion, formal fallacies, geometric shapes, hyperbaton, logical deduction, movie recommendation, multi-step arithmetic, navigation, object counting, penguins in a table, reasoning about colored objects, ruin names, salient translation error detection, snarks, sports understanding, temporal sequences, tracking shuffled objects, web of lies, and word sorting.
BBH is now approaching saturation for frontier models, with top scores at 87-93%. Google DeepMind released BIG-Bench Extra Hard (BBEH) in February 2025 specifically because frontier models had saturated BBH. BBH remains useful for evaluating mid-tier models and as a regression test.
Task Anatomy
How a single task is structured.
Example Tasks
4 real examples from the benchmark.
Boolean Expressions
Problem / Input
Without CoT, models struggle to track the nested negation correctly. With step-by-step reasoning, performance jumps significantly.
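The stepwise reduction that CoT elicits can be sketched mechanically. A minimal Python sketch, using a hypothetical expression in the task's documented format (nested `not`/`and`/`or` over `True`/`False`), not an item drawn from BBH itself:

```python
import re

def step_reduce(expr: str) -> list[str]:
    """Evaluate a Boolean-Expressions-style input one innermost
    parenthesized group at a time, recording each rewrite --
    a mechanical analogue of a chain-of-thought trace."""
    steps = [expr]
    # Matches a (...) group containing no nested parentheses.
    innermost = re.compile(r"\(([^()]*)\)")
    while True:
        m = innermost.search(expr)
        if m is None:
            break
        value = eval(m.group(1))  # sub-expression is valid Python syntax
        expr = expr[:m.start()] + str(value) + expr[m.end():]
        steps.append(expr)
    steps.append(str(eval(expr)))  # final reduction to True/False
    return steps

# "not ( True and not ( False ) )" reduces:
# -> "not ( True and not False )" -> "not True" -> "False"
```

Each list entry corresponds to one line a CoT answer would write out; without those intermediate rewrites, the model must resolve all the nesting in one step.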
Web of Lies
Problem / Input
Requires tracking truth values through a 6-step chain. Each step flips or preserves the truth status based on whether the speaker is reliable. Without CoT, models lose track of the chain.
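The flip-or-preserve propagation can be made explicit in a few lines. A sketch with invented speaker names (the item format is the benchmark's documented style; the specific chain is hypothetical):

```python
def resolve_chain(first_truthful: bool, claims: list[str]) -> bool:
    """Propagate truth values down a Web-of-Lies chain.

    Each claim is "lies" or "truth": what speaker i asserts about
    speaker i-1. A speaker is truthful iff their assertion is correct.
    Returns whether the final speaker in the chain tells the truth."""
    truthful = first_truthful
    for claim in claims:
        asserts_liar = (claim == "lies")   # speaker says: previous person lies
        actually_lies = not truthful       # whether the previous person lies
        truthful = (asserts_liar == actually_lies)
    return truthful

# Hypothetical instance: Fidel tells the truth. Jerry says Fidel lies.
# Vina says Jerry lies. Millie says Vina lies. Does Millie tell the truth?
# resolve_chain(True, ["lies", "lies", "lies"]) -> Jerry lies, Vina is
# truthful, Millie lies.
```

The one boolean of state per step is trivial for code; the failure mode in models is losing that single bit across six sentences of natural language.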
Tracking Shuffled Objects (3 objects)
Problem / Input
The 3-object version is manageable. BBH also has 5-object and 7-object variants that are significantly harder. Models without CoT often fail to track the state correctly across multiple swaps.
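The state tracking that CoT forces a model to write out is a simple swap simulation. A sketch over a hypothetical 3-object instance (names and objects are illustrative, in the task's documented format):

```python
def track_swaps(initial: dict[str, str],
                swaps: list[tuple[str, str]]) -> dict[str, str]:
    """Apply pairwise trades in order and return who holds what --
    the explicit intermediate state a CoT answer enumerates."""
    holding = dict(initial)
    for a, b in swaps:
        holding[a], holding[b] = holding[b], holding[a]
    return holding

# Hypothetical instance: Alice, Bob, and Claire start with a red, green,
# and blue ball; Alice and Bob trade, then Bob and Claire, then Alice and Bob.
start = {"Alice": "red", "Bob": "green", "Claire": "blue"}
trades = [("Alice", "Bob"), ("Bob", "Claire"), ("Alice", "Bob")]
```

The 5- and 7-object variants change nothing in the code, only the number of entities whose state must be carried across steps, which is precisely where models without CoT degrade.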
Logical Deduction (5 objects)
Problem / Input
5-object deduction requires satisfying five ordering constraints jointly. Without CoT, LLMs fail on these items because they cannot hold and cross-check all the constraints at once.
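For code, the joint constraint satisfaction is a trivial brute force over 5! = 120 orderings. A sketch with a hypothetical five-bird instance (the objects and constraints are invented for illustration, in the task's documented style):

```python
from itertools import permutations

def deduce(objects: list[str], constraints) -> tuple[str, ...]:
    """Return the unique left-to-right ordering satisfying every
    constraint; BBH items are constructed to have exactly one answer."""
    solutions = [order for order in permutations(objects)
                 if all(check(order) for check in constraints)]
    assert len(solutions) == 1
    return solutions[0]

# Hypothetical instance: five birds on a branch.
birds = ["robin", "hawk", "owl", "crow", "finch"]
constraints = [
    lambda o: o.index("robin") < o.index("hawk"),   # robin left of hawk
    lambda o: o.index("owl") == 4,                  # owl is rightmost
    lambda o: o.index("crow") == 0,                 # crow is leftmost
    lambda o: o.index("finch") < o.index("robin"),  # finch left of robin
    lambda o: o.index("hawk") < o.index("owl"),     # hawk left of owl
]
```

The exhaustive search is exactly the cross-checking of constraints that a CoT trace performs in prose and that models skip when forced to answer directly.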
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | Opus 4.5 | 93.1% |
| 2 | Sonnet 4 | 93.1% |
| 3 | GPT-5 | ~91%* |
| 4 | Gemini 2.5 Pro | 89.2% |
| 5 | Qwen 3 72B | ~88%* |
| 6 | DeepSeek V3 | ~87%* |
| 7 | Llama 4 Maverick | ~85%* |
| 8 | GPT-4o | ~83% |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
BBH proved chain-of-thought is not a minor prompting improvement but a fundamental capability unlock: Codex went from near-random to beating human performance on 17/23 tasks with 3-shot CoT.
The benchmark identified 'reasoning' as an emergent capability: performance was essentially flat across model sizes without CoT, then jumped sharply at scale with CoT.
BBH helped establish the paradigm that complex reasoning requires explicitly 'showing work' — influencing the design of o1/o3 extended reasoning models.
Google DeepMind's BBEH release is itself a finding: frontier models had saturated BBH so completely that a new harder benchmark was required within 2 years of BBH's introduction.
Variants & Related
BIG-Bench Extra Hard (BBEH)
Successor to BBH released by Google DeepMind in February 2025. Harder variants of each BBH task designed specifically because frontier models saturated the original. Still active benchmark.
Controversies & Caveats
Known limitations and criticisms.
The original CoT prompts for 3 BBH tasks (Multistep Arithmetic, Navigate, Web of Lies) contained logical errors that were later corrected. Some published results used the erroneous prompts.
BBH tasks are now well-known and documented online, meaning models may have seen CoT reasoning examples for these specific tasks during training.
Human baseline of ~50% represents average crowd workers on MTurk — not expert performance. Expert humans would score much higher on many tasks.