BIG-Bench Hard
23 hard reasoning tasks where chain-of-thought prompting is required to exceed average human-rater performance.
What It Tests
BIG-Bench Hard (BBH) is a curated subset of the larger BIG-Bench collaborative benchmark (200+ tasks from 100+ organizations). The BBH authors identified the 23 tasks from BIG-Bench where no language model exceeded average human performance even with few-shot prompting — selecting tasks that specifically required complex reasoning rather than knowledge retrieval.
Created by Mirac Suzgun and colleagues at Google and published in Findings of ACL 2023, BBH's central finding was that chain-of-thought (CoT) prompting specifically unlocked performance on these tasks. Without CoT, even large models scored near-randomly; with 3-shot CoT, Codex exceeded the average human-rater score on 17 of 23 tasks and PaLM on 10 of 23. This demonstrated empirically that CoT is not a minor prompting trick but a fundamental capability enabler.
BBH covers 23 diverse tasks (27 subtasks, since logical deduction and tracking shuffled objects each come in 3-, 5-, and 7-object variants): Boolean expressions, causal judgment, date understanding, disambiguation QA, Dyck language completion, formal fallacies, geometric shapes, hyperbaton, logical deduction, movie recommendation, multi-step arithmetic, navigation, object counting, penguins in a table, reasoning about colored objects, ruin names, salient translation error detection, snarks, sports understanding, temporal sequences, tracking shuffled objects, web of lies, and word sorting.
BBH is now approaching saturation for frontier models, with top scores at 87-93%. Google DeepMind released BIG-Bench Extra Hard (BBEH) in February 2025 specifically because frontier models had saturated BBH. BBH remains useful for evaluating mid-tier models and as a regression test.
Task Anatomy
How a single task is structured.
Example Tasks
4 real examples from the benchmark.
Boolean Expressions
Problem / Input
Without CoT, models struggle to track the nested negation correctly. With step-by-step reasoning, performance jumps significantly.
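The stepwise reduction that CoT elicits can be sketched mechanically. A minimal Python sketch, using a hypothetical expression in the task's documented format (nested `not`/`and`/`or` over `True`/`False`), not an item drawn from BBH itself:

```python
import re

def step_reduce(expr: str) -> list[str]:
    """Evaluate a Boolean-Expressions-style input one innermost
    parenthesized group at a time, recording each rewrite --
    a mechanical analogue of a chain-of-thought trace."""
    steps = [expr]
    # Matches a (...) group containing no nested parentheses.
    innermost = re.compile(r"\(([^()]*)\)")
    while True:
        m = innermost.search(expr)
        if m is None:
            break
        value = eval(m.group(1))  # sub-expression is valid Python syntax
        expr = expr[:m.start()] + str(value) + expr[m.end():]
        steps.append(expr)
    steps.append(str(eval(expr)))  # final reduction to True/False
    return steps

# "not ( True and not ( False ) )" reduces:
# -> "not ( True and not False )" -> "not True" -> "False"
```

Each list entry corresponds to one line a CoT answer would write out; without those intermediate rewrites, the model must resolve all the nesting in one step.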
Web of Lies
Problem / Input
Requires tracking truth values through a 6-step chain. Each step flips or preserves the truth status based on whether the speaker is reliable. Without CoT, models lose track of the chain.
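The flip-or-preserve propagation can be made explicit in a few lines. A sketch with invented speaker names (the item format is the benchmark's documented style; the specific chain is hypothetical):

```python
def resolve_chain(first_truthful: bool, claims: list[str]) -> bool:
    """Propagate truth values down a Web-of-Lies chain.

    Each claim is "lies" or "truth": what speaker i asserts about
    speaker i-1. A speaker is truthful iff their assertion is correct.
    Returns whether the final speaker in the chain tells the truth."""
    truthful = first_truthful
    for claim in claims:
        asserts_liar = (claim == "lies")   # speaker says: previous person lies
        actually_lies = not truthful       # whether the previous person lies
        truthful = (asserts_liar == actually_lies)
    return truthful

# Hypothetical instance: Fidel tells the truth. Jerry says Fidel lies.
# Vina says Jerry lies. Millie says Vina lies. Does Millie tell the truth?
# resolve_chain(True, ["lies", "lies", "lies"]) -> Jerry lies, Vina is
# truthful, Millie lies.
```

The one boolean of state per step is trivial for code; the failure mode in models is losing that single bit across six sentences of natural language.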
Tracking Shuffled Objects (3 objects)
Problem / Input
The 3-object version is manageable. BBH also has 5-object and 7-object variants that are significantly harder. Models without CoT often fail to track the state correctly across multiple swaps.
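The state tracking that CoT forces a model to write out is a simple swap simulation. A sketch over a hypothetical 3-object instance (names and objects are illustrative, in the task's documented format):

```python
def track_swaps(initial: dict[str, str],
                swaps: list[tuple[str, str]]) -> dict[str, str]:
    """Apply pairwise trades in order and return who holds what --
    the explicit intermediate state a CoT answer enumerates."""
    holding = dict(initial)
    for a, b in swaps:
        holding[a], holding[b] = holding[b], holding[a]
    return holding

# Hypothetical instance: Alice, Bob, and Claire start with a red, green,
# and blue ball; Alice and Bob trade, then Bob and Claire, then Alice and Bob.
start = {"Alice": "red", "Bob": "green", "Claire": "blue"}
trades = [("Alice", "Bob"), ("Bob", "Claire"), ("Alice", "Bob")]
```

The 5- and 7-object variants change nothing in the code, only the number of entities whose state must be carried across steps, which is precisely where models without CoT degrade.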
Logical Deduction (5 objects)
Problem / Input
5-object deduction requires satisfying five ordering constraints jointly. Without CoT, LLMs fail on these items because they cannot hold and cross-check all the constraints at once.
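For code, the joint constraint satisfaction is a trivial brute force over 5! = 120 orderings. A sketch with a hypothetical five-bird instance (the objects and constraints are invented for illustration, in the task's documented style):

```python
from itertools import permutations

def deduce(objects: list[str], constraints) -> tuple[str, ...]:
    """Return the unique left-to-right ordering satisfying every
    constraint; BBH items are constructed to have exactly one answer."""
    solutions = [order for order in permutations(objects)
                 if all(check(order) for check in constraints)]
    assert len(solutions) == 1
    return solutions[0]

# Hypothetical instance: five birds on a branch.
birds = ["robin", "hawk", "owl", "crow", "finch"]
constraints = [
    lambda o: o.index("robin") < o.index("hawk"),   # robin left of hawk
    lambda o: o.index("owl") == 4,                  # owl is rightmost
    lambda o: o.index("crow") == 0,                 # crow is leftmost
    lambda o: o.index("finch") < o.index("robin"),  # finch left of robin
    lambda o: o.index("hawk") < o.index("owl"),     # hawk left of owl
]
```

The exhaustive search is exactly the cross-checking of constraints that a CoT trace performs in prose and that models skip when forced to answer directly.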
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | Opus 4.5 | 93.1% |
| 2 | Sonnet 4 | 93.1% |
| 3 | GPT-5 | ~91%* |
| 4 | Gemini 2.5 Pro | 89.2% |
| 5 | Qwen 3 72B | ~88%* |
| 6 | DeepSeek V3 | ~87%* |
| 7 | Llama 4 Maverick | ~85%* |
| 8 | GPT-4o | ~83% |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
BBH proved chain-of-thought is not a minor prompting improvement but a fundamental capability unlock: Codex went from near-random to beating human performance on 17/23 tasks with 3-shot CoT.
The benchmark identified 'reasoning' as an emergent capability: performance was essentially flat across model sizes without CoT, then jumped sharply at scale with CoT.
BBH helped establish the paradigm that complex reasoning requires explicitly 'showing work' — influencing the design of o1/o3 extended reasoning models.
Google DeepMind's BBEH release is itself a finding: frontier models had saturated BBH so completely that a new harder benchmark was required within 2 years of BBH's introduction.
Variants & Related
BIG-Bench Extra Hard (BBEH)
Successor to BBH released by Google DeepMind in February 2025. Harder variants of each BBH task designed specifically because frontier models saturated the original. Still active benchmark.
Controversies & Caveats
Known limitations and criticisms.
The original CoT prompts for 3 BBH tasks (Multistep Arithmetic, Navigate, Web of Lies) contained logical errors that were later corrected. Some published results used the erroneous prompts.
BBH tasks are now well-known and documented online, meaning models may have seen CoT reasoning examples for these specific tasks during training.
Human baseline of ~50% represents average crowd workers on MTurk — not expert performance. Expert humans would score much higher on many tasks.