GAIA
Multi-step real-world tasks that are conceptually simple for humans but require tool-using agents.
What It Tests
GAIA (General AI Assistants benchmark) was designed to measure whether AI assistants can solve tasks that are 'conceptually simple for humans' but require multi-step reasoning, tool use, multimodal understanding, and multi-hop information retrieval. It was created by researchers from Meta AI and Hugging Face and introduced at ICLR 2024.
The benchmark consists of 466 questions (166 public dev, 300 private test) across 3 difficulty levels. Level 1 requires minimal steps and tools. Level 2 requires 5-10 steps with multiple coordinated tools. Level 3 requires long-horizon planning with advanced multi-tool integration. All answers are unambiguous exact strings — no partial credit.
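Scoring is mechanical enough to sketch. The snippet below is a rough approximation of exact-match grading under light normalization; the official GAIA scorer also has rules for numbers and comma-separated lists that are omitted here.

```python
def normalize(answer: str) -> str:
    """Trim whitespace, drop a trailing period, and lowercase."""
    return answer.strip().rstrip(".").lower()

def score(prediction: str, gold: str) -> int:
    """Exact string match after normalization: 1 or 0, no partial credit."""
    return int(normalize(prediction) == normalize(gold))

assert score("White House. ", "white house") == 1
assert score("the White House", "white house") == 0  # extra words score zero
```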
The 'conceptually simple' framing is key: these are questions an intelligent adult could answer, but only by combining web search, file reading, calculator use, and multi-hop reasoning across multiple sources. Standalone LLMs (without tool access) score near-random. Tool-augmented agents make the task tractable.
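What 'tool-augmented' means in practice is usually a plain observe-act loop. The sketch below is schematic: `call_llm` and the tool stubs are hypothetical stand-ins, not any specific framework's API.

```python
def call_llm(history: list[str]) -> dict:
    """Hypothetical LLM call that returns the next action.
    Hardwired here so the sketch runs end to end."""
    return {"type": "final_answer", "content": "42"}

# Hypothetical tool stubs: web search, file reading, and arithmetic are
# the capabilities GAIA questions lean on.
TOOLS = {
    "web_search": lambda query: f"search results for: {query}",
    "read_file": lambda path: f"contents of {path}",
    "calculator": lambda expr: str(eval(expr)),  # arithmetic only; unsafe in general
}

def run_agent(question: str, max_steps: int = 10) -> str:
    history = [question]
    for _ in range(max_steps):
        action = call_llm(history)
        if action["type"] == "final_answer":
            return action["content"]         # exact string handed to the grader
        observation = TOOLS[action["tool"]](action["input"])
        history.append(observation)          # feed the result back to the model
    return "no answer"

print(run_agent("How many..."))  # 42
```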
GAIA is notable for Level 3 questions that seem impossible: 'Which fruits shown in the 2008 painting Embroidery from Uzbekistan were served at the October 1949 breakfast menu of the ocean liner later used as a prop in The Last Voyage? List clockwise from 12 o'clock.' This question requires looking up a painting, identifying fruits, finding an ocean liner, finding its historical breakfast menus, and correlating the information — truly multi-hop.
Human baseline is 92%. The best AI agents now score 67%+ on the validation set, with ensemble approaches reaching 91%+.
Example Tasks
3 real examples from the benchmark.
Level 1 — Simple fact retrieval with calculation
Problem / Input
Level 1 tasks require tool use but are conceptually simple; a student could solve one with Google in two minutes.
Level 2 — File analysis + web research
Problem / Input
[Attached: a PDF of a scientific paper with a table of experimental results]
The paper reports results for Experiment 3B. Convert the mean value in column 4, row 7 of Table 2 from centimeters to inches, then find the scientific name of the species described in the abstract.

Level 2 tasks require coordinating file reading and web search. Models without file access cannot complete this at all.
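The deterministic half of such a task is trivial once the agent has actually read the file. A minimal sketch, with the extracted cell value invented for illustration:

```python
# 1 inch is defined as exactly 2.54 cm, so the conversion is a division.
CM_PER_INCH = 2.54

# Hypothetical value an agent's PDF tool might pull from Table 2,
# column 4, row 7; the real paper and number are not reproduced here.
mean_cm = 17.8
print(f"{mean_cm / CM_PER_INCH:.2f} in")  # 7.01 in
```

The hard part, and where models without file access fail outright, is the extraction step, not the arithmetic.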
Level 3 — Multi-hop research across obscure sources
Problem / Input
Which of the fruits shown in the 2008 painting 'Embroidery from Uzbekistan' were served as part of the October 1949 breakfast menu for the ocean liner later used as a floating prop in 'The Last Voyage'? List the fruits clockwise from the 12 o'clock position.

This is the most cited Level 3 GAIA example. It requires 5+ distinct web searches, synthesizing information across completely unrelated domains (Uzbek art, 1960s cinema, 1940s maritime records), and spatial reasoning about painting composition. Even the best agents often fail here.
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 67.0%† |
| 2 | Opus 4.5 | ~57% |
| 3 | Gemini 2.5 Pro | ~52% |
| 4 | GPT-4o | ~15% |
† Self-reported by the model's creator, not independently verified
Key Findings
GPT-4 with plugins scored only 15% at the benchmark's introduction in 2023. OpenAI Deep Research reached 67% by 2025, a 4.5x improvement in two years, driven largely by better agent scaffolding rather than model upgrades alone.
Human performance (92%) vs. best agent (67%) reveals a persistent ~25-point gap on tasks humans find conceptually simple. The gap is entirely in execution: multi-tool coordination, file handling, and multi-hop reasoning chains.
GAIA validated that standalone LLMs (without tools) are not viable for real research tasks — scoring near random on Level 2-3 questions. The benchmark helped establish tool-augmented agents as a distinct evaluation category.
Level 1 scores are now high (>90% for top agents) but Level 3 scores remain in the 20-40% range even for the best systems — illustrating that task complexity, not just tool access, remains a fundamental bottleneck.
Controversies & Caveats
Known limitations and criticisms.
Ensemble approaches (combining multiple agent runs) push scores to 91%+ on the validation set, near-matching human performance. These ensemble results are not comparable to single-agent performance.
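Mechanically, such ensembles are often little more than a majority vote over independent runs. A minimal sketch, with the normalization and vote rule chosen for illustration (published systems differ in the details):

```python
from collections import Counter

def ensemble_answer(runs: list[str]) -> str:
    """Pick the most common answer across independent agent runs,
    treating answers that differ only in case/whitespace as equal."""
    votes = Counter(r.strip().lower() for r in runs)
    return votes.most_common(1)[0][0]

print(ensemble_answer(["Paris", "paris ", "Lyon"]))  # paris
```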
Test set labels are private, making independent verification difficult. All public scores are on the 166-question validation set.
Level 3 questions are so dependent on specific historical records that 'correct' answers may change as internet sources are updated or removed.
The benchmark was created by Meta AI, and Manus AI (which reached SOTA at launch) was subsequently acquired by Meta — a notable potential conflict of interest in benchmark competition.