
GAIA

Multi-step real-world tasks that are conceptually simple for humans but require tool-using agents.

Tasks: 466
Year: 2023
Creator: Grégoire Mialon, Clémentine Fourrier, et al.
Metric: Accuracy (% correct)
Human Baseline: 92%

What It Tests

GAIA (General AI Assistants benchmark) was designed to measure whether AI assistants can solve tasks that are 'conceptually simple for humans' but require multi-step reasoning, tool use, multimodal understanding, and multi-hop information retrieval. It was created by Meta AI and Hugging Face researchers and introduced at ICLR 2024.

The benchmark consists of 466 questions (166 public dev, 300 private test) across 3 difficulty levels. Level 1 requires minimal steps and tools. Level 2 requires 5-10 steps with multiple coordinated tools. Level 3 requires long-horizon planning with advanced multi-tool integration. All answers are unambiguous exact strings — no partial credit.

The 'conceptually simple' framing is key: these are questions an intelligent adult could answer, but only by combining web search, file reading, calculator use, and multi-hop reasoning across multiple sources. Standalone LLMs (without tool access) score near-random. Tool-augmented agents make the task tractable.

GAIA is notable for Level 3 questions that seem impossible: 'Which fruits shown in the 2008 painting Embroidery from Uzbekistan were served at the October 1949 breakfast menu of the ocean liner later used as a prop in The Last Voyage? List clockwise from 12 o'clock.' This question requires looking up a painting, identifying fruits, finding an ocean liner, finding its historical breakfast menus, and correlating the information — truly multi-hop.

Human baseline is 92%. The best AI agents now score 67%+ on the validation set, with ensemble approaches reaching 91%+.

Task Anatomy

How a single task is structured.

Input: A natural-language question requiring multi-step reasoning and tool use. Questions are often accompanied by files (PDFs, images, spreadsheets). Level 1: simple, few steps. Level 2: medium, multiple tools needed. Level 3: complex, long-horizon planning.
Output: A single, unambiguous exact-string answer.
Evaluation: Exact string match against the verified correct answer. No partial credit. Test set answers are private (submit via the Hugging Face leaderboard).
Metric: Accuracy (% correct)
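The exact-match rule can be sketched as a scorer. This is a simplified illustration only, not the official GAIA scorer (which applies additional normalization, e.g. for numbers and comma-separated lists); the task IDs and answers below are hypothetical:

```python
def normalize(ans: str) -> str:
    # Simplified normalization: trim whitespace and lowercase.
    # The official scorer does more (number and list handling).
    return ans.strip().lower()

def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Accuracy under exact string match -- no partial credit."""
    correct = sum(
        normalize(predictions.get(task_id, "")) == normalize(answer)
        for task_id, answer in gold.items()
    )
    return correct / len(gold)

# Hypothetical example: one exact match, one miss -> 0.5
print(score({"q1": "Paris", "q2": "42"}, {"q1": " paris ", "q2": "41"}))
```

Because there is no partial credit, an answer that is correct but formatted differently than the normalizer expects scores zero, which is why GAIA prompts specify answer formats precisely.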

Example Tasks

3 real examples from the benchmark.

#1

Level 1 — Simple fact retrieval with calculation

Problem / Input

What is the population of France as of the most recent census, divided by the number of departments in France?
Answer: ~673,000 (exact value depends on the census data used)

Level 1 tasks require tool use but are conceptually simple — a student could do this with Google in 2 minutes.
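The arithmetic itself is trivial once the two facts are retrieved. A sketch with illustrative figures (the exact answer depends on which census estimate is used):

```python
# Illustrative values, not the canonical answer:
# ~68.0 million people and 101 departments (96 metropolitan + 5 overseas).
population = 68_000_000
departments = 101

print(round(population / departments))  # ~673,000, as quoted above
```

The hard part for an agent is not the division but retrieving both numbers reliably and deciding which census figure the question intends.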

#2

Level 2 — File analysis + web research

Problem / Input

[Attached: a PDF of a scientific paper with a table of experimental results]

The paper reports results for Experiment 3B. Convert the mean value in column 4, row 7 of Table 2 from centimeters to inches, then find the scientific name of the species described in the abstract.
Answer: [Two-part answer: converted measurement + scientific name]

Level 2 tasks require coordinating file reading and web search. Models without file access cannot complete this at all.
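The unit-conversion step is mechanical once the table value has been extracted from the PDF; a minimal sketch with a hypothetical table value:

```python
CM_PER_INCH = 2.54  # exact by definition

def cm_to_inches(cm: float) -> float:
    """Convert a measurement in centimeters to inches."""
    return cm / CM_PER_INCH

# Hypothetical extracted table value of 12.7 cm:
print(cm_to_inches(12.7))  # 5.0
```

As the surrounding text notes, the conversion is the easy half: the failure mode for agents is the file-reading step (locating Table 2, column 4, row 7 inside an arbitrary PDF).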

#3

Level 3 — Multi-hop research across obscure sources

Problem / Input

Which of the fruits shown in the 2008 painting 'Embroidery from Uzbekistan' were served as part of the October 1949 breakfast menu for the ocean liner later used as a floating prop in 'The Last Voyage'? List the fruits clockwise from the 12 o'clock position.
Answer: [Specific fruits in a specific order — this is the canonical hard GAIA example]

This is the most cited Level 3 GAIA example. It requires 5+ distinct web searches, synthesizing information across completely unrelated domains (Uzbek art, 1960s cinema, 1940s maritime records), and spatial reasoning about painting composition. Even the best agents often fail here.

Leaderboard Results

Model scores sorted by performance.

# | Model | Score
1 | GPT-5 | 67.0% (V)
2 | Opus 4.5 | ~57%
3 | Gemini 2.5 Pro | ~52%
4 | GPT-4o | ~15%

(V) = Self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • GPT-4 with plugins scored only 15% at the benchmark's introduction in 2023. OpenAI Deep Research reached 67% by 2025, a roughly 4.5x improvement in two years, driven largely by better agent scaffolding rather than model upgrades alone.

  • Human performance (92%) vs. best agent (67%) reveals a persistent ~25-point gap on tasks humans find conceptually simple. The gap lies largely in execution: multi-tool coordination, file handling, and multi-hop reasoning chains.

  • GAIA validated that standalone LLMs (without tools) are not viable for real research tasks — scoring near random on Level 2-3 questions. The benchmark helped establish tool-augmented agents as a distinct evaluation category.

  • Level 1 scores are now high (>90% for top agents) but Level 3 scores remain in the 20-40% range even for the best systems — illustrating that task complexity, not just tool access, remains a fundamental bottleneck.
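The improvement multiple and human gap quoted in these findings follow directly from the section's own numbers:

```python
# Figures from the findings above.
gpt4_2023 = 15.0   # GPT-4 with plugins, 2023
best_2025 = 67.0   # best single agent, 2025
human = 92.0       # human baseline

print(f"improvement: {best_2025 / gpt4_2023:.1f}x")  # improvement: 4.5x
print(f"human gap: {human - best_2025:.0f} points")  # human gap: 25 points
```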

Controversies & Caveats

Known limitations and criticisms.

Ensemble approaches (combining multiple agent runs) push scores to 91%+ on the validation set, near-matching human performance. These ensemble results are not comparable to single-agent performance.
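A minimal sketch of how such an ensemble can work, assuming a simple majority vote over exact-match answers from independent agent runs (actual ensemble methods vary and may be more sophisticated):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer across multiple agent runs.

    Exact-match scoring makes voting natural: runs that reached the
    same conclusion produce identical (normalized) strings."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][0]

# Hypothetical: three runs, two agree.
print(majority_vote(["Paris", "paris", "Lyon"]))  # paris
```

This is also why ensemble scores are not comparable to single-agent scores: the vote consumes several times the compute of one run.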

Test set labels are private, making independent verification difficult. All public scores are on the 166-question validation set.

Level 3 questions are so dependent on specific historical records that 'correct' answers may change as internet sources are updated or removed.

The benchmark was created by Meta AI, and Manus AI (which reached SOTA at launch) was subsequently acquired by Meta — a notable potential conflict of interest in benchmark competition.

Links