AgentBench
LLM agents across 8 interactive environments: OS, databases, web, games, and more.
What It Tests
AgentBench evaluates LLMs as autonomous agents across 8 distinct interactive environments, testing multi-turn decision-making, tool use, and planning. Created by Xiao Liu and colleagues at Tsinghua University and published at ICLR 2024, AgentBench was among the first systematic evaluations of LLMs as agents across diverse practical domains.
The 8 environments span: Operating System (bash command execution), Database (SQL queries and management), Knowledge Graph (SPARQL traversal), Digital Card Game (strategic gameplay), Lateral Thinking Puzzles (multi-turn deduction), House-Holding (ALFWorld simulated household tasks), Web Shopping (product navigation), and Web Browsing (Mind2Web browser tasks).
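All eight environments share the same multi-turn shape: the model receives an observation, emits one action, and repeats until the task succeeds or a turn budget runs out. The sketch below illustrates that loop with a toy environment and a scripted stand-in for the LLM; the class and function names are illustrative, not AgentBench's actual harness interface.

```python
class ToyOSEnv:
    """Toy stand-in for an AgentBench environment: the agent must
    locate and print a file by issuing bash-like commands, one per turn."""
    def step(self, action):
        # Returns (observation, done) for the agent's action.
        if action == "ls /home/user":
            return "report.txt", False
        if action == "cat /home/user/report.txt":
            return "42", True
        return "command not found", False

def run_episode(agent, env, max_turns=10):
    obs = "Task: print the contents of the only file in /home/user"
    for _ in range(max_turns):
        action = agent(obs)          # one model call per turn
        obs, done = env.step(action)
        if done:
            return obs               # task solved
    return None                      # turn budget exhausted -> failure

# A scripted 'agent' standing in for an LLM.
script = iter(["ls /home/user", "cat /home/user/report.txt"])
result = run_episode(lambda obs: next(script), ToyOSEnv())
```

The turn budget matters: as noted under Key Findings, many failures come from agents looping or losing state over long episodes rather than from a single wrong action.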
A key finding was the dramatic performance gap between commercial and open-source models. GPT-4 scored 4.27 on the composite scale while the best open-source model (Llama-2-13B-chat) scored only 1.01. Most open-source models were essentially unusable as agents. This gap has narrowed significantly since 2023 as open-source models improved.
AgentBench FC (Function Calling) was released in 2025 to test standardized tool-calling interfaces — the modern way agents interact with external systems.
Task Anatomy
How a single task is structured.
Example Tasks
3 real examples from the benchmark.
OS — Bash task execution
Problem / Input
OS tasks test whether agents can navigate a bash environment to accomplish file system, process management, and scripting tasks. They require understanding Unix command syntax and flags.
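A harness for this environment has to execute each agent-issued command and feed the output back as the next observation. Below is a minimal sketch of that execution step using Python's `subprocess`; the real AgentBench OS environment runs commands inside an isolated Docker container, and `run_bash` is a hypothetical helper, not part of the benchmark's API.

```python
import subprocess

def run_bash(command, timeout=5):
    """Run one agent-issued bash command and return what the agent
    would observe next: stdout, stderr, and the exit code."""
    proc = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout, proc.stderr, proc.returncode

out, err, code = run_bash("echo hello | tr a-z A-Z")
```

Returning stderr and the exit code alongside stdout is what lets a capable agent notice a failed command and self-correct on the next turn.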
Database — SQL query
Problem / Input
In the company database, find all employees who report to managers who were hired before 2010 and whose department's average salary is above $80,000. Return their names and salaries.

Multi-table joins with subqueries require understanding relational database structure and SQL syntax. Many open-source models fail on complex multi-condition queries.
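To make the required query concrete, here is one way the task above could be solved, run against a toy in-memory schema. The table layout, column names, and data are invented for illustration; they are not AgentBench's actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    id INTEGER PRIMARY KEY, name TEXT, salary REAL,
    hire_year INTEGER, manager_id INTEGER, dept TEXT
);
INSERT INTO employees VALUES
    (1, 'Mia',   120000, 2005, NULL, 'eng'),   -- manager, hired 2005
    (2, 'Ravi',   95000, 2015, 1,    'eng'),
    (3, 'Lena',   90000, 2016, 1,    'eng'),
    (4, 'Omar',  150000, 2012, NULL, 'sales'), -- manager, hired 2012
    (5, 'Kate',   60000, 2018, 4,    'sales');
""")

# Self-join for the manager condition, subquery for the department
# average: the two-condition structure many models get wrong.
query = """
SELECT e.name, e.salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE m.hire_year < 2010
  AND e.dept IN (
      SELECT dept FROM employees
      GROUP BY dept
      HAVING AVG(salary) > 80000
  )
ORDER BY e.name;
"""
rows = conn.execute(query).fetchall()
```

On this toy data, Kate is correctly excluded even though her department's average salary qualifies, because her manager was hired in 2012; satisfying one condition but not the other is exactly the trap in multi-condition queries.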
Web Shopping — Product selection
Problem / Input
Web shopping tasks require multi-step navigation with complex multi-criteria filtering across an e-commerce site.
Leaderboard Results
Model scores sorted by performance.
5 results
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 6.8 |
| 2 | Opus 4.5 | ~6.2 |
| 3 | Sonnet 4 | ~5.2 |
| 4 | DeepSeek V3 | ~4.5 |
| 5 | GPT-4o | 4.27 |
~ = approximate score, self-reported by the model's creator and not independently verified
Score Over Time
Performance progression across model generations.
Key Findings
The original paper's most striking finding: GPT-4 (4.27) vs. best open-source model Llama-2-13B-chat (~1.0) — a 4x+ gap on agent tasks that wasn't present on standard benchmarks like MMLU.
Agent performance is not predicted by standard benchmark scores: models that score similarly on MMLU can differ dramatically on AgentBench, suggesting agent ability is a distinct capability.
Multi-turn failure modes: agents frequently fail by losing context over 10+ turns, forgetting environment state, or getting stuck in loops rather than making individual wrong decisions.
The OS environment proved particularly diagnostic: bash command execution exposed whether models had genuine systems understanding or were just pattern-matching keywords.
Controversies & Caveats
Known limitations and criticisms.
The composite score weights 8 environments differently — the choice of weights significantly affects rankings, and different weighting schemes could change which model appears best.
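The weighting caveat is easy to demonstrate: the same per-environment scores can produce opposite rankings under different weight vectors. The numbers below are made up purely to illustrate the effect; they are not real AgentBench results.

```python
# Two hypothetical models with complementary strengths.
scores = {
    "model_a": {"os": 40, "db": 30, "web": 10},
    "model_b": {"os": 10, "db": 30, "web": 40},
}

def composite(per_env, weights):
    """Weighted sum of per-environment scores."""
    return sum(per_env[env] * w for env, w in weights.items())

os_heavy  = {"os": 0.6, "db": 0.2, "web": 0.2}
web_heavy = {"os": 0.2, "db": 0.2, "web": 0.6}

best_os  = max(scores, key=lambda m: composite(scores[m], os_heavy))
best_web = max(scores, key=lambda m: composite(scores[m], web_heavy))
# The "best" model flips depending on the weighting scheme.
```

Any single-number leaderboard built on a weighted composite inherits this sensitivity, which is why per-environment scores are worth reporting alongside the aggregate.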
The performance gap between commercial and open-source models documented in 2023 has substantially narrowed by 2025, making the benchmark's original key finding less relevant.
AgentBench FC (2025) uses a different evaluation protocol than the original, making cross-version comparisons invalid.