Agent Tasks · Active

AgentBench

LLM agents across 8 interactive environments: OS, databases, web, games, and more.

Tasks: 1,091
Year: 2023
Creator: Xiao Liu, Hao Yu, Hanchen Zhang, et al.
Metric: Composite score (weighted average across environments)

What It Tests

AgentBench evaluates LLMs as autonomous agents across 8 distinct interactive environments, testing multi-turn decision-making, tool use, and planning. Created by Xiao Liu and colleagues at Tsinghua University and published at ICLR 2024, AgentBench was among the first systematic evaluations of LLMs as agents across diverse practical domains.

The 8 environments span: Operating System (bash command execution), Database (SQL queries and management), Knowledge Graph (SPARQL traversal), Digital Card Game (strategic gameplay), Lateral Thinking Puzzles (multi-turn deduction), House-Holding (ALFWorld simulated household tasks), Web Shopping (product navigation), and Web Browsing (Mind2Web browser tasks).

A key finding was the dramatic performance gap between commercial and open-source models. GPT-4 scored 4.27 on the composite scale while the best open-source model (Llama-2-13B-chat) scored only 1.01. Most open-source models were essentially unusable as agents. This gap has narrowed significantly since 2023 as open-source models improved.

AgentBench FC (Function Calling) was released in 2025 to test standardized tool-calling interfaces — the modern way agents interact with external systems.

Task Anatomy

How a single task is structured.

Input: An environment-specific task description. The agent interacts in a multi-turn dialogue loop: observe environment state → choose action → receive feedback → repeat until the task is complete or the maximum number of steps is reached (see the loop sketch after this block).
Output: A sequence of actions in the environment-specific format (bash commands, SQL queries, SPARQL queries, card game moves, etc.).
Evaluation: Task success rate per environment, combined into a composite score across all 8 environments. OS: script output correctness. DB: query result accuracy. KG: SPARQL answer correctness. Card Game: win rate. Web: task completion.
Metric: Composite score (weighted average across environments)
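
A minimal sketch of that interaction loop in Python. The `env` and `llm` objects are hypothetical stand-ins for an environment client and a model API (not the official AgentBench harness), and the step cap is illustrative.

    MAX_STEPS = 30  # illustrative cap; the benchmark limits turns per task

    def run_episode(env, llm, task_description):
        """Drive one observe -> act -> feedback loop until the task ends."""
        history = [{"role": "user", "content": task_description}]
        observation = env.reset()
        for _ in range(MAX_STEPS):
            history.append({"role": "user", "content": observation})
            action = llm.generate(history)        # e.g. a bash command, SQL query, or click
            history.append({"role": "assistant", "content": action})
            observation, done = env.step(action)  # environment executes the action
            if done:                              # task solved, failed, or aborted
                break
        return env.score()                        # per-environment success metric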

Example Tasks

3 real examples from the benchmark.

#1

OS — Bash task execution

Medium

Problem / Input

Find all Python files in /home/user that are larger than 1MB and have been modified in the last 7 days. Print their names and sizes.
Answer: Correct file list with sizes

OS tasks test whether agents can navigate a bash environment to accomplish file system, process management, and scripting tasks. Requires understanding Unix command syntax and flags.
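One plausible way an agent could solve this task is by emitting a single GNU find command. The sketch below (Python, running the command via subprocess) is only an illustration of such an action, not the benchmark's reference solution.

    # One plausible action for this task: a single GNU find command emitted by the agent.
    import subprocess

    cmd = (
        "find /home/user -type f -name '*.py' "
        "-size +1M -mtime -7 "            # larger than 1 MiB, modified in the last 7 days
        "-printf '%p %s bytes\\n'"        # print path and size in bytes (GNU find)
    )
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(result.stdout)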

#2

Database — SQL query

Hard

Problem / Input

In the company database, find all employees who report to managers who were hired before 2010 and whose department's average salary is above $80,000. Return their names and salaries.
Answer: Employee names and salaries matching all three criteria

Multi-table joins with subqueries require understanding relational database structure and SQL syntax. Many open-source models fail on complex multi-condition queries.
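A sketch of the kind of SQL the agent must produce, reading "whose department" as the employee's own department. The table and column names (employees, manager_id, department_id, etc.) are assumed for illustration; the actual schema is supplied by the benchmark's database environment.

    # Hypothetical schema: employees(id, name, salary, hire_date, manager_id, department_id).
    # The real schema comes from the benchmark environment.
    query = """
    SELECT e.name, e.salary
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    WHERE m.hire_date < '2010-01-01'
      AND e.department_id IN (
          SELECT department_id
          FROM employees
          GROUP BY department_id
          HAVING AVG(salary) > 80000
      );
    """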

#3

Web Shopping — Product selection

Medium

Problem / Input

Find a wireless mechanical keyboard under $100 with Cherry MX switches and RGB backlighting. It should have at least 50 customer reviews. Add it to the cart.
Answer: Cart contains a qualifying product

Web shopping tasks require multi-step navigation with complex multi-criteria filtering across an e-commerce site.
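A plausible action trajectory in a WebShop-style search[]/click[] action format. The exact clickable elements depend on the simulated site, and the product id is a placeholder, so this is an illustration rather than an actual benchmark trace.

    # Illustrative trajectory only; not a real benchmark trace.
    trajectory = [
        "search[wireless mechanical keyboard cherry mx rgb]",
        "click[<candidate_product_id>]",  # open a promising result, check price and switches
        "click[Reviews]",                 # confirm at least 50 customer reviews
        "click[Add to Cart]",             # final action once all criteria check out
    ]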

Leaderboard Results

Model scores sorted by performance.

5 results

#   Model          Score
1   GPT-5          6.8
2   Opus 4.5       ~6.2
3   Sonnet 4       ~5.2
4   DeepSeek V3    ~4.5
5   GPT-4o         4.27

Note: some scores are self-reported by the model's creator and not independently verified.

Score Over Time

Performance progression across model generations.

Key Findings

  • The original paper's most striking finding: GPT-4 (4.27) vs. best open-source model Llama-2-13B-chat (~1.0) — a 4x+ gap on agent tasks that wasn't present on standard benchmarks like MMLU.

  • Agent performance is not predicted by standard benchmark scores: models that score similarly on MMLU can differ dramatically on AgentBench, suggesting agent ability is a distinct capability.

  • Multi-turn failure modes: agents frequently fail by losing context over 10+ turns, forgetting environment state, or getting stuck in loops rather than making individual wrong decisions.

  • The OS environment proved particularly diagnostic: bash command execution exposed whether models had genuine systems understanding or were just pattern-matching keywords.

Controversies & Caveats

Known limitations and criticisms.

The composite score weights 8 environments differently — the choice of weights significantly affects rankings, and different weighting schemes could change which model appears best.
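A toy illustration of that sensitivity, with made-up scores and weights rather than AgentBench's actual numbers:

    # Made-up per-environment scores and weights, purely to show ranking sensitivity.
    env_scores = {
        "model_a": {"os": 40, "db": 30, "web": 60},
        "model_b": {"os": 60, "db": 50, "web": 20},
    }

    def composite(scores, weights):
        return sum(scores[e] * w for e, w in weights.items()) / sum(weights.values())

    web_heavy = {"os": 1, "db": 1, "web": 3}
    os_heavy = {"os": 3, "db": 1, "web": 1}

    for name, scores in env_scores.items():
        print(name, composite(scores, web_heavy), composite(scores, os_heavy))
    # model_a ranks first under the web-heavy weights (50.0 vs 34.0)
    # but second under the os-heavy weights (42.0 vs 50.0).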

The performance gap between commercial and open-source models documented in 2023 has substantially narrowed by 2025, making the benchmark's original key finding less relevant.

AgentBench FC (2025) uses a different evaluation protocol than the original, making cross-version comparisons invalid.

Links