
WebArena / VisualWebArena

Autonomous browser agents completing realistic tasks on functional sandboxed websites.

Tasks: 812
Year: 2023
Creator: Shuyan Zhou, Frank F. Xu, Hao Zhu, et al.
Metric: Task success rate
Human baseline: 78.24%

What It Tests

WebArena tests whether AI agents can complete realistic, long-horizon tasks on functional websites — not simulated, but fully operational sandboxed web applications running in Docker containers. Created by researchers at CMU and published at ICLR 2024, WebArena provides five domains: an e-commerce site (OneStopShop), a social forum (Reddit-like), a software development platform (GitLab), a content management system, and a mapping service.

Tasks are expressed in natural language and require the agent to navigate, click, type, fill forms, and submit actions across 10+ steps. The key test is whether agents can accomplish real user goals rather than merely execute scripted workflows. GPT-4's baseline success rate was only 14.41%, while humans completed 78.24% of tasks.
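An episode of this kind boils down to a loop over page observations and browser actions. The sketch below is purely illustrative — the `Action` type, action names, and environment interface are hypothetical stand-ins, not the benchmark's actual API:

```python
# Illustrative sketch of a WebArena-style agent rollout.
# The Action type and the env/agent interfaces are hypothetical;
# the real benchmark defines its own observation and action spaces.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "navigate", "scroll", "submit", "stop"
    target: str = ""   # element id or URL, if any
    text: str = ""     # text to type, if any

def run_episode(env, agent, max_steps=30):
    """Roll out one task: the agent sees a page observation and emits
    browser actions until it declares "stop" or exhausts the step budget."""
    obs = env.reset()
    trace = []
    for _ in range(max_steps):
        action = agent.act(obs)
        trace.append(action)
        if action.kind == "stop":
            break
        obs = env.step(action)
    return trace
```

Note that success is judged on the final site state, not on the trace itself, so two very different traces can both pass.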

VisualWebArena extends WebArena to multimodal agents that must interpret images to complete tasks. Tasks include finding products matching a reference image, comparing visual designs, or identifying UI elements from screenshots. It comprises 910 tasks across three environments (Classifieds, Reddit, Shopping).

By 2025-2026, top agents have significantly closed the gap with humans: the best reported system reaches 71.6% on WebArena, putting the original 78.24% human baseline within reach. The benchmark remains one of the best measures of practical web-agent capability.

Task Anatomy

How a single task is structured.

Input: A natural-language task description (e.g., 'Find the top-rated noise-canceling headphones under $150 and add them to my cart'). The agent has access to a fully functional browser with the sandboxed websites.
Output: A sequence of browser actions (click, type, navigate, scroll, submit) that achieves the task goal.
Evaluation: Functional correctness of the end state. Evaluation scripts check whether the task goal was achieved (e.g., the correct item is in the cart, the correct post was made, the correct file was committed).
Metric: Task success rate
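The end-state check can be pictured as a goal predicate over the final site state. A minimal sketch, with a made-up cart structure and SKU (the real harness inspects each sandboxed app directly):

```python
def eval_cart_task(cart_items, expected_sku):
    """Grade an e-commerce task by inspecting the final cart state,
    not the sequence of actions the agent took to get there."""
    return any(item["sku"] == expected_sku for item in cart_items)

# A task passes iff the goal predicate holds on the end state.
# (Hypothetical SKU for illustration.)
cart = [{"sku": "HDPH-220", "qty": 1}]
success = eval_cart_task(cart, "HDPH-220")   # True
```

Grading end states rather than action sequences is what lets agents take any route through the site and still be scored fairly.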

Example Tasks

3 real examples from the benchmark.

#1

E-commerce — Product search and purchase

Medium

Problem / Input

On the shopping site, find the cheapest noise-canceling headphones with at least 4-star reviews and a battery life of more than 20 hours. Add them to your cart.
Answer: Cart contains the correct product

Requires multi-step navigation, reading product specifications across multiple pages, and applying multiple filter criteria simultaneously.
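Logically, the task reduces to a filter-then-minimize over product specifications. A toy sketch with invented data (the agent, by contrast, must recover these fields by reading product pages in the browser):

```python
# Hypothetical product records; in the benchmark these facts are
# scattered across listing and detail pages.
products = [
    {"name": "QuietMax 300", "price": 129.99, "rating": 4.3, "battery_h": 30},
    {"name": "BassLine Pro",  "price":  99.99, "rating": 3.8, "battery_h": 40},
    {"name": "AeroSilence",   "price": 149.50, "rating": 4.6, "battery_h": 25},
]

# Apply both constraints, then pick the cheapest qualifying product.
qualifying = [p for p in products if p["rating"] >= 4.0 and p["battery_h"] > 20]
cheapest = min(qualifying, key=lambda p: p["price"])
print(cheapest["name"])  # QuietMax 300
```

The hard part for an agent is not this final computation but gathering the fields reliably across many pages without losing track of candidates.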

#2

GitLab — Repository management

Hard

Problem / Input

On GitLab, find the most recently updated repository in the 'ml-team' group that has more than 5 open issues, and add a label 'needs-review' to all issues that haven't been commented on.
Answer: All uncommented open issues in the correct repo have the 'needs-review' label

Complex multi-entity task requiring tracking state across multiple pages and making conditional decisions per item.
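The per-item branching looks like the sketch below, run over made-up issue records; the real agent has to reconstruct this state by visiting GitLab pages through the browser rather than calling any API:

```python
def label_uncommented_issues(issues, label="needs-review"):
    """Add `label` to every open issue that has no comments.
    Each issue is a plain dict; returns the titles of labeled issues."""
    labeled = []
    for issue in issues:
        if issue["state"] == "open" and issue["comment_count"] == 0:
            issue.setdefault("labels", []).append(label)
            labeled.append(issue["title"])
    return labeled

# Hypothetical issue list for illustration.
issues = [
    {"title": "crash on login",  "state": "open",   "comment_count": 0},
    {"title": "add dark mode",   "state": "open",   "comment_count": 3},
    {"title": "old regression",  "state": "closed", "comment_count": 0},
]
label_uncommented_issues(issues)  # labels only "crash on login"
```

The conditional per item is trivial in code; what makes the benchmark task hard is carrying that decision state across many page loads without a persistent data structure.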

#3

VisualWebArena — Image-based product search

Medium (VisualWebArena)

Problem / Input

[Attached image: a red leather handbag with gold hardware] Find a handbag on the shopping site that looks most similar to the one in the image. Add it to your wishlist.
Answer: Wishlist contains a visually similar handbag

VisualWebArena tasks require multimodal agents that can match visual features across images, a capability that text-only agents lack entirely.
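Visual matching of this kind is commonly implemented by comparing image embeddings. A minimal cosine-similarity sketch with placeholder vectors (a real agent would obtain embeddings from a vision encoder such as CLIP; the catalog here is invented):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_emb, catalog):
    """Return the catalog item whose embedding is closest to the query."""
    return max(catalog, key=lambda item: cosine_sim(query_emb, item["emb"]))

# Placeholder 512-d embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=512)   # embedding of the reference handbag image
catalog = [{"name": f"bag-{i}", "emb": rng.normal(size=512)} for i in range(5)]
best = most_similar(query, catalog)
```

Nothing in this ranking step is hard; the gap to the ~89% human score comes from producing embeddings that actually capture fine-grained attributes like hardware color and leather texture.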

Leaderboard Results

Model scores sorted by performance.

5 results

#   Model             Score
1   Opus 4.5          71.6%
2   GPT-5             58.1%
3   Gemini 2.5 Pro    ~48%
4   Sonnet 4          ~42%
5   GPT-4o            14.41%

V = self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • GPT-4's 14.41% baseline at launch (vs 78.24% human) was a wake-up call: even the best 2023 model could not reliably navigate websites — a task any internet user performs daily.

  • Progress from 14% (2023) to 71% (2025) is almost entirely from better agent scaffolding, memory, and planning — demonstrating that agent framework design matters as much as model capability.

  • VisualWebArena revealed a new bottleneck: even agents that excel on text-based WebArena struggle when tasks require visual image matching, falling to ~16-36% vs. 89% for humans.

  • WebArena has become the standard for browser agent evaluation, spawning multiple derivative benchmarks (AssistGUI, OSWorld, ScreenSpot) that extend to desktop applications.

Variants & Related

VisualWebArena

910 tasks · Active

910 tasks requiring visual reasoning — finding products by image, comparing designs, reading UI screenshots. GPT-4V baseline ~16%; best systems ~33-36%; human ~89%.

TheAgentCompany

175 tasks · Active

CMU's simulated software company environment. Related benchmark using office tools (GitLab, RocketChat, Plane) with 16 AI colleagues. 175 tasks across 6 professional domains.

Controversies & Caveats

Known limitations and criticisms.

Human baseline of 78.24% is from crowd workers — professional users would score higher. The 'human vs. agent' comparison understates human capability.

Sandboxed environments may not fully replicate the complexity of real websites (no CAPTCHAs, consistent state, no rate limiting).

Top agent scores (e.g., the leading 71.6% result) depend heavily on scaffolding and prompting strategies, making model-to-model comparison difficult when different agent frameworks are used.
