WebArena / VisualWebArena
Autonomous browser agents completing realistic tasks on functional sandboxed websites.
What It Tests
WebArena tests whether AI agents can complete realistic, long-horizon tasks on functional websites — not simulated, but fully operational sandboxed web applications running in Docker containers. Created by researchers at CMU and published at ICLR 2024, WebArena provides five domains: an e-commerce site (OneStopShop), a social forum (Reddit-like), a software development platform (GitLab), a content management system, and a mapping service.
Tasks are expressed in natural language and require the agent to navigate, click, type, fill forms, and submit, often across 10+ steps. The key test is whether agents can accomplish real user goals, not just execute scripted workflows. GPT-4's baseline success rate was only 14.41%, while humans completed 78.24% of tasks.
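A minimal sketch of the observe-act loop a WebArena-style agent runs, assuming Playwright for browser control; `choose_action` is a hypothetical stand-in for the LLM policy, not the benchmark's actual harness:

```python
from playwright.sync_api import sync_playwright

def choose_action(observation: str, goal: str) -> dict:
    """Hypothetical LLM policy: maps (observation, goal) to an action,
    e.g. {"op": "click", "selector": "text=Add to Cart"} or {"op": "stop"}."""
    raise NotImplementedError

def run_episode(start_url: str, goal: str, max_steps: int = 30) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            # Real agents observe an accessibility tree or trimmed DOM;
            # raw page.content() is a crude stand-in here.
            action = choose_action(page.content(), goal)
            if action["op"] == "stop":          # agent declares the task done
                break
            if action["op"] == "click":
                page.click(action["selector"])
            elif action["op"] == "type":
                page.fill(action["selector"], action["text"])
```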
VisualWebArena extends WebArena to multimodal agents that must interpret images to complete tasks. Tasks include finding products matching a reference image, comparing visual designs, or identifying UI elements from screenshots. 910 tasks across 3 environments (Classifieds, Reddit, Shopping).
By 2025-2026, top agents have significantly closed the gap with humans: the leading result reaches 71.6% on WebArena, putting the original 78.24% human baseline within reach. The benchmark remains one of the best measures of practical web agent capability.
Task Anatomy
How a single task is structured.
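Each task pairs a natural-language intent with a start URL into the sandbox and a programmatic success check. The sketch below (as a Python dict) approximates the shape of the released JSON configs; field names are close but not guaranteed to match exactly:

```python
# Approximate shape of a WebArena task config (field names illustrative).
task = {
    "task_id": 101,
    "sites": ["shopping"],                   # which sandboxed app(s) the task uses
    "start_url": "http://<sandbox-host>/",   # placeholder for the Docker sandbox URL
    "intent": "Find the cheapest noise-canceling headphones with at least "
              "4-star reviews and >20h battery life, and add them to your cart.",
    "eval": {
        # Success is scored programmatically: matching a final URL, a string
        # on the page, or a query against the site's resulting state.
        "eval_types": ["program_html"],
        "reference_answers": None,
    },
}
```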
Example Tasks
3 real examples from the benchmark.
E-commerce — Product search and purchase
Problem / Input
On the shopping site, find the cheapest noise-canceling headphones with at least 4-star reviews and a battery life of more than 20 hours. Add them to your cart.
Requires multi-step navigation, reading product specifications across multiple pages, and applying multiple filter criteria simultaneously.
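Once the specs are gathered, the selection itself is trivial; the sketch below shows the computation the agent must effectively perform (product records are hypothetical):

```python
# The filter-and-select step, over hypothetical scraped product records.
products = [
    {"name": "A", "price": 199.0, "rating": 4.5, "battery_hours": 30},
    {"name": "B", "price": 149.0, "rating": 3.9, "battery_hours": 25},
    {"name": "C", "price": 179.0, "rating": 4.2, "battery_hours": 22},
]
qualifying = [p for p in products
              if p["rating"] >= 4.0 and p["battery_hours"] > 20]
cheapest = min(qualifying, key=lambda p: p["price"])   # "C": A and C qualify, C is cheaper
```

The agent gets no such table: it must build these records itself by visiting product pages, which is where most failures occur.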
GitLab — Repository management
Problem / Input
On GitLab, find the most recently updated repository in the 'ml-team' group that has more than 5 open issues, and add a label 'needs-review' to all issues that haven't been commented on.
Complex multi-entity task requiring tracking state across multiple pages and making conditional decisions per item.
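The conditional logic, isolated from the browsing (plain dicts rather than the real GitLab API; all field names hypothetical):

```python
# Hypothetical state the agent must assemble by navigating GitLab pages.
repos = [
    {"name": "trainer",  "updated": "2025-11-02", "open_issues": 8,
     "issues": [{"id": 1, "comments": 0}, {"id": 2, "comments": 3}]},
    {"name": "eval-lib", "updated": "2025-12-01", "open_issues": 3, "issues": []},
]
candidates = [r for r in repos if r["open_issues"] > 5]
target = max(candidates, key=lambda r: r["updated"])     # most recently updated qualifier
for issue in target["issues"]:
    if issue["comments"] == 0:                           # per-item conditional decision
        issue.setdefault("labels", []).append("needs-review")
```

The benchmark's difficulty is that this state lives across many pages: the agent must gather it through navigation, then act on each qualifying issue individually.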
VisualWebArena — Image-based product search
Problem / Input
A typical task supplies a reference image and asks the agent to find a matching listing on the sandboxed site. Such tasks require multimodal agents that can match visual features across images, a capability that text-only agents lack entirely.
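The benchmark does not prescribe a matching method, but one common scaffold trick is embedding similarity. A sketch using the public CLIP weights on Hugging Face (model choice and usage are illustrative, not part of the benchmark):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between two images' CLIP embeddings."""
    inputs = processor(images=[Image.open(path_a), Image.open(path_b)],
                       return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])

# A scaffold could rank candidate product thumbnails against the task's
# reference image and open the highest-scoring listing.
```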
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | Opus 4.5 | 71.6% |
| 2 | GPT-5 | 58.1% |
| 3 | Gemini 2.5 Pro | ~48% |
| 4 | Sonnet 4 | ~42% |
| 5 | GPT-4 (original baseline) | 14.41% |
Note: some scores are self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
GPT-4's 14.41% baseline at launch (vs 78.24% human) was a wake-up call: even the best 2023 model could not reliably navigate websites — a task any internet user performs daily.
Progress from 14% (2023) to 71% (2025) is almost entirely from better agent scaffolding, memory, and planning, demonstrating that agent framework design matters as much as model capability (see the sketch after these findings).
VisualWebArena revealed a new bottleneck: even agents that excel on text-based WebArena struggle when tasks require visual image matching, falling to ~16-36% vs. 89% for humans.
WebArena has become the standard for browser agent evaluation, spawning multiple derivative benchmarks (AssistGUI, OSWorld, ScreenSpot) that extend to desktop applications.
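A caricature of the scaffolding pattern behind that climb: the model is called inside a loop that persists a plan and a memory of past steps, rather than once per task. Every name below is hypothetical:

```python
def llm(prompt: str) -> str:
    """Hypothetical call to any chat model."""
    raise NotImplementedError

def run(goal: str, observe, execute, max_steps: int = 30) -> bool:
    plan = llm(f"Break this web task into steps: {goal}")   # planning
    memory: list[str] = []                                  # memory across steps
    for _ in range(max_steps):
        obs = observe()                                     # e.g. accessibility tree
        action = llm(f"Goal: {goal}\nPlan: {plan}\n"
                     f"Recent: {memory[-5:]}\nPage: {obs}\nNext action:")
        if action.strip() == "stop":
            return True
        execute(action)                                     # click / type / navigate
        memory.append(action)
    return False
```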
Variants & Related
VisualWebArena
910 tasks requiring visual reasoning — finding products by image, comparing designs, reading UI screenshots. GPT-4V baseline ~16%; best systems ~33-36%; human ~89%.
TheAgentCompany
CMU's simulated software company environment. Related benchmark using office tools (GitLab, RocketChat, Plane) with 16 AI colleagues. 175 tasks across 6 professional domains.
Controversies & Caveats
Known limitations and criticisms.
Human baseline of 78.24% is from crowd workers — professional users would score higher. The 'human vs. agent' comparison understates human capability.
Sandboxed environments may not fully replicate the complexity of real websites (no CAPTCHAs, consistent state, no rate limiting).
Top agent scores (71.6% at the high end) depend heavily on scaffolding and prompting strategies, making model-to-model comparison difficult when different agent frameworks are used.