agent · NEW · Pending curation

OSWorld

369 real-world computer-use tasks on Ubuntu, Windows, and macOS that require GUI navigation, desktop app control, and multi-app workflows. Humans complete about 72% of the tasks; the best model at launch completed about 12% (NeurIPS 2024).

Year: 2024

Why our crawl picked it up

Notes the discovery agent wrote when proposing this benchmark.

First benchmark providing full live OS environments (not simulations) for agent evaluation. Covers a much broader scope than WebArena (web-only), testing file I/O, spreadsheets, email clients, and cross-app workflows. Custom per-task evaluation scripts enable reliable automated scoring.
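
To make the idea of per-task evaluation scripts concrete, here is a minimal sketch of a checker that scores a task by inspecting the final state of the environment rather than the agent's actions. It is an illustration only and not OSWorld's actual code: the task id `fill-totals`, the file name `orders.csv`, and the functions `check_totals_csv` and `score` are all hypothetical.

```python
import csv
from pathlib import Path
from typing import Callable, Dict


def check_totals_csv(workdir: Path) -> bool:
    """Hypothetical checker: the agent was asked to export orders.csv
    with a 'total' column equal to price * quantity on every row."""
    target = workdir / "orders.csv"
    if not target.exists():
        return False  # the agent never produced the file
    with target.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return False  # an empty export does not count as success
    try:
        return all(
            abs(float(r["price"]) * float(r["quantity"]) - float(r["total"])) < 1e-6
            for r in rows
        )
    except (KeyError, TypeError, ValueError):
        return False  # missing column or non-numeric cell


# Hypothetical registry mapping each task id to its own checker.
CHECKERS: Dict[str, Callable[[Path], bool]] = {
    "fill-totals": check_totals_csv,
}


def score(task_id: str, workdir: Path) -> float:
    """Return 1.0 if the task's checker passes on the final state, else 0.0."""
    return 1.0 if CHECKERS[task_id](workdir) else 0.0
```

Because the checker looks only at the resulting file, any sequence of GUI actions that produces a correct export earns credit, which is one reason outcome-based scripts suit open-ended desktop tasks.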

Source

Primary source ↗

This entry was added by an automated crawl and hasn't been curated yet. Once it's reviewed and promoted into the bundled set, you'll see task anatomy, examples, scores, and richer context here.