agent · NEW · Pending curation
OSWorld
369 real-world computer-use tasks on Ubuntu, Windows, and macOS that require GUI navigation, desktop app control, and multi-app workflows. Humans complete about 72% of the tasks; the best model completed about 12% at launch (NeurIPS 2024).
Year: 2024
Why our crawl picked it up
Notes the discovery agent wrote when proposing this benchmark.
The first benchmark to provide full live OS environments (not simulations) for agent evaluation. It covers a much broader scope than the web-only WebArena, testing file I/O, spreadsheets, email clients, and cross-app workflows. Custom per-task evaluation scripts enable reliable automated scoring.
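To make the idea of per-task evaluation scripts concrete, here is a minimal sketch in Python. It is not OSWorld's actual harness or API: the `Task` class, the `score` function, the file names, and the stand-in agents are all hypothetical, and a temporary directory stands in for a live OS environment. The point it illustrates is that each task carries its own checker, which inspects the final state of the environment rather than the agent's actions.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task: an instruction, an initial state, and a checker."""
    task_id: str
    instruction: str
    setup: Callable[[Path], None]    # prepares the initial environment state
    checker: Callable[[Path], bool]  # inspects the state after the agent finishes


def setup_draft(workdir: Path) -> None:
    """Create the file the agent is asked to rename."""
    (workdir / "draft.txt").write_text("Q3 summary\n")


def check_report_renamed(workdir: Path) -> bool:
    """Pass only if the old file is gone and the new one keeps the content."""
    old, new = workdir / "draft.txt", workdir / "report.txt"
    return (not old.exists()) and new.exists() and new.read_text() == "Q3 summary\n"


def score(tasks, run_agent) -> float:
    """Run each task in a fresh directory and return the fraction passed."""
    passed = 0
    for task in tasks:
        with tempfile.TemporaryDirectory() as tmp:
            workdir = Path(tmp)
            task.setup(workdir)
            run_agent(task.instruction, workdir)
            passed += int(task.checker(workdir))
    return passed / len(tasks)


def renaming_agent(instruction: str, workdir: Path) -> None:
    """Stand-in agent that performs the rename correctly."""
    (workdir / "draft.txt").rename(workdir / "report.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    """Stand-in agent that does nothing."""


TASKS = [
    Task("rename-report", "Rename draft.txt to report.txt",
         setup_draft, check_report_renamed),
]

if __name__ == "__main__":
    print(score(TASKS, renaming_agent))
    print(score(TASKS, idle_agent))
```

Because the checker looks only at the resulting state, any sequence of actions that produces the right outcome passes, which is one reason this style of scoring can be automated reliably.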
This entry was added by an automated crawl and hasn't been curated yet. Once it's reviewed and promoted into the bundled set, you'll see task anatomy, examples, scores, and richer context here.