TheAgentCompany
A simulated software company, staffed by 16 AI colleagues, that tests agents on real office-work tasks.
What It Tests
TheAgentCompany is a benchmark that simulates a complete software company environment with 16 AI-simulated colleagues. Created by researchers at CMU (Frank F. Xu, Yufan Song, Boxuan Li, et al.), first released in late 2024 and presented at NeurIPS, it tests AI agents on professional tasks across software engineering, project management, HR, finance, administration, and data analysis.
The benchmark uses Docker-hosted instances of real office software: GitLab (code repositories), Plane (project management, similar to Linear), ownCloud (file storage, similar to Google Drive), and RocketChat (messaging, similar to Slack). Agents interact with all of these tools and communicate with the 16 simulated colleagues, who have distinct personas, backstories, and knowledge, adding social complexity to task completion.
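As a rough illustration of that setup, the sketch below shows how an agent harness might record the four Docker-hosted services and check that they are reachable before starting a task. The hostnames, ports, and the `check_services` helper are hypothetical placeholders, not values or code from the benchmark.

```python
import requests

# Hypothetical local endpoints for the Docker-hosted services; the real
# benchmark defines its own hostnames, ports, and credentials.
SERVICES = {
    "gitlab": "http://localhost:8929",      # code repositories
    "plane": "http://localhost:8091",       # project management
    "owncloud": "http://localhost:8092",    # file storage
    "rocketchat": "http://localhost:3000",  # messaging / simulated colleagues
}

def check_services(timeout: float = 5.0) -> dict[str, bool]:
    """Return a reachability flag for each service before the agent starts work."""
    status = {}
    for name, base_url in SERVICES.items():
        try:
            resp = requests.get(base_url, timeout=timeout)
            status[name] = resp.status_code < 500
        except requests.RequestException:
            status[name] = False
    return status

if __name__ == "__main__":
    print(check_services())
```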
Tasks range from 'commit a bugfix to the main branch' (technical) to 'process the quarterly payroll spreadsheet' (administrative) to 'send a project status update to all stakeholders' (communications). This breadth tests whether agents can navigate the full spectrum of white-collar knowledge work, not just coding.
Top scores: Gemini 2.5 Pro, run through the OpenHands agent framework, reaches 30.3% full task completion (39.3% with partial credit). Community-submitted results have pushed the top score to 42.9% with specialized agent systems. The benchmark costs approximately $4 per run for frontier models.
Task Anatomy
How a single task is structured.
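Each task is graded against a list of verifiable checkpoints, with partial credit for completing a subset of them. The sketch below is only an illustrative reconstruction of that idea in Python; the field names, point values, and scoring arithmetic are invented for clarity and are not the benchmark's actual schema or formula.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One verifiable step of a task, worth a fixed number of points."""
    description: str
    points: int
    passed: bool = False

@dataclass
class Task:
    """Illustrative task structure: a natural-language instruction plus graded checkpoints."""
    instruction: str
    checkpoints: list[Checkpoint]

    def full_completion(self) -> bool:
        return all(c.passed for c in self.checkpoints)

    def partial_score(self) -> float:
        total = sum(c.points for c in self.checkpoints)
        earned = sum(c.points for c in self.checkpoints if c.passed)
        return earned / total if total else 0.0

# Toy instance mirroring the bug-triage example below.
bug_triage = Task(
    instruction="Fix GitLab issue #347 in the checkout module and notify the backend team.",
    checkpoints=[
        Checkpoint("Issue #347 labelled 'in-progress' in GitLab", 1),
        Checkpoint("Merge request opened with a plausible fix", 2),
        Checkpoint("Backend team notified on RocketChat", 1),
    ],
)
```

Under this toy scoring, passing two of the three checkpoints still earns partial credit even though full completion fails, which is how an agent can reach a higher partial score than full-completion rate.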
Example Tasks
3 real examples from the benchmark.
Software Engineering — Bug triage
Problem / Input
There's a bug reported in the checkout module (GitLab issue #347). Look at the issue, find the relevant code, implement a fix, and create a pull request. Tag the issue as 'in-progress' and notify the backend team on RocketChat. Requires coordinating three different tools; partial credit is given for completing subsets of steps.
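As a hedged sketch of what the multi-tool portion of this task looks like programmatically, the snippet below tags the issue, opens a merge request, and posts the notification using the standard GitLab and RocketChat REST APIs. The base URLs, tokens, project ID, branch name, and channel are placeholders, not benchmark values.

```python
import requests

GITLAB = "http://localhost:8929/api/v4"      # hypothetical GitLab base URL
ROCKETCHAT = "http://localhost:3000/api/v1"  # hypothetical RocketChat base URL
GITLAB_TOKEN = "glpat-..."                   # placeholder personal access token
RC_HEADERS = {"X-Auth-Token": "...", "X-User-Id": "..."}  # placeholder RocketChat auth

project_id, issue_iid = 42, 347              # placeholder project; issue #347 from the task

# 1. Tag the issue as 'in-progress' (GitLab issues API).
requests.put(
    f"{GITLAB}/projects/{project_id}/issues/{issue_iid}",
    headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
    json={"add_labels": "in-progress"},
).raise_for_status()

# 2. Open a merge request for the fix branch (GitLab merge requests API).
mr = requests.post(
    f"{GITLAB}/projects/{project_id}/merge_requests",
    headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
    json={
        "source_branch": "fix/checkout-347",
        "target_branch": "main",
        "title": "Fix checkout bug reported in #347",
    },
)
mr.raise_for_status()

# 3. Notify the backend team on RocketChat.
requests.post(
    f"{ROCKETCHAT}/chat.postMessage",
    headers=RC_HEADERS,
    json={"channel": "#backend", "text": f"Fix for issue #347 is up: {mr.json()['web_url']}"},
).raise_for_status()
```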
HR — Onboarding coordination
Problem / Input
New hire Alex starts Monday. Create their accounts on all internal systems, add them to the engineering team in Plane, share the onboarding documents from ownCloud with them, and send a welcome message on RocketChat. Get their GitHub handle from Sarah in HR. Requires gathering information from a colleague (a social step), then executing four separate tool operations; this 'social + tool' combination is a distinctive feature of TheAgentCompany.
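The distinctive step here is blocking on a simulated colleague before any tool work can start. Below is a minimal sketch of that pattern, assuming the RocketChat REST API for the outgoing direct message; `read_latest_dm` is a hypothetical stub standing in for however the harness reads the reply, and all URLs and credentials are placeholders.

```python
import time
from typing import Optional

import requests

ROCKETCHAT = "http://localhost:3000/api/v1"               # hypothetical base URL
RC_HEADERS = {"X-Auth-Token": "...", "X-User-Id": "..."}  # placeholder auth

def ask_colleague(username: str, question: str) -> None:
    """Send a direct message to a simulated colleague via RocketChat."""
    requests.post(
        f"{ROCKETCHAT}/chat.postMessage",
        headers=RC_HEADERS,
        json={"channel": f"@{username}", "text": question},
    ).raise_for_status()

def read_latest_dm(username: str) -> Optional[str]:
    """Placeholder stub: a real harness would read the DM history through
    RocketChat's API. Stubbed here so the sketch stays self-contained."""
    return None

def wait_for_reply(username: str, poll_seconds: int = 10, max_polls: int = 30) -> str:
    """Poll for the colleague's answer before moving on to the tool steps."""
    for _ in range(max_polls):
        reply = read_latest_dm(username)
        if reply:
            return reply
        time.sleep(poll_seconds)
    raise TimeoutError(f"No reply from {username}")

# 1. Social step: get the GitHub handle from Sarah before touching any tools.
ask_colleague("sarah", "Hi Sarah, what is Alex's GitHub handle for onboarding?")
github_handle = wait_for_reply("sarah")

# 2. Tool steps would follow: create accounts, add Alex to the Plane team,
#    share the ownCloud onboarding docs, and post a welcome message.
```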
Finance — Expense processing
Problem / Input
ownCloud UI navigation is consistently the hardest part for agents: the web-based office suite has complex navigation that models struggle with far more than they do with CLI tools.
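For contrast, the files themselves are easy to reach programmatically through ownCloud's standard WebDAV endpoint, which suggests the difficulty is UI navigation rather than file access. A minimal sketch, with a hypothetical base URL, credentials, and file path:

```python
import requests

OWNCLOUD = "http://localhost:8092"  # hypothetical ownCloud base URL
AUTH = ("agent", "password")        # placeholder credentials

# Download a spreadsheet over WebDAV instead of navigating the web UI.
# /remote.php/webdav/ is ownCloud's standard WebDAV root.
resp = requests.get(
    f"{OWNCLOUD}/remote.php/webdav/Finance/expenses-q3.xlsx",  # hypothetical path
    auth=AUTH,
)
resp.raise_for_status()
with open("expenses-q3.xlsx", "wb") as f:
    f.write(resp.content)
```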
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | Gemini 2.5 Pro | 30.3% |
| 2 | Sonnet 4 | 26.3% |
| 3 | GPT-5 | ~24% |
| 4 | GPT-4o | 8.6% |
| 5 | Llama 4 Maverick | ~7.4% |
V = self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
Agents perform substantially better on technical tasks (software engineering: ~42% completion) than on social/administrative tasks (HR: ~18%, Finance: ~22%) — a gap that doesn't appear in coding-only benchmarks.
The social coordination requirement (messaging colleagues for information) adds meaningful difficulty: tasks requiring inter-agent information gathering fail significantly more often than single-tool tasks.
Cost per run ($4 for Gemini 2.5 Pro, under $1 for Gemini 2.0 Flash) makes the benchmark tractable for routine evaluation in production settings, an important practical consideration.
Community leaderboard submissions (TTE-MatrixAgent at 42.9%) substantially exceed the original paper baselines, suggesting specialized agent scaffolding on top of base models matters more than the base model itself for this benchmark.
Controversies & Caveats
Known limitations and criticisms.
The 16 AI colleagues simulate social complexity, but real office work involves much richer social dynamics, organizational politics, and ambiguous communication that the simulated colleagues don't capture.
ownCloud (web file storage) is disproportionately hard for agents compared to CLI tools — potentially making the benchmark less about agent intelligence and more about UI navigation ability.
At $4 per run for frontier models, the benchmark is still relatively expensive to evaluate at scale, and community results often use cheaper models whose scores may not reflect frontier capability.