benchmark.darvinyi.com
Agent Tasks · Active

TheAgentCompany

A simulated software company, staffed with 16 AI colleagues, that tests agents on realistic office work tasks.

Tasks: 175
Year: 2024
Creator: Frank F. Xu, Yufan Song, Boxuan Li, et al.
Metric: Task completion rate (binary) and partial completion score

What It Tests

TheAgentCompany is a benchmark that simulates a complete software company environment with 16 AI-simulated colleagues. Created by researchers at CMU (Frank F. Xu, Yufan Song, Boxuan Li, et al.) and presented at NeurIPS 2024/2025, it tests AI agents on professional tasks across software engineering, project management, HR, finance, administration, and data analysis.

The benchmark uses Docker-hosted instances of real office software: GitLab (code repositories), Plane (project management, like Linear), ownCloud (file storage, like Google Drive), and RocketChat (messaging, like Slack). Agents can interact with all of these tools and communicate with 16 simulated AI colleagues who have distinct personas, backstories, and knowledge — adding social complexity to task completion.

Tasks range from 'commit a bugfix to the main branch' (technical) to 'process the quarterly payroll spreadsheet' (administrative) to 'send a project status update to all stakeholders' (communications). This breadth tests whether agents can navigate the full spectrum of white-collar knowledge work, not just coding.

Top scores: Gemini 2.5 Pro via OpenHands reaches 30.3% on full task completion (39.3% partial). Community-submitted results have pushed to 42.9% with specialized agent systems. The benchmark costs approximately $4 per run for frontier models.

Task Anatomy

How a single task is structured.

Input: A task description from a 'manager' or 'colleague' perspective. The agent has access to GitLab, Plane, ownCloud, RocketChat, and can communicate with 16 AI-simulated colleagues who may have relevant information.
Output: Completion of the specified task across one or more applications (committed code, filed document, sent message, updated project status, etc.).
Evaluation: Automated scripts check whether specific state changes were made correctly: commit exists in the correct branch, file has correct content, message was sent to correct recipients, expense report has correct figures, etc.
Metric: Task completion rate (binary) and partial completion score
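The checkpoint-style evaluation described above can be sketched as a small harness: each checkpoint is a predicate over a snapshot of workspace state, and the task is binary-complete only if every predicate passes. This is a minimal illustration, not the benchmark's actual checker code; in the real benchmark the checks query live GitLab/Plane/ownCloud/RocketChat instances, and all field and checkpoint names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Checkpoint:
    """One verifiable state change the task must produce."""
    description: str
    check: Callable[[dict], bool]  # predicate over a workspace snapshot

def evaluate(state: dict, checkpoints: list[Checkpoint]) -> dict:
    """Run every checkpoint; binary completion requires all to pass."""
    results = {cp.description: cp.check(state) for cp in checkpoints}
    passed = sum(results.values())
    return {
        "results": results,
        "passed": passed,
        "total": len(checkpoints),
        "completed": passed == len(checkpoints),  # the binary metric
    }

# Illustrative checkpoints mirroring the evaluation examples above
# (commit on branch, file content, message delivered); keys are made up.
checkpoints = [
    Checkpoint("commit exists on target branch",
               lambda s: "bugfix" in s["gitlab"]["main"]),
    Checkpoint("report file has expected content",
               lambda s: s["owncloud"]["report.md"].startswith("Q3")),
    Checkpoint("status message reached the channel",
               lambda s: len(s["rocketchat"]["#general"]) > 0),
]

state = {
    "gitlab": {"main": ["bugfix"]},
    "owncloud": {"report.md": "Q3 summary of expenses"},
    "rocketchat": {"#general": []},  # message never sent
}
report = evaluate(state, checkpoints)
# Two of three checkpoints pass, so binary completion is False.
```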

Example Tasks

3 real examples from the benchmark.

#1

Software Engineering — Bug triage

Medium

Problem / Input

There's a bug reported in the checkout module (GitLab issue #347). Look at the issue, find the relevant code, implement a fix, and create a pull request. Tag the issue as 'in-progress' and notify the backend team on RocketChat.
Answer: Pull request open, issue tagged, RocketChat message sent

Requires coordinating 3 different tools. Partial credit is given for completing subsets of steps.
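The partial-credit metric can be made concrete with a small scoring function. One common reading of the benchmark's partial completion score is that half the credit scales with checkpoints passed and the other half is granted only on full completion; treat that exact weighting as an assumption for illustration.

```python
def partial_score(passed: int, total: int) -> float:
    """Checkpoint-weighted score: half the credit scales with checkpoints
    passed, the other half is awarded only on full completion.
    (Assumed weighting, modeled on the benchmark's partial-score idea.)"""
    base = 0.5 * passed / total
    return base + 0.5 if passed == total else base

# Task #1 has three steps: PR opened, issue tagged, message sent.
partial_score(3, 3)  # full completion -> 1.0
partial_score(2, 3)  # two of three steps -> 0.5 * 2/3
```

Under this weighting, finishing two of three steps earns about a third of the credit rather than two thirds, which penalizes near-misses and rewards end-to-end completion.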

#2

HR — Onboarding coordination

Hard

Problem / Input

New hire Alex starts Monday. Create their accounts on all internal systems, add them to the engineering team in Plane, share the onboarding documents from ownCloud with them, and send a welcome message on RocketChat. Get their GitHub handle from Sarah in HR.
Answer: All accounts created, team membership added, documents shared, welcome sent

Requires gathering information from a colleague (social task), then executing 4 separate tool operations. This 'social + tool' combination is a distinctive feature of TheAgentCompany.

#3

Finance — Expense processing

Medium

Problem / Input

Process the Q3 expense reports in the Finance folder on ownCloud. For each employee who spent over $500, create a Plane task assigned to their manager for approval. Send a summary to the CFO on RocketChat.
Answer: Plane tasks created, RocketChat message sent with correct summary

ownCloud UI navigation is consistently the hardest part for agents — the web-based office suite has complex navigation that models struggle with more than CLI tools.
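The core logic of this finance task (filter expenses over a threshold, create an approval task per employee, summarize) is simple in isolation, which is exactly why UI navigation dominates the difficulty. A sketch of just the data-handling step, with all record fields and names hypothetical:

```python
# Hypothetical Q3 expense records; in the benchmark these would be
# read from spreadsheets in the ownCloud Finance folder.
expenses = [
    {"employee": "Alex",  "manager": "Dana", "amount": 612.40},
    {"employee": "Sam",   "manager": "Dana", "amount": 180.00},
    {"employee": "Priya", "manager": "Lee",  "amount": 945.10},
]

# Employees over the $500 threshold each need a Plane approval task
# assigned to their manager.
over_limit = [e for e in expenses if e["amount"] > 500]
tasks = [
    f"Approve {e['employee']}'s Q3 expenses (${e['amount']:.2f}) "
    f"[assignee: {e['manager']}]"
    for e in over_limit
]

# One-line summary for the CFO on RocketChat.
summary = f"{len(over_limit)} of {len(expenses)} employees exceeded $500 in Q3."
```

The hard part for agents is not this filtering but performing the equivalent clicks and form fills through ownCloud's and Plane's web interfaces.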

Leaderboard Results

Model scores sorted by performance.

5 results

1. Gemini 2.5 Pro: 30.3%
2. Sonnet 4: 26.3%
3. GPT-5: ~24%
4. GPT-4o: 8.6%
5. Llama 4 Maverick: ~7.4%

Note: some scores are self-reported by the model's creator and not independently verified; ~ marks approximate figures.

Score Over Time

Performance progression across model generations.

Key Findings

  • Agents perform substantially better on technical tasks (software engineering: ~42% completion) than on social/administrative tasks (HR: ~18%, Finance: ~22%) — a gap that doesn't appear in coding-only benchmarks.

  • The social coordination requirement (messaging colleagues for information) adds meaningful difficulty: tasks requiring inter-agent information gathering fail significantly more often than single-tool tasks.

  • Cost per run ($4 for Gemini 2.5 Pro, <$1 for Gemini 2.0 Flash) makes this benchmark tractable for production evaluation — an important practical consideration.

  • Community leaderboard submissions (TTE-MatrixAgent at 42.9%) substantially exceed the original paper baselines, suggesting specialized agent scaffolding on top of base models matters more than the base model itself for this benchmark.

Controversies & Caveats

Known limitations and criticisms.

The 16 AI colleagues simulate social complexity, but real office work involves much richer social dynamics, organizational politics, and ambiguous communication that the simulated colleagues don't capture.

ownCloud (web file storage) is disproportionately hard for agents compared to CLI tools — potentially making the benchmark less about agent intelligence and more about UI navigation ability.

At $4 per run for frontier models, the benchmark is relatively expensive to evaluate. Community results use cheaper models that may not reflect true capability.

Links