Remote Labor Index (RLI)
Real Upwork freelance projects: $143,991 of actual paid work, 2.5% maximum automation.
Overview
The Remote Labor Index (RLI) is the most rigorous real-world work benchmark to date: it is built from actual projects that human freelancers were hired and paid to complete on Upwork, and each project includes the original client brief and the human-produced, gold-standard deliverable. It was created jointly by the Center for AI Safety (CAIS) and Scale AI and published on October 30, 2025.
Unlike benchmarks that use expert-constructed tasks, RLI sources its tasks entirely from the market: real client briefs, real freelancer submissions, real payment amounts. 358 verified freelancers provided samples of completed work across 43 eligible Upwork job categories. The resulting 240 projects represent $143,991 in total economic value and over 6,000 hours of human labor.
The benchmark evaluates AI on complete end-to-end project delivery: agents must produce finished, client-ready deliverables (game builds, architectural drawings, video animations, brand logos) — not isolated subtasks. Evaluation is performed manually by trained expert graders asking: 'Would a reasonable client accept this in place of the human gold standard?'
The headline result is striking: the best-performing agent at publication, Manus, automated only 2.5% of projects to client-acceptable quality, even though frontier models score 80%+ on SWE-bench and 67% on GAIA. The gap reveals the fundamental difficulty of real end-to-end project work: projects average 28.9 hours of human labor and demand multimodal outputs (CAD, video, 3D, audio) that current AI systems cannot reliably produce.
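As a quick sanity check, the headline figures above can be re-derived with simple arithmetic. The sketch below uses only the numbers already quoted in this overview; treating the 2.5% rate as exactly 6 of 240 accepted projects is an inference, not a figure from the RLI release.

```python
# Sanity-check of the RLI headline figures quoted above (simple arithmetic only).
total_projects = 240
total_value_usd = 143_991        # what clients actually paid, summed over projects
avg_hours_per_project = 28.9     # reported mean human completion time

avg_value_usd = total_value_usd / total_projects               # roughly $600 per project
implied_total_hours = avg_hours_per_project * total_projects   # ~6,936 h, matches "over 6,000 hours"

best_agent_rate = 0.025          # 2.5% automation rate (top agent at publication)
implied_accepted = round(best_agent_rate * total_projects)     # 6 projects judged client-acceptable

swe_bench_rate = 0.80            # frontier models' SWE-bench score cited above
gap = swe_bench_rate / best_agent_rate                         # the 32x gap cited in Key Findings

print(f"avg project value = ${avg_value_usd:,.0f}")
print(f"implied total labor = {implied_total_hours:,.0f} hours")
print(f"2.5% of {total_projects} projects = {implied_accepted} accepted")
print(f"SWE-bench vs RLI gap = {gap:.0f}x")
```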
How It Works
Task setup, inputs, outputs, and evaluation.
Example Tasks
Real tasks from this evaluation system.
Game Development — Unity Mini Arcade Game
Design and implement a mini arcade-style game in Unity with 3 levels, score tracking, and the provided art assets.
Architectural Design — Residential Floor Plans
Create professional floor plan drawings for a 2,500 sq ft residential home with specific room requirements.
Video Animation — Explainer Video
Create a 60-second explainer animation for a SaaS product with voiceover sync and brand consistency.
Results
Model performance on this evaluation.
| # | Model | Automation Rate |
|---|---|---|
| 1 | Opus 4.6 | 3.75% |
| 2 | Opus 4.5 | ~3.75% (tied) |
| 3 | GPT-5 | 2.5% |
| 4 | Grok 4 | 2.1% |
| 5 | Gemini 2.5 Pro | 1.25% |
| 6 | GPT-4o | ~0.5% |
Key Findings
The most sobering data point in recent AI evaluation: models that score 80%+ on SWE-bench complete only 2.5% of real freelance projects to client-acceptable quality, a roughly 32x gap between structured coding benchmarks and real end-to-end project delivery.
The primary failure modes are: (1) corrupt or wrong-format output files, (2) incomplete deliverables that are missing required components, (3) quality gaps, where the output is technically present but visually or qualitatively below professional standard, (4) inconsistency across multi-file deliverables, and (5) misunderstanding of the client's creative intent.
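The first two failure modes are mechanical enough that a simple automated pre-check could flag them before any expert grading. The sketch below is a hypothetical pre-check, not part of RLI's actual (manual) grading pipeline; the required-file manifest, file names, and magic-byte signatures are illustrative assumptions.

```python
from pathlib import Path

# Hypothetical pre-check for two of the failure modes listed above:
# (1) corrupt or wrong-format files and (2) missing required components.
# The manifest and signatures are illustrative; RLI grading itself is manual.
MAGIC_BYTES = {
    ".wav": b"RIFF",
    ".zip": b"PK\x03\x04",
    ".pdf": b"%PDF",
}

def precheck(deliverable_dir: str, required: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the basic checks pass."""
    problems = []
    root = Path(deliverable_dir)
    for name in required:
        path = root / name
        if not path.exists():
            problems.append(f"missing required file: {name}")
            continue
        if path.stat().st_size == 0:
            problems.append(f"empty file: {name}")
            continue
        sig = MAGIC_BYTES.get(path.suffix.lower())
        if sig and not path.read_bytes().startswith(sig):
            problems.append(f"wrong or corrupt format: {name}")
    return problems

# Example with a made-up project layout:
# precheck("submission/", ["game_build.zip", "gameplay_demo.wav", "readme.pdf"])
```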
Most RLI tasks require specialized software (Unity, AutoCAD, Adobe Creative Suite, DAWs) that current AI agents cannot reliably operate — this is a tool access problem as much as a capability problem.
Elo scores show steady relative improvement even when absolute automation rates are near zero — suggesting the benchmark is sensitive enough to track progress long before agents can complete full projects.
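For reference, the sketch below is the standard Elo update applied to pairwise preferences between two agents' deliverables on the same project; the K-factor and starting ratings are conventional defaults, not values from RLI, and the exact rating procedure RLI uses may differ.

```python
# Standard Elo update from pairwise comparisons of two agents' deliverables.
# K-factor and starting ratings are conventional defaults, not RLI values.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that agent A's deliverable is preferred over agent B's."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated ratings after a single pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Toy run: agent A's outputs are preferred on 3 of 4 projects.
ra, rb = 1000.0, 1000.0
for a_won in [True, True, False, True]:
    ra, rb = elo_update(ra, rb, a_won)
print(round(ra), round(rb))
```

Even when neither agent's deliverable would be accepted by a client, preference comparisons like these still separate the ratings, which is why relative progress remains visible at near-zero absolute automation rates.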
Average human completion time of 28.9 hours (vs 2 hours for APEX-Agents, 6.7 hours for GDPval) explains much of the automation gap — longer, more complex projects with multimodal outputs have far more failure modes.
What Makes It Unique
- ✓ True economic grounding: every project has a real dollar value set by the professional who actually did the work, not a synthetic cost estimate (one natural consequence, value-weighted automation rates, is sketched after this list).
- ✓ Multi-format, multi-tool deliverables: projects span .blend, .mp4, .dwg, .ai, .exe, and .wav files, with far more output-format diversity than any other benchmark.
- ✓ Client-acceptance standard: 'Would a paying client accept this?' is a fundamentally higher bar than 'Is it technically correct?'
- ✓ Complete end-to-end scope: the AI must handle the entire project from brief to final deliverable, not just isolated subtasks.
- ✓ Safety-organization perspective: created by CAIS (Center for AI Safety) alongside Scale AI, and designed partly to temper inflated claims about AI automation readiness.
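Because every project carries the price a client actually paid, results can be aggregated by dollar value as well as by project count. The sketch below shows both aggregations over made-up per-project records; the project names, values, and outcomes are illustrative, not RLI data.

```python
# Two ways to aggregate results when every project has a real price tag.
# The records below are made-up examples; only the aggregation logic matters.
results = [
    {"project": "unity_arcade_game", "value_usd": 1200.0, "accepted": False},
    {"project": "residential_plans", "value_usd":  850.0, "accepted": False},
    {"project": "explainer_video",   "value_usd":  600.0, "accepted": True},
]

count_rate = sum(r["accepted"] for r in results) / len(results)

total_value = sum(r["value_usd"] for r in results)
value_rate = sum(r["value_usd"] for r in results if r["accepted"]) / total_value

print(f"automation rate by project count: {count_rate:.1%}")
print(f"automation rate by dollar value:  {value_rate:.1%}")
```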
Controversies & Caveats
Only 10 of 240 projects are public — severely limiting independent reproducibility.
Only 23 of 64 Upwork job categories are represented, and the included categories may systematically differ in how automatable they are from the excluded ones (e.g., back-end development, which might be more amenable to AI).
Evaluation subjectivity: 'Would a client accept this?' depends heavily on evaluator standards. The same output might be acceptable to some clients and not others.
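One standard way to quantify this subjectivity is to have two graders judge the same deliverables independently and compute their agreement, for example with Cohen's kappa as in the sketch below; the accept/reject labels shown are hypothetical, not RLI data.

```python
# Cohen's kappa for two graders' accept/reject labels on the same deliverables.
# The label lists are hypothetical; kappa near 0 would suggest verdicts are
# dominated by individual grader standards rather than the deliverable itself.

def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n          # grader A's accept rate
    p_b1 = sum(labels_b) / n          # grader B's accept rate
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

grader_1 = [0, 0, 1, 0, 0, 1, 0, 0]   # 1 = "a client would accept this"
grader_2 = [0, 1, 1, 0, 0, 0, 0, 0]
print(f"kappa = {cohens_kappa(grader_1, grader_2):.2f}")
```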
Scale AI is a major AI data and evaluation company that competes commercially — potential conflict of interest in benchmark design and execution.