METR Time Horizon
How long can AI agents work autonomously? The 50%-success time horizon, doubling every 3 months.
Overview
METR (Model Evaluation & Threat Research) is a safety-focused nonprofit (formerly ARC Evals) that performs pre-deployment evaluations for Anthropic, OpenAI, and other labs. Their Time Horizon metric asks a simple question with profound implications: 'What duration of task (measured in human expert time) can this AI agent complete with 50% reliability?'
The metric was designed as an interpretable safety measure — tracking whether AI systems are approaching the capability to conduct sustained autonomous actions that could be harmful. A system that can work autonomously for hours becomes capable of executing complex, multi-step plans without human checkpoints.
Construction: METR contracts skilled professionals (roughly five years of experience in software engineering, ML, or cybersecurity, typically recruited from top universities) to complete a suite of self-contained tasks and records how long each takes them. AI agents then make 6 independent attempts per task. A logistic curve is fitted to predict P(success) as a function of task duration, and the 50% time horizon is the task length at which that curve crosses 0.5.
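A minimal sketch of that fitting step, using invented attempt outcomes and scikit-learn (METR's published fits work over the logarithm of task duration, which is what the sketch assumes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data for one model: human completion time per task (minutes)
# and whether the agent's attempt succeeded (1) or failed (0).
task_minutes   = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
attempt_result = np.array([1, 1, 1, 1,  1,  0,  1,   0,   0,   0])

# Fit P(success) as a logistic function of log2(task duration).
# (sklearn's default fit is mildly regularized; fine for a sketch.)
X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, attempt_result)

# The 50% time horizon is the duration where the fitted curve crosses 0.5,
# i.e. where coef * log2(t) + intercept = 0.
log2_horizon = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"50% time horizon ≈ {2 ** log2_horizon:.0f} minutes")
```

Real runs aggregate the 6 attempts per task rather than a single outcome per task, but the crossing-point logic is the same.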
The headline result: this metric has been growing exponentially, doubling approximately every 89 days since 2024. GPT-3-era agents: ~9 seconds. GPT-4: ~4 minutes. Claude 3.5 Sonnet: ~40 minutes. Claude Opus 4.6: ~14.5 hours. At the current doubling rate, week-long autonomous tasks may be achievable by late 2026.
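Taking those figures at face value, the extrapolation is just compound doubling; a back-of-the-envelope sketch (the Feb 2026 anchor date and the reading of "week-long" as a 168-hour calendar week are assumptions):

```python
import math
from datetime import date, timedelta

horizon_hours = 14.5   # Claude Opus 4.6, per the trend described above
doubling_days = 89     # observed doubling time since 2024
target_hours = 168.0   # one calendar week (assumed meaning of "week-long")

doublings = math.log2(target_hours / horizon_hours)   # ~3.5 doublings
eta = date(2026, 2, 1) + timedelta(days=doublings * doubling_days)
print(f"~{doublings:.1f} doublings, reached around {eta:%B %Y}")
```

Reading "week-long" as a 40-hour work week instead lands around mid-2026, so the date is sensitive to that interpretation.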
How It Works
Task setup, inputs, outputs, and evaluation.
Example Tasks
Real tasks from this evaluation system.
Short task — ~5 minutes of human time
Implement a function solving a specific algorithmic problem (e.g., find all pairs in an integer list summing to a target value). Write unit tests. The task is self-contained with a clear specification and automatically-testable output.
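METR's actual task specifications are not reproduced here; purely as illustration, a passing solution to a task like the one above might look like the following (the function name, the choice to return index pairs, and the tests are assumptions):

```python
def pairs_summing_to(nums: list[int], target: int) -> list[tuple[int, int]]:
    """Return all index pairs (i, j) with i < j where nums[i] + nums[j] == target."""
    seen: dict[int, list[int]] = {}      # value -> indices where it has appeared
    pairs: list[tuple[int, int]] = []
    for j, value in enumerate(nums):
        for i in seen.get(target - value, []):
            pairs.append((i, j))
        seen.setdefault(value, []).append(j)
    return pairs

# The task also asks for unit tests; a few representative ones:
assert pairs_summing_to([1, 2, 3, 4], 5) == [(1, 2), (0, 3)]
assert pairs_summing_to([2, 2, 2], 4) == [(0, 1), (0, 2), (1, 2)]
assert pairs_summing_to([], 10) == []
```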
Medium task — ~2 hours of human time
Implement a network protocol from scratch based on a provided technical specification. Must handle all specified edge cases, pass the provided test suite, and include documentation.
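The specifications METR uses are not public. As a rough stand-in, the sketch below implements one small piece of the kind of work involved: a length-prefixed message framer with the edge cases (partial input, oversized frames) a spec would typically call out. The frame format and size limit are invented:

```python
import struct

MAX_FRAME = 64 * 1024  # hypothetical limit a spec might impose

def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 4-byte big-endian integer."""
    if len(payload) > MAX_FRAME:
        raise ValueError("payload exceeds maximum frame size")
    return struct.pack(">I", len(payload)) + payload

def decode_frames(buffer: bytes) -> tuple[list[bytes], bytes]:
    """Decode complete frames; return (frames, leftover bytes awaiting more data)."""
    frames = []
    while len(buffer) >= 4:
        (length,) = struct.unpack(">I", buffer[:4])
        if length > MAX_FRAME:
            raise ValueError("corrupt or oversized frame")
        if len(buffer) < 4 + length:
            break                      # partial frame: wait for more bytes
        frames.append(buffer[4:4 + length])
        buffer = buffer[4 + length:]
    return frames, buffer

# Edge cases: back-to-back frames followed by a trailing partial frame.
stream = encode_frame(b"hello") + encode_frame(b"") + b"\x00\x00"
frames, rest = decode_frames(stream)
assert frames == [b"hello", b""] and rest == b"\x00\x00"
```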
Long task — ~8+ hours of human time
Debug a failing distributed system. Identify root cause from logs and metrics, implement a fix that handles concurrent failures, verify with load testing, and write a postmortem report.
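These debugging environments are not published either; purely as illustration, a first step an agent might take is to scan the interleaved service logs for the earliest error, since later errors are often just downstream cascade. The log format and service names below are invented:

```python
# Invented log excerpt: the kind of interleaved output an agent has to triage.
logs = """\
2026-02-03T10:00:01Z api      INFO  request served in 12ms
2026-02-03T10:00:02Z queue    ERROR broker connection reset
2026-02-03T10:00:03Z worker   ERROR task lease lost, retrying
2026-02-03T10:00:04Z api      ERROR upstream timeout calling worker
"""

errors = []
for line in logs.splitlines():
    timestamp, service, level, *message = line.split()
    if level == "ERROR":
        errors.append((timestamp, service, " ".join(message)))

# ISO-8601 timestamps sort lexicographically, so the minimum is the earliest
# error; that service is the first root-cause candidate to investigate.
first_ts, service, message = min(errors)
print(f"earliest failure: {service} at {first_ts}: {message}")
```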
Results
Model performance on this evaluation.
| # | Model | 50% Time Horizon |
|---|---|---|
| 1 | Opus 4.6 | ~14.5 hours |
| 2 | GPT-5 | ~6.6 hours |
| 3 | Opus 4.5 | ~4.8 hours |
| 4 | o3 | ~2 hours |
| 5 | Sonnet 4 | ~57 minutes |
| 6 | GPT-4o | ~10 minutes |
Key Findings
Exponential growth: 50% time horizon has grown from ~9 seconds (2020) to ~14.5 hours (Feb 2026), a ~5,800x increase in 6 years, doubling approximately every 89 days since 2024.
Wide confidence intervals: Claude Opus 4.6's CI spans 6 to 98 hours; the true value could be anywhere in that range, which makes precise model comparison difficult near the current frontier (the toy bootstrap after this list illustrates why).
Domain specificity: time horizons for software engineering tasks are 40-100x higher than for visual computer-use or physical tasks. The headline metric reflects SW/ML capability specifically.
Task design constraint: the benchmark is approaching saturation — Claude Opus 4.6 completes many 8-hour tasks, requiring METR to add longer tasks to maintain discrimination.
Safety implication: a model with a 14.5-hour time horizon is capable of autonomous multi-step operations sustained overnight without human checkpoints — a threshold relevant to AI safety evaluation.
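The interval width is largely a consequence of how few long tasks there are to fit against. A toy task-level bootstrap (invented data, simple percentile intervals; METR's own procedure is more involved) shows how resampling a small suite makes the fitted horizon swing by multiples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented suite: many short tasks, only a handful of long ones.
task_minutes = np.array([2, 4, 8, 15, 15, 30, 30, 60, 120, 240, 480, 960], dtype=float)
outcomes     = np.array([1, 1, 1,  1,  1,  1,  0,  1,   0,   1,   0,   0])

def fit_log2_horizon(minutes, results):
    # Same logistic-over-log-duration fit sketched in the Overview.
    clf = LogisticRegression().fit(np.log2(minutes).reshape(-1, 1), results)
    return -clf.intercept_[0] / clf.coef_[0, 0]

# Task-level bootstrap: resample the suite with replacement and refit each time.
log_estimates = []
for _ in range(1000):
    idx = rng.integers(0, len(task_minutes), size=len(task_minutes))
    if outcomes[idx].min() == outcomes[idx].max():
        continue   # a resample with only successes or only failures cannot be fit
    log_estimates.append(fit_log2_horizon(task_minutes[idx], outcomes[idx]))

point = 2 ** fit_log2_horizon(task_minutes, outcomes)
low, high = 2 ** np.percentile(log_estimates, [2.5, 97.5])
print(f"point estimate ~{point:.0f} min, 95% CI roughly {low:.0f}-{high:.0f} min")
```

In this toy setup the handful of long tasks carries most of the information about where the curve crosses 0.5, which is what drives the spread.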
What Makes It Unique
- ✓ Single interpretable number: 'This model can complete tasks that take a human expert X hours, with 50% reliability' is more meaningful to policymakers than '74.2% on MMLU'.
- ✓ Safety-first design: created by a safety nonprofit, not a capability lab; designed to detect dangerous capability thresholds, not to flatter AI progress.
- ✓ Historical tracking with honest uncertainty: the only benchmark with a rigorous 6-year exponential trend and properly computed, wide confidence intervals.
- ✓ Doubles as a pre-deployment safety evaluation: METR uses the same methodology for official safety assessments submitted to governments and used in lab deployment decisions.
Controversies & Caveats
Very wide confidence intervals (often 4-15x range) make precise model comparison impossible — Claude Opus 4.6's CI is 6h–98h.
SW-only task distribution: the headline time horizon doesn't transfer to other task types. Visual computer-use tasks have 40-100x lower time horizons.
Low-context assumption: tasks are self-contained briefs with all necessary information. Real professional work involves institutional context, relationships, and shifting requirements that would lower performance.
Near-saturation problem: the current suite is becoming too easy for top models, and adding 8+ hour tasks is logistically difficult (requires human contractors to baseline).
The doubling-rate extrapolation is a straight line on a log scale: it assumes the current trend continues unchanged, which may not hold as task types become qualitatively more complex.