METR (nonprofit, Berkeley-based) · 2025

METR Time Horizon

How long can AI agents work autonomously? The 50%-success time horizon, doubling every 3 months.

Tasks: 228
Public: 228
Primary Metric: 50% Task-Completion Time Horizon (hours of human-equivalent work)
Avg Human Time: 2 minutes to 8+ hours (full task suite)

Overview

METR (Model Evaluation & Threat Research) is a safety-focused nonprofit (formerly ARC Evals) that performs pre-deployment evaluations for Anthropic, OpenAI, and other labs. Their Time Horizon metric asks a simple question with profound implications: 'What duration of task (measured in human expert time) can this AI agent complete with 50% reliability?'

The metric was designed as an interpretable safety measure — tracking whether AI systems are approaching the capability to conduct sustained autonomous actions that could be harmful. A system that can work autonomously for hours becomes capable of executing complex, multi-step plans without human checkpoints.

Construction: METR contracts skilled professionals (typically ~5 years of experience in software engineering, ML, or cybersecurity, recruited from top universities) to complete a suite of self-contained tasks, recording their completion times. AI agents then make 6 independent attempts per task. A logistic curve is fit to predict P(success) as a function of task duration, and the 50% time horizon is the duration at which this curve crosses 0.5.
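
To make the fitting step concrete, here is a minimal sketch of how such a horizon could be computed from run-level results. This is illustrative only, not METR's code: the sample data, the use of log2 task time, and the scikit-learn logistic regression are all assumptions for the example.

    # Illustrative sketch of the 50%-time-horizon fit (not METR's implementation).
    # Each record is one agent attempt: (human_minutes for the task, success 0/1).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    human_minutes = np.array([2, 2, 2, 15, 15, 15, 120, 120, 120, 480, 480, 480], dtype=float)
    success       = np.array([1, 1, 1, 1,  0,  1,  1,   0,   0,   0,   0,   0])

    # Fit P(success) as a function of log2(task duration).
    X = np.log2(human_minutes).reshape(-1, 1)
    clf = LogisticRegression().fit(X, success)

    # The 50% horizon is where the fitted curve crosses 0.5, i.e. where the
    # logit b0 + b1 * log2(t) equals zero.
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]
    horizon_minutes = 2 ** (-b0 / b1)
    print(f"50% time horizon ≈ {horizon_minutes:.1f} human-minutes")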

The headline result: this metric has been growing exponentially, doubling approximately every 89 days since 2024. GPT-3-era agents: ~9 seconds. GPT-4: ~4 minutes. Claude 3.5 Sonnet: ~40 minutes. Claude Opus 4.6: ~14.5 hours. At the current doubling rate, week-long autonomous tasks may be achievable by late 2026.
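
The arithmetic behind such extrapolations is simple, as in this back-of-the-envelope sketch; the two "week" definitions are assumptions for illustration, since the source does not specify one.

    # How long until a week-long horizon, if the ~89-day doubling time holds?
    import math

    current_hours = 14.5   # reported horizon (Feb 2026)
    doubling_days = 89

    for label, target_hours in [("40-hour work week", 40), ("168-hour calendar week", 168)]:
        doublings = math.log2(target_hours / current_hours)
        print(f"{label}: ≈ {doublings * doubling_days:.0f} days from Feb 2026")
    # ≈ 130 days for a 40-hour work week, ≈ 315 days for a full calendar week.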

How It Works

Task setup, inputs, outputs, and evaluation.

Setup: METR's task suite (TH 1.1: 228 tasks from RE-Bench, HCAST, and SWAA) covers self-contained software engineering, ML, and cybersecurity tasks with clear, automatically-verifiable success criteria. Each task has a human expert time estimate from paid contractors.
Input: A well-specified technical task (e.g., 'Implement a function that...', 'Debug this distributed system...', 'Build a tool that...'). All necessary context and files are provided.
Output: Code, results, or artifacts that satisfy the task's automatically-verifiable success criteria.
Evaluation: Automated success/fail per task, with 6 attempts per task per model. A logistic regression is fit for P(success | task duration); the 50% time horizon is the task duration at which the fitted probability equals 0.5. A 95% CI is computed via a 3-level hierarchical bootstrap (task families → tasks → runs).
Metric: 50% Task-Completion Time Horizon (in minutes/hours of human expert work)
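
The confidence-interval step can be sketched in the same spirit. The data layout and function names below are assumptions, and the real analysis is more involved, but the sketch shows the three resampling levels named above.

    # Illustrative 3-level hierarchical bootstrap for the horizon's 95% CI
    # (resampling task families, then tasks, then runs, as described above).
    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_horizon_ci(data, fit_horizon, n_boot=1000):
        """data: {family: {task: list of (human_minutes, success) runs}}.
        fit_horizon: maps a flat list of runs to a 50% horizon estimate
        (e.g. the logistic fit sketched in the Overview)."""
        families = list(data.keys())
        estimates = []
        for _ in range(n_boot):
            runs = []
            # Level 1: resample task families with replacement.
            for fam in rng.choice(families, size=len(families), replace=True):
                tasks = list(data[fam].keys())
                # Level 2: resample tasks within the family.
                for task in rng.choice(tasks, size=len(tasks), replace=True):
                    task_runs = data[fam][task]
                    # Level 3: resample individual runs within the task.
                    idx = rng.integers(0, len(task_runs), size=len(task_runs))
                    runs.extend(task_runs[i] for i in idx)
            estimates.append(fit_horizon(runs))
        return np.percentile(estimates, [2.5, 97.5])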

Example Tasks

Real tasks from this evaluation system.

#1

Short task — ~5 minute human time

~5 minutes

Implement a function solving a specific algorithmic problem (e.g., find all pairs in an integer list summing to a target value). Write unit tests. The task is self-contained with a clear specification and automatically-testable output. (A minimal sketch of what a submission might look like follows this example.)

What the Agent Receives

Function specification, example inputs/outputs, test scaffold.

What It Must Produce

Working function + unit tests that pass the provided test suite.

How Success Is Judged

Automated: run test suite; pass = success, any failure = fail.
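
To make the short-task format concrete, here is a hypothetical submission for the pair-sum example above: a small function plus unit tests of the kind an automated grader could run. The function name and test cases are invented for illustration.

    # Hypothetical submission for a ~5-minute task: all index pairs summing to a target.
    def pairs_summing_to(nums, target):
        """Return sorted index pairs (i, j), i < j, with nums[i] + nums[j] == target."""
        seen = {}      # value -> indices where it has appeared so far
        pairs = []
        for j, value in enumerate(nums):
            for i in seen.get(target - value, []):
                pairs.append((i, j))
            seen.setdefault(value, []).append(j)
        return sorted(pairs)

    def test_pairs_summing_to():
        assert pairs_summing_to([1, 2, 3, 4], 5) == [(0, 3), (1, 2)]
        assert pairs_summing_to([2, 2, 2], 4) == [(0, 1), (0, 2), (1, 2)]
        assert pairs_summing_to([], 7) == []
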
#2

Medium task — ~2 hour human time

~2 hours

Implement a network protocol from scratch based on a provided technical specification. Must handle all specified edge cases, pass the provided test suite, and include documentation.

What the Agent Receives

RFC-style specification document, partial test suite, example client/server stub code.

What It Must Produce

Complete protocol implementation passing all tests, with inline documentation.

How Success Is Judged

Automated test suite execution: all tests must pass within time and memory limits.
#3

Long task — ~8+ hour human time

8+ hours

Debug a failing distributed system. Identify root cause from logs and metrics, implement a fix that handles concurrent failures, verify with load testing, and write a postmortem report.

What the Agent Receives

Distributed system codebase, logs showing failures, metrics dashboard, test infrastructure.

What It Must Produce

Code fix + test results showing fixed behavior + written postmortem.

How Success Is Judged

Automated: system passes concurrent failure tests. Postmortem graded by checklist (root cause identified, fix explained, preventive measures listed).

Results

Model performance on this evaluation.

#  Model      Score
1  Opus 4.6   ~14.5 hours
2  GPT-5      ~6.6 hours
3  Opus 4.5   ~4h 49m
4  o3         ~2 hours
5  Sonnet 4   ~57 minutes
6  GPT-4o     ~10 minutes

Key Findings

  • Exponential growth: 50% time horizon has grown from ~9 seconds (2020) to ~14.5 hours (Feb 2026), a ~5,800x increase in 6 years, doubling approximately every 89 days since 2024.

  • Wide confidence intervals: Claude Opus 4.6's CI is 6 hours to 98 hours — the true value could be anywhere in that range, making precise model comparison difficult near the current frontier.

  • Domain specificity: time horizons for software engineering tasks are 40-100x higher than for visual computer-use or physical tasks. The headline metric reflects SW/ML capability specifically.

  • Task design constraint: the benchmark is approaching saturation — Claude Opus 4.6 completes many 8-hour tasks, requiring METR to add longer tasks to maintain discrimination.

  • Safety implication: a model with a 14.5-hour time horizon is capable of autonomous multi-step operations sustained overnight without human checkpoints — a threshold relevant to AI safety evaluation.

What Makes It Unique

  • Single interpretable number: 'This model can complete 50%-probability tasks that take a human expert X hours' is more meaningful to policymakers than '74.2% on MMLU'.

  • Safety-first design: created by a safety nonprofit, not a capability lab — designed to detect dangerous capability thresholds, not to flatter AI progress.

  • Historical tracking with honest uncertainty: the only benchmark with a rigorous 6-year exponential trend and properly-computed, wide confidence intervals.

  • Doubles as pre-deployment safety evaluation: METR uses the same methodology for official safety assessments submitted to governments and used in lab deployment decisions.

Controversies & Caveats

Very wide confidence intervals (often 4-15x range) make precise model comparison impossible — Claude Opus 4.6's CI is 6h–98h.

SW-only task distribution: the headline time horizon doesn't transfer to other task types. Visual computer-use tasks have 40-100x lower time horizons.

Low-context assumption: tasks are self-contained briefs with all necessary information. Real professional work involves institutional context, relationships, and shifting requirements that would lower performance.

Near-saturation problem: the current suite is becoming too easy for top models, and adding 8+ hour tasks is logistically difficult (requires human contractors to baseline).

The doubling-rate extrapolation is a straight-line fit on a log scale: it assumes the current trend continues unchanged, which may not hold as task types become qualitatively more complex.

Links