benchmark.darvinyi.com

τ-bench (tau-bench)

AI customer service agents that must follow policy while solving real customer problems.

Tasks: 200
Year: 2024
Creator: Shunyu Yao, et al.
Metric: Pass@1 and Pass^k (k=1,8) per domain

What It Tests

τ-bench (tau-bench) evaluates AI agents on customer service tasks that require three capabilities simultaneously: (1) tool use (calling domain-specific APIs), (2) natural conversation with a simulated user, and (3) strict adherence to business policy. It was created by Shunyu Yao and colleagues at Sierra Research and Princeton and published in 2024.

The setup simulates real enterprise customer service: the agent has access to domain-specific API tools (flight booking system, retail order management), a detailed policy document it must follow, and interacts with a simulated user (another LLM acting as the customer). The agent must satisfy the customer's request while following all applicable policy rules — often a tension-filled constraint.
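The interaction loop described above can be sketched roughly as follows. The class names, message format, and stop token here are illustrative assumptions, not the benchmark's actual interfaces:

```python
# Hypothetical sketch of a tau-bench-style episode: an agent alternates
# between replying to a simulated user and calling domain APIs, with the
# policy document injected as the system prompt. All names are invented.

def run_episode(agent, user_sim, tools, policy, max_turns=30):
    """Run one conversation until the simulated user ends it or the
    turn budget is exhausted; return the full message history."""
    history = [{"role": "system", "content": policy}]
    history.append({"role": "user", "content": user_sim.opening_message()})
    for _ in range(max_turns):
        action = agent.act(history, tools)           # text reply or tool call
        if action["type"] == "tool_call":
            result = tools[action["name"]](**action["args"])
            history.append({"role": "tool", "content": result})
        else:
            history.append({"role": "assistant", "content": action["text"]})
            reply = user_sim.respond(history)        # another LLM plays the customer
            if reply == "###STOP###":                # user signals conversation over
                break
            history.append({"role": "user", "content": reply})
    return history
```

The key property this loop captures is that the agent never sees a static prompt: every turn depends on what the simulated customer said last and on the results of its own earlier API calls.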

The benchmark introduced the Pass^k metric: the probability that a task is solved correctly k times out of k consecutive runs. This penalizes flaky agents — a model that succeeds 50% of the time is not production-ready for customer service where failure means a bad customer experience. Pass^1 measures single-attempt success; Pass^8 measures consistent reliability.
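Pass^k can be estimated without bias from n recorded runs per task, analogously to the familiar pass@k estimator but requiring that all k sampled runs succeed. A minimal sketch (function names are mine):

```python
from math import comb

def pass_hat_k(successes, n, k):
    """Unbiased per-task estimator of Pass^k: the probability that k runs
    drawn without replacement from n recorded runs (of which `successes`
    succeeded) are all successes."""
    return comb(successes, k) / comb(n, k)

def benchmark_pass_hat_k(results, n, k):
    """Average Pass^k over tasks; `results` maps task id -> success count
    out of n runs."""
    return sum(pass_hat_k(c, n, k) for c in results.values()) / len(results)
```

A task solved 4 times out of 8 runs scores Pass^1 = 0.5 but Pass^8 = 0: one recorded failure is enough to rule out "succeeds 8 times in a row in this sample", which is exactly the flakiness the metric is designed to expose.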

Two primary domains: Airline (change/cancel flights, manage seats, apply upgrades) and Retail (returns, refunds, order modifications, account updates). The successor versions τ²-bench and τ³-bench (2025) added banking and telecom domains and voice mode, and fixed 75+ quality issues in the original tasks.

Top models score 84.7% on Retail but only 56% on Airline — policy complexity matters significantly.

Task Anatomy

How a single task is structured.

Input: A customer task description (from the simulated user's perspective) plus the domain-specific API documentation (tools the agent can call) and a policy document listing all applicable business rules.
Output: A multi-turn dialogue in which the agent calls APIs, asks clarifying questions, and ultimately resolves the customer's issue while correctly following all policy constraints.
Evaluation: Binary: did the agent (1) complete the customer's request, (2) call the correct APIs with correct parameters, and (3) follow all applicable policy rules? All three must be true for a task to count as solved. Pass^k = probability of success on all k consecutive runs.
Metric: Pass@1 and Pass^k (k=1,8) per domain
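The all-or-nothing grading rule can be sketched as a conjunction of three checks. The field names below are assumptions for illustration; the actual harness compares the final environment state and the actions taken against gold references:

```python
# Illustrative binary grader: a task counts as solved only if every
# check passes. `episode` and `gold` are plain dicts here; the real
# evaluation compares database state and required API actions.

def grade(episode, gold):
    request_done = episode["final_db_state"] == gold["expected_db_state"]
    calls_correct = episode["api_calls"] == gold["required_api_calls"]
    policy_ok = not episode["policy_violations"]
    return request_done and calls_correct and policy_ok
```

Note that a friendly, fluent conversation that ends in the wrong database state scores exactly the same as no conversation at all.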

Example Tasks

2 real examples from the benchmark.

#1

Airline — Flight change with fare difference

Medium

Problem / Input

Customer: I need to change my flight from LAX to JFK from tomorrow to the day after tomorrow. My booking number is AA-4821.

Agent has: flight_search API, booking_modify API, payment_process API
Policy: Basic fare tickets can be changed up to 24h before departure for a $50 fee plus any fare difference. Fare difference is waived for Gold members.
Answer: Booking modified with correct charges applied per policy

A successful resolution requires getting all policy details right: base change fee, fare difference calculation, and member status check. Missing any one of these fails the task.
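The charge computation in this example reduces to a few lines. This sketch assumes the policy exactly as quoted (flat $50 fee plus any positive fare difference, with the difference waived for Gold members); the function signature and the no-refund-if-cheaper assumption are mine:

```python
def change_charges(old_fare: float, new_fare: float, member_tier: str) -> float:
    """Total charge for a Basic-fare flight change: $50 flat fee plus any
    fare increase. The fare difference (but not the fee) is waived for
    Gold members. Assumes no refund when the new fare is cheaper."""
    CHANGE_FEE = 50.0
    fare_diff = max(0.0, new_fare - old_fare)
    if member_tier == "gold":
        fare_diff = 0.0          # Gold: fare difference waived, fee still applies
    return CHANGE_FEE + fare_diff
```

An agent that skips the member-status check quotes $130 instead of $50 to a Gold member facing an $80 fare increase, and the task fails even though the booking itself was modified correctly.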

#2

Retail — Return policy conflict

Hard

Problem / Input

Customer: I want to return my laptop that I bought 45 days ago. The screen cracked but I didn't do it — it was a defect.

Agent has: order_lookup API, return_create API, warranty_check API
Policy: Standard returns accepted within 30 days. Defective items can be returned within 90 days of purchase. Customer must provide defect description.
Answer: Return authorized under defective item policy with proper documentation

The agent must recognize that two policies apply (standard 30-day vs. defective item 90-day) and correctly apply the exception path rather than refusing the return outright.
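The exception-path logic is simple once stated explicitly. A sketch under the quoted policy (the thresholds come from the example; the signature is mine):

```python
def return_eligible(days_since_purchase: int, is_defective: bool) -> bool:
    """Two overlapping rules from the example policy: standard returns
    within 30 days, defective items within 90 days. The agent must test
    both before refusing."""
    if days_since_purchase <= 30:
        return True                                  # standard window
    if is_defective and days_since_purchase <= 90:
        return True                                  # defective-item exception
    return False
```

The failure mode this task probes is an agent that checks only the first rule: at 45 days the standard window has closed, so a naive policy reader refuses a return the policy actually permits.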

Leaderboard Results

Model scores sorted by performance.

6 results

#   Model            Score
1   Sonnet 4         84.7%
2   GPT-5            ~79%
3   Opus 4.5         ~56%
4   GPT-5            ~54%
5   Gemini 2.5 Pro   ~48%
6   GPT-4o           ~25%

Some scores are self-reported by the model's creator, not independently verified.

Score Over Time

Performance progression across model generations.

Key Findings

  • Pass^k metric insight: GPT-4o has <25% Pass^8 on Retail, meaning it fails at least once in most 8-attempt sequences. A customer service bot that fails that frequently is not production-ready.

  • Airline domain is significantly harder than Retail (56% vs 84.7% for best models) — flight booking involves more complex multi-step policy rules with financial and legal implications.

  • The benchmark revealed that models which score well on static benchmarks often fail when required to maintain policy compliance across multiple turns of dynamic conversation.

  • Tool use errors are the primary failure mode: models either call the wrong API, pass incorrect parameters, or fail to call necessary APIs before making commitments to customers.
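The reliability gap behind the first finding is easy to quantify: if attempts are independent with per-attempt success rate p, then Pass^k = p**k, so single-attempt scores overstate production readiness:

```python
# Pass^k under independent attempts: even strong per-attempt success
# rates collapse when 8 consecutive successes are required.
for p in (0.95, 0.85, 0.60):
    print(f"per-attempt {p:.0%}  ->  Pass^8 {p ** 8:.1%}")
```

An agent that succeeds 85% of the time on a single attempt clears all 8 consecutive runs only about 27% of the time, which is why Pass^1 and Pass^8 rankings can diverge sharply.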

Variants & Related

τ²-bench

Active

Adds voice mode support and fixes 75+ quality issues from the original tasks, giving more reliable evaluation.

τ³-bench

Active

Adds Banking and Telecom domains. The most comprehensive version.

Controversies & Caveats

Known limitations and criticisms.

The simulated user (another LLM) may behave differently from real human customers, particularly in edge cases and escalations.

Policy documents are synthetic — real enterprise policies are more complex, ambiguous, and full of special cases that aren't captured.

Pass^k penalizes even occasional failures, which may be too strict — real enterprise deployments typically have human escalation paths for edge cases.

Links