τ-bench (tau-bench)
A benchmark for AI customer service agents that must follow policy while solving real customer problems.
What It Tests
τ-bench (tau-bench) evaluates AI agents on customer service tasks that require three capabilities simultaneously: (1) tool use (calling domain-specific APIs), (2) natural conversation with a simulated user, and (3) strict adherence to business policy. It was created by Shunyu Yao and colleagues at Sierra Research and Princeton, and published in 2024.
The setup simulates real enterprise customer service: the agent has access to domain-specific API tools (flight booking system, retail order management), a detailed policy document it must follow, and interacts with a simulated user (another LLM acting as the customer). The agent must satisfy the customer's request while following all applicable policy rules — often a tension-filled constraint.
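The interaction loop described above can be sketched in miniature. This is a hypothetical, simplified illustration: the function names, message shapes, and stop signal are assumptions for clarity, not the benchmark's actual API. The agent alternates between tool calls and replies, the simulated user responds until it is done, and the final database state is what gets graded.

```python
# Illustrative sketch of a τ-bench-style episode loop. All names and
# data shapes here are assumptions, not the benchmark's real interface.
def run_episode(agent_step, user_respond, tools, policy, first_msg, max_turns=20):
    history = [("system", policy), ("user", first_msg)]
    for _ in range(max_turns):
        action = agent_step(history)
        if action["type"] == "tool":
            # Execute the requested API call and feed the result back.
            result = tools[action["name"]](**action["args"])
            history.append(("tool", str(result)))
        else:
            history.append(("assistant", action["text"]))
            reply = user_respond(history)
            if reply == "###STOP###":  # assumed end-of-conversation signal
                break
            history.append(("user", reply))
    return history  # in the benchmark, the resulting DB state is graded

# Toy stand-ins so the loop can be exercised end to end:
db = {"AA-4821": {"date": "2025-06-01"}}
tools = {"booking_modify": lambda booking, date: db[booking].update(date=date) or "ok"}

def agent_step(history):
    if history[-1][0] == "user":
        return {"type": "tool", "name": "booking_modify",
                "args": {"booking": "AA-4821", "date": "2025-06-02"}}
    return {"type": "reply", "text": "Your flight has been moved."}

def user_respond(history):
    return "###STOP###"

run_episode(agent_step, user_respond, tools, "Follow airline policy.", "Change my flight.")
print(db["AA-4821"]["date"])  # the mutated booking date
```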
The benchmark introduced the Pass^k metric: the probability that a task is solved correctly k times out of k consecutive runs. This penalizes flaky agents — a model that succeeds 50% of the time is not production-ready for customer service where failure means a bad customer experience. Pass^1 measures single-attempt success; Pass^8 measures consistent reliability.
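Per task, Pass^k can be estimated the same way as the familiar pass@k estimator, except that all k sampled runs must succeed rather than at least one; the unbiased estimator is C(c, k) / C(n, k) for a task with c successes out of n recorded attempts. A minimal sketch:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimate of Pass^k: the probability that k runs
    drawn without replacement from n recorded attempts (of which c
    succeeded) are ALL successes."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# A task solved 4 times out of 8 attempts:
print(pass_hat_k(8, 4, 1))  # 0.5 — Pass^1 is just the raw success rate
print(pass_hat_k(8, 4, 8))  # 0.0 — 8 consecutive successes are impossible
```

Averaging `pass_hat_k` over all tasks gives the benchmark-level Pass^k, which is why a flaky 50%-reliable agent collapses toward zero as k grows.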
Two primary domains: Airline (change/cancel flights, manage seats, apply upgrades) and Retail (returns, refunds, order modifications, account updates). τ²-bench and τ³-bench (2025) added banking, telecom, voice mode, and fixed 75+ quality issues in the original tasks.
Top models score 84.7% on Retail but only 56% on Airline — policy complexity matters significantly.
Task Anatomy
How a single task is structured.
Example Tasks
Two real examples from the benchmark.
Airline — Flight change with fare difference
Problem / Input
Customer: I need to change my flight from LAX to JFK from tomorrow to the day after tomorrow. My booking number is AA-4821.
Agent has: flight_search API, booking_modify API, payment_process API
Policy: Basic fare tickets can be changed up to 24h before departure for a $50 fee plus any fare difference. The fare difference is waived for Gold members.
A successful resolution requires getting all policy details right: the base change fee, the fare-difference calculation, and the member-status check. Missing any one of these fails the task.
Retail — Return policy conflict
Problem / Input
The agent must recognize that two policies apply (standard 30-day vs. defective item 90-day) and correctly apply the exception path rather than refusing the return outright.
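The precedence logic here is small but easy to get wrong: the agent must check the exception before falling back to the default window. A hedged sketch, using the day counts from the example (everything else is illustrative):

```python
# Sketch of the policy-precedence check: the defective-item exception
# (90 days) overrides the standard 30-day return window.
def return_allowed(days_since_purchase: int, is_defective: bool) -> bool:
    window = 90 if is_defective else 30   # check the exception path first
    return days_since_purchase <= window

print(return_allowed(45, is_defective=False))  # False — past the 30-day window
print(return_allowed(45, is_defective=True))   # True — 90-day exception applies
```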
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | Sonnet 4 | 84.7% |
| 2 | GPT-5 | ~79% |
| 3 | Opus 4.5 | ~56% |
| 4 | GPT-5 | ~54% |
| 5 | Gemini 2.5 Pro | ~48% |
| 6 | GPT-4o | ~25% |
Note: some scores are self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
Pass^k metric insight: GPT-4o scores below 25% on Pass^8 for Retail, meaning that in more than three out of four 8-attempt sequences it fails at least once. A customer service bot that unreliable is not production-ready.
Airline domain is significantly harder than Retail (56% vs 84.7% for best models) — flight booking involves more complex multi-step policy rules with financial and legal implications.
The benchmark revealed that models which score well on static benchmarks often fail when required to maintain policy compliance across multiple turns of dynamic conversation.
Tool use errors are the primary failure mode: models either call the wrong API, pass incorrect parameters, or fail to call necessary APIs before making commitments to customers.
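One common mitigation for the wrong-parameter failure mode is to validate a model-proposed tool call against a declared schema before executing it. The schema shape and tool names below are assumptions for illustration, not part of the benchmark:

```python
# Illustrative pre-execution check for model-proposed tool calls.
# Tool names and parameter schemas here are hypothetical.
SCHEMAS = {
    "booking_modify": {"booking": str, "new_date": str},
    "payment_process": {"booking": str, "amount": float},
}

def validate_call(name, args):
    """Return a list of problems; an empty list means the call is well-formed."""
    if name not in SCHEMAS:
        return [f"unknown tool: {name}"]
    schema = SCHEMAS[name]
    problems = [f"missing parameter: {p}" for p in schema if p not in args]
    problems += [f"unexpected parameter: {p}" for p in args if p not in schema]
    problems += [f"wrong type for {p}: expected {t.__name__}"
                 for p, t in schema.items()
                 if p in args and not isinstance(args[p], t)]
    return problems

print(validate_call("booking_modify", {"booking": "AA-4821"}))
# ['missing parameter: new_date']
```

Rejecting malformed calls before they reach the backend at least converts silent wrong-commitment failures into recoverable retries.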
Variants & Related
τ²-bench
Voice mode support, fixed 75+ quality issues from the original. More reliable evaluation.
τ³-bench
Adds Banking and Telecom domains. The most comprehensive version.
Controversies & Caveats
Known limitations and criticisms.
The simulated user (another LLM) may behave differently from real human customers, particularly in edge cases and escalations.
Policy documents are synthetic — real enterprise policies are more complex, ambiguous, and full of special cases that aren't captured.
Pass^k penalizes even occasional failures, which may be too strict — real enterprise deployments typically have human escalation paths for edge cases.