tau2bench
Description
tau2bench is an OpenReward environment that implements τ²-Bench, a benchmark for evaluating conversational agents in dual-control customer service settings. The environment spans three realistic domains (airline, retail, and telecom) in which the agent must follow domain-specific policies while using tools to resolve simulated customer issues.
In a dual-control setting, both the agent and the user can modify the shared environment state. The agent uses domain tools (e.g., booking flights, modifying orders, managing telecom lines), while a simulated user drives the conversation with goals drawn from structured task scenarios. In the telecom domain, the user simulator also has access to its own tools, creating a richer coordination challenge.
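To make the dual-control idea concrete, here is a minimal Python sketch in which an agent-side tool and a user-side tool read and write the same shared state. All names are illustrative; this is not the tau2bench API.

```python
# Illustrative sketch of dual control: both the agent and the simulated
# user can mutate the same environment state. All names here are
# hypothetical -- they are not the tau2bench API.

shared_state = {"line_status": "suspended", "data_remaining_gb": 0.0}

def agent_resume_line(state: dict) -> None:
    """Agent-side tool: the agent re-enables the customer's line."""
    state["line_status"] = "active"

def user_run_speed_test(state: dict) -> str:
    """User-side tool (telecom only): the simulated user checks the line,
    observing state the agent may have just changed."""
    return "ok" if state["line_status"] == "active" else "no service"

agent_resume_line(shared_state)
assert user_run_speed_test(shared_state) == "ok"
```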
Capabilities
- Multi-turn conversational customer service across airline, retail, and telecom domains
- Following complex domain-specific policy documents
- Coordinating with simulated users who actively modify shared state
- Using domain-specific tools for booking, cancellation, account management, and more
Compute Requirements
No sandbox is required for this environment.
Tasks
tau2bench contains three sub-environments, one per customer service domain. The base split is the recommended split for benchmarking: it matches the original τ²-Bench evaluation methodology. The train split comes from the upstream repository and is intended for RL training.
Airline (Tau2AirlineEnvironment)
50 total tasks covering flight booking, reservation management, and travel support scenarios.
Retail (Tau2RetailEnvironment)
114 total tasks covering e-commerce order management, returns, exchanges, and customer account operations.
Telecom (Tau2TelecomEnvironment)
2,285 total tasks generated compositionally from 15 subtask groups, covering line management, billing, data usage, and roaming scenarios.
Each task defines a user scenario with specific goals and an evaluation rubric. The agent is given the domain policy and an initial customer message, then must converse with the simulated user and take appropriate tool actions to resolve the issue.
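As a rough illustration of what a task bundles together, the sketch below models a task record as a dataclass. The field names are assumptions for intuition only, not the platform's actual schema.

```python
# Hypothetical shape of a tau2bench task record, for intuition only;
# field names are assumptions, not the platform's actual schema.
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    task_id: str
    domain: str              # "airline" | "retail" | "telecom"
    user_scenario: str       # goals driving the simulated user
    initial_message: str     # first customer message the agent sees
    active_dimensions: list[str] = field(default_factory=list)

example = TaskSpec(
    task_id="retail_042",
    domain="retail",
    user_scenario="Customer wants to exchange a delivered item for a larger size.",
    initial_message="Hi, I ordered a jacket but it doesn't fit.",
    active_dimensions=["ACTION", "DB", "COMMUNICATE"],
)
```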
Reward Structure
Reward is evaluated at the end of each conversation. The environment uses a multiplicative aggregation of up to five binary reward dimensions:
| Dimension | Description |
|---|---|
| ACTION | Were the correct tool actions executed? |
| COMMUNICATE | Was required information communicated to the user? |
| DB | Does the final database state match the expected state? |
| ENV_ASSERTION | Do domain-specific environment assertions pass? |
| NL_ASSERTION | Do natural language assertions hold (evaluated by LLM grader)? |
Each task specifies which reward dimensions are active. The final reward is 1.0 only if all active dimensions pass; any single failure yields 0.0.
This environment uses an LLM grader (gpt-4o-mini) for the NL_ASSERTION dimension only. All other dimensions are verified programmatically.
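The aggregation rule is easy to state in code: the final reward is the product of the task's active binary dimensions, so a single failed dimension zeroes the reward. A minimal sketch (the helper name is ours, not the environment's):

```python
# Multiplicative aggregation of binary reward dimensions, as described
# above. Dimension names match the table; the function is illustrative.

def aggregate_reward(results: dict[str, bool], active: list[str]) -> float:
    reward = 1.0
    for dim in active:
        reward *= 1.0 if results[dim] else 0.0
    return reward

# A task with three active dimensions, one of which failed:
results = {"ACTION": True, "DB": True, "COMMUNICATE": False}
assert aggregate_reward(results, ["ACTION", "DB", "COMMUNICATE"]) == 0.0
```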
Data
Task specifications and domain databases are sourced from the upstream τ²-Bench repository and stored on the OpenReward platform.
Tools
Each domain provides its own domain-specific tools plus three common tools shared across all domains.
Common tools (all domains):
- respond_to_user -- Send a message to the customer and receive their response
- think -- Record a chain-of-thought step (not visible to the user)
- calculate -- Evaluate a mathematical expression
Airline (13 domain tools):
get_user_details, get_reservation_details, search_direct_flight, search_onestop_flight, get_flight_status, book_reservation, update_reservation_flights, update_reservation_passengers, update_reservation_baggages, cancel_reservation, send_certificate, transfer_to_human_agents, list_all_airports
Retail (14 domain tools):
find_user_id_by_email, find_user_id_by_name_zip, get_user_details, get_order_details, get_product_details, list_all_product_types, modify_pending_order_items, modify_pending_order_payment, modify_pending_order_address, modify_user_address, cancel_pending_order, return_delivered_order_items, exchange_delivered_order_items, transfer_to_human_agents
Telecom (13 domain tools):
get_customer_by_phone, get_customer_by_id, get_customer_by_name, get_details_by_id, suspend_line, resume_line, get_bills_for_customer, send_payment_request, get_data_usage, enable_roaming, disable_roaming, transfer_to_human_agents, refuel_data
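For intuition, a single agent turn might look like the following if the tools are surfaced as OpenAI-style function calls. The schema layout and argument names are assumptions; only the tool names come from the lists above.

```python
# Hedged example of two agent turns as OpenAI-style tool calls.
# Argument names are assumptions, not the environment's exact schema.

tool_call = {
    "name": "search_direct_flight",  # airline domain tool
    "arguments": {"origin": "JFK", "destination": "SFO", "date": "2025-05-20"},
}

followup = {
    "name": "respond_to_user",       # common tool: message the customer
    "arguments": {"message": "I found a direct flight from JFK to SFO on May 20."},
}
```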
Time Horizon
tau2bench is a multi-turn conversational environment. Each task is a single conversation between the agent and a simulated user, capped at 100 turns. On each turn, the agent either calls a domain tool (to look up information or take an action) or sends a message to the customer via respond_to_user. The conversation ends when the user simulator emits a stop token or the turn limit is reached.
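The turn loop can be summarized with a short sketch. The agent/user-simulator/environment interface shown here is hypothetical; only the turn limit and the tool-call-versus-message branching come from the description above.

```python
# Hypothetical turn loop; method names on agent, user_sim, and env are
# assumptions, not the tau2bench API.

MAX_TURNS = 100

def run_episode(agent, user_sim, env) -> None:
    message = env.initial_customer_message()
    for _ in range(MAX_TURNS):
        step = agent.act(message)               # a tool call or respond_to_user
        if step.tool != "respond_to_user":
            message = env.execute(step)         # tool result feeds back to the agent
            continue
        reply = user_sim.respond(step.arguments["message"])
        if user_sim.emitted_stop_token(reply):  # user simulator ends the conversation
            break
        message = reply
```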
Environment Difficulty
Example scores on τ²-Bench (Pass^1 across all domains):
| Rank | Model | Score |
|---|---|---|
| 1 | GLM-5 | 89.7 |
| 2 | Step 3.5 Flash | 88.2 |
| 3 | Qwen3.5-397B-A17B | 87.1 |
| 4 | Gemini 3 Pro | 85.4 |
| 5 | DeepSeek-V3.2 Thinking | 80.3 |
Other Environment Requirements
Requires an OpenAI API key (openai_api_key secret) for:
- User simulation (gpt-4.1)
- NL assertion evaluation (gpt-4o-mini)
Safety
Agents in tau2bench interact only with simulated customers and synthetic databases. No real customer data is used, and no real-world actions are taken.
Citations
@misc{barres2025tau2,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
year={2025},
eprint={2506.07982},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.07982},
}