tau2bench
Description
tau2bench is an OpenReward environment that implements τ²-Bench, a benchmark for evaluating conversational agents in dual-control customer service settings. The environment spans three realistic domains (airline, retail, and telecom) in which the agent must follow domain-specific policies while using tools to resolve simulated customer issues.
In a dual-control setting, both the agent and the user can modify the shared environment state. The agent uses domain tools (e.g., booking flights, modifying orders, managing telecom lines), while a simulated user drives the conversation with goals drawn from structured task scenarios. In the telecom domain, the user simulator also has access to its own tools, creating a richer coordination challenge.
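To make the dual-control idea concrete, here is a minimal Python sketch in which an agent-side tool and a user-side tool read and write the same shared state. All names are illustrative; this is not the tau2bench API.

```python
# Illustrative sketch of dual control: both the agent and the simulated
# user can mutate the same environment state. All names here are
# hypothetical -- they are not the tau2bench API.

shared_state = {"line_status": "suspended", "data_remaining_gb": 0.0}

def agent_resume_line(state: dict) -> None:
    """Agent-side tool: the agent re-enables the customer's line."""
    state["line_status"] = "active"

def user_run_speed_test(state: dict) -> str:
    """User-side tool (telecom only): the simulated user checks the line,
    observing state the agent may have just changed."""
    return "ok" if state["line_status"] == "active" else "no service"

agent_resume_line(shared_state)
assert user_run_speed_test(shared_state) == "ok"
```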
Capabilities
- Multi-turn conversational customer service across airline, retail, and telecom domains
- Following complex domain-specific policy documents
- Coordinating with simulated users who actively modify shared state
- Using domain-specific tools for booking, cancellation, account management, and more
Compute Requirements
No sandbox is required for this environment.
Tasks
tau2bench contains three sub-environments, one per customer service domain. The base split is the recommended split for benchmarking: it matches the original τ²-Bench evaluation methodology. The train split comes from the upstream repository and is intended for RL training.
Airline (Tau2AirlineEnvironment)
50 total tasks covering flight booking, reservation management, and travel support scenarios.
Retail (Tau2RetailEnvironment)
114 total tasks covering e-commerce order management, returns, exchanges, and customer account operations.
Telecom (Tau2TelecomEnvironment)
2,285 total tasks generated compositionally from 15 subtask groups, covering line management, billing, data usage, and roaming scenarios.
Each task defines a user scenario with specific goals and an evaluation rubric. The agent is given the domain policy and an initial customer message, then must converse with the simulated user and take appropriate tool actions to resolve the issue.
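As a rough illustration of what a task bundles together, the sketch below models a task record as a dataclass. The field names are assumptions for intuition only, not the platform's actual schema.

```python
# Hypothetical shape of a tau2bench task record, for intuition only;
# field names are assumptions, not the platform's actual schema.
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    task_id: str
    domain: str              # "airline" | "retail" | "telecom"
    user_scenario: str       # goals driving the simulated user
    initial_message: str     # first customer message the agent sees
    active_dimensions: list[str] = field(default_factory=list)

example = TaskSpec(
    task_id="retail_042",
    domain="retail",
    user_scenario="Customer wants to exchange a delivered item for a larger size.",
    initial_message="Hi, I ordered a jacket but it doesn't fit.",
    active_dimensions=["ACTION", "DB", "COMMUNICATE"],
)
```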
Reward Structure
Reward is evaluated at the end of each conversation. The environment uses a multiplicative aggregation of up to five binary reward dimensions:
| Dimension | Description |
|---|---|
| ACTION | Were the correct tool actions executed? |
| COMMUNICATE | Was required information communicated to the user? |
| DB | Does the final database state match the expected state? |
| ENV_ASSERTION | Do domain-specific environment assertions pass? |
| NL_ASSERTION | Do natural language assertions hold (evaluated by LLM grader)? |
Each task specifies which reward dimensions are active. The final reward is 1.0 only if all active dimensions pass; any single failure yields 0.0.
This environment uses an LLM grader (gpt-4o-mini) for the NL_ASSERTION dimension only. All other dimensions are verified programmatically.
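The aggregation rule is easy to state in code: the final reward is the product of the task's active binary dimensions, so a single failed dimension zeroes the reward. A minimal sketch (the helper name is ours, not the environment's):

```python
# Multiplicative aggregation of binary reward dimensions, as described
# above. Dimension names match the table; the function is illustrative.

def aggregate_reward(results: dict[str, bool], active: list[str]) -> float:
    reward = 1.0
    for dim in active:
        reward *= 1.0 if results[dim] else 0.0
    return reward

# A task with three active dimensions, one of which failed:
results = {"ACTION": True, "DB": True, "COMMUNICATE": False}
assert aggregate_reward(results, ["ACTION", "DB", "COMMUNICATE"]) == 0.0
```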
Data
Task specifications and domain databases are sourced from the upstream τ²-Bench repository and stored on the OpenReward platform.
Tools
Each domain provides its own domain-specific tools plus three common tools shared across all domains.
Common tools (all domains):
- respond_to_user -- Send a message to the customer and receive their response
- think -- Record a chain-of-thought step (not visible to the user)
- calculate -- Evaluate a mathematical expression
Airline (13 domain tools):
get_user_details, get_reservation_details, search_direct_flight, search_onestop_flight, get_flight_status, book_reservation, update_reservation_flights, update_reservation_passengers, update_reservation_baggages, cancel_reservation, send_certificate, transfer_to_human_agents, list_all_airports
Retail (14 domain tools):
find_user_id_by_email, find_user_id_by_name_zip, get_user_details, get_order_details, get_product_details, list_all_product_types, modify_pending_order_items, modify_pending_order_payment, modify_pending_order_address, modify_user_address, cancel_pending_order, return_delivered_order_items, exchange_delivered_order_items, transfer_to_human_agents
Telecom (13 domain tools):
get_customer_by_phone, get_customer_by_id, get_customer_by_name, get_details_by_id, suspend_line, resume_line, get_bills_for_customer, send_payment_request, get_data_usage, enable_roaming, disable_roaming, transfer_to_human_agents, refuel_data
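For intuition, a single agent turn might look like the following if the tools are surfaced as OpenAI-style function calls. The schema layout and argument names are assumptions; only the tool names come from the lists above.

```python
# Hedged example of two agent turns as OpenAI-style tool calls.
# Argument names are assumptions, not the environment's exact schema.

tool_call = {
    "name": "search_direct_flight",  # airline domain tool
    "arguments": {"origin": "JFK", "destination": "SFO", "date": "2025-05-20"},
}

followup = {
    "name": "respond_to_user",       # common tool: message the customer
    "arguments": {"message": "I found a direct flight from JFK to SFO on May 20."},
}
```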
Time Horizon
tau2bench is a multi-turn conversational environment. Each task is a single conversation between the agent and a simulated user, capped at 100 turns. On each turn, the agent either calls a domain tool (to look up information or take an action) or sends a message to the customer via respond_to_user. The conversation ends when the user simulator emits a stop token or the turn limit is reached.
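The turn loop can be summarized with a short sketch. The agent/user-simulator/environment interface shown here is hypothetical; only the turn limit and the tool-call-versus-message branching come from the description above.

```python
# Hypothetical turn loop; method names on agent, user_sim, and env are
# assumptions, not the tau2bench API.

MAX_TURNS = 100

def run_episode(agent, user_sim, env) -> None:
    message = env.initial_customer_message()
    for _ in range(MAX_TURNS):
        step = agent.act(message)               # a tool call or respond_to_user
        if step.tool != "respond_to_user":
            message = env.execute(step)         # tool result feeds back to the agent
            continue
        reply = user_sim.respond(step.arguments["message"])
        if user_sim.emitted_stop_token(reply):  # user simulator ends the conversation
            break
        message = reply
```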
Environment Difficulty
Example scores on τ²-Bench (Pass^1 across all domains):
| Rank | Model | Score |
|---|---|---|
| 1 | GLM-5 | 89.7 |
| 2 | Step 3.5 Flash | 88.2 |
| 3 | Qwen3.5-397B-A17B | 87.1 |
| 4 | Gemini 3 Pro | 85.4 |
| 5 | DeepSeek-V3.2 Thinking | 80.3 |
Other Environment Requirements
Requires an OpenAI API key (openai_api_key secret) for:
- User simulation (gpt-4.1)
- NL assertion evaluation (gpt-4o-mini)
Safety
Agents in tau2bench interact only with simulated customers and synthetic databases. No real customer data is used, and no real-world actions are taken.
Citations
@misc{barres2025tau2,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
year={2025},
eprint={2506.07982},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.07982},
}