tau2bench

OpenReward Environment

Description

tau2bench is an OpenReward environment that implements τ²-Bench, a benchmark for evaluating conversational agents in dual-control customer service environments. The environment spans three realistic domains -- airline, retail, and telecom -- where an agent must follow domain-specific policies while using tools to resolve simulated customer issues.

In a dual-control setting, both the agent and the user can modify the shared environment state. The agent uses domain tools (e.g., booking flights, modifying orders, managing telecom lines), while a simulated user drives the conversation with goals drawn from structured task scenarios. In the telecom domain, the user simulator also has access to its own tools, creating a richer coordination challenge.
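As a toy illustration of the dual-control idea (a conceptual sketch only, not the environment's actual implementation or tool set), both sides mutate the same shared state:

```python
# Toy sketch of dual control: an agent-side tool and a user-simulator tool
# mutate the same shared state. Names are illustrative, not the real tools.
state = {"line_status": "suspended", "airplane_mode": True}

def agent_resume_line(state):
    """Agent-side tool: reactivate the customer's line."""
    state["line_status"] = "active"

def user_disable_airplane_mode(state):
    """User-simulator tool (telecom domain): change the device state."""
    state["airplane_mode"] = False

agent_resume_line(state)
user_disable_airplane_mode(state)
print(state)  # {'line_status': 'active', 'airplane_mode': False}
```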

Capabilities

  • Multi-turn conversational customer service across airline, retail, and telecom domains
  • Following complex domain-specific policy documents
  • Coordinating with simulated users who actively modify shared state
  • Using domain-specific tools for booking, cancellation, account management, and more

Compute Requirements

No sandbox is required for this environment.

License

MIT

Tasks

tau2bench contains three sub-environments, one per customer service domain. The base split matches the original τ²-Bench evaluation methodology and is the recommended split for benchmarking. The train split comes from the upstream repository and is intended for RL training.

Airline (Tau2AirlineEnvironment)

50 total tasks covering flight booking, reservation management, and travel support scenarios.

Retail (Tau2RetailEnvironment)

114 total tasks covering e-commerce order management, returns, exchanges, and customer account operations.

Telecom (Tau2TelecomEnvironment)

2,285 total tasks generated compositionally from 15 subtask groups, covering line management, billing, data usage, and roaming scenarios.

Each task defines a user scenario with specific goals and an evaluation rubric. The agent is given the domain policy and an initial customer message, then must converse with the simulated user and take appropriate tool actions to resolve the issue.
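A task might look roughly like the following sketch. The field names are assumptions loosely modeled on the structure described above, not the exact upstream τ²-Bench schema:

```python
# Hypothetical task specification; keys are illustrative, not the exact
# upstream tau2-bench schema.
task = {
    "id": "retail_042",
    "user_scenario": "You want to return the blender from order #W123 and be refunded.",
    "evaluation_criteria": {
        "actions": [  # checked by the ACTION reward dimension
            {"name": "return_delivered_order_items",
             "arguments": {"order_id": "#W123", "item_ids": ["blender_1"]}},
        ],
        "nl_assertions": [  # checked by the NL_ASSERTION dimension (LLM grader)
            "The agent confirmed the refund method with the user.",
        ],
    },
}
```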

Reward Structure

Reward is evaluated at the end of each conversation. The environment uses a multiplicative aggregation of up to five binary reward dimensions:

$$R = \prod_{i \in \text{active}} r_i, \quad r_i \in \{0, 1\}$$

  • ACTION -- Were the correct tool actions executed?
  • COMMUNICATE -- Was required information communicated to the user?
  • DB -- Does the final database state match the expected state?
  • ENV_ASSERTION -- Do domain-specific environment assertions pass?
  • NL_ASSERTION -- Do natural language assertions hold (evaluated by an LLM grader)?

Each task specifies which reward dimensions are active. The final reward is 1.0 only if all active dimensions pass; any single failure yields 0.0.
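A minimal sketch of this aggregation (illustrative only; the dimension names match the list above, but the function is not the environment's actual API):

```python
# Minimal sketch of the multiplicative reward aggregation described above.
# Not the environment's API; dimension names match the list above.
def aggregate_reward(passed: dict[str, bool], active: set[str]) -> float:
    """Return 1.0 only if every active binary dimension passed, else 0.0."""
    return float(all(passed[dim] for dim in active))

passed = {"ACTION": True, "COMMUNICATE": True, "DB": False}
print(aggregate_reward(passed, active={"ACTION", "DB"}))  # 0.0 -- DB check failed
```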

This environment uses an LLM grader (gpt-4o-mini) for the NL_ASSERTION dimension only. All other dimensions are verified programmatically.

Data

Task specifications and domain databases are sourced from the upstream τ²-Bench repository and stored on the OpenReward platform.

Tools

Each domain provides a set of domain-specific tools plus three common tools shared across all domains.

Common tools (all domains):

  • respond_to_user -- Send a message to the customer and receive their response
  • think -- Record a chain-of-thought step (not visible to user)
  • calculate -- Evaluate a mathematical expression
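For illustration, a respond_to_user call might be expressed as a standard function-call payload; the exact wire format used by the environment is an assumption here:

```python
# Illustrative tool-call payload for respond_to_user; the exact wire format
# used by the environment is an assumption.
tool_call = {
    "name": "respond_to_user",
    "arguments": {
        "message": "I've cancelled reservation ABC123. Is there anything else I can help with?",
    },
}
```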

Airline (13 domain tools):
get_user_details, get_reservation_details, search_direct_flight, search_onestop_flight, get_flight_status, book_reservation, update_reservation_flights, update_reservation_passengers, update_reservation_baggages, cancel_reservation, send_certificate, transfer_to_human_agents, list_all_airports

Retail (14 domain tools):
find_user_id_by_email, find_user_id_by_name_zip, get_user_details, get_order_details, get_product_details, list_all_product_types, modify_pending_order_items, modify_pending_order_payment, modify_pending_order_address, modify_user_address, cancel_pending_order, return_delivered_order_items, exchange_delivered_order_items, transfer_to_human_agents

Telecom (13 domain tools):
get_customer_by_phone, get_customer_by_id, get_customer_by_name, get_details_by_id, suspend_line, resume_line, get_bills_for_customer, send_payment_request, get_data_usage, enable_roaming, disable_roaming, transfer_to_human_agents, refuel_data

Time Horizon

tau2bench is a multi-turn conversational environment. Each task is a single conversation between the agent and a simulated user, capped at 100 turns. On each turn, the agent either calls a domain tool to look up information or take an action, or sends a message to the customer via respond_to_user. The conversation ends when the user simulator emits a stop token or the turn limit is reached.
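The control flow is roughly the following. This is a hypothetical sketch: the agent and environment interfaces are assumptions, not the actual OpenReward API.

```python
# Hypothetical sketch of the turn loop described above; the agent and env
# interfaces are assumptions, not the actual OpenReward API.
MAX_TURNS = 100

def run_episode(agent, env):
    observation = env.initial_customer_message()
    for _ in range(MAX_TURNS):
        action = agent.act(observation)       # a tool call or a respond_to_user message
        observation, done = env.step(action)  # tools execute / user simulator replies
        if done:                              # user simulator emitted a stop token
            break
    return env.evaluate()                     # reward computed at conversation end
```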

Environment Difficulty

Example scores on τ²-Bench (Pass^1 across all domains):

  1. GLM-5 -- 89.7
  2. Step 3.5 Flash -- 88.2
  3. Qwen3.5-397B-A17B -- 87.1
  4. Gemini 3 Pro -- 85.4
  5. DeepSeek-V3.2 Thinking -- 80.3

Other Environment Requirements

Requires an OpenAI API key (openai_api_key secret) for:

  • User simulation (gpt-4.1)
  • NL assertion evaluation (gpt-4o-mini)

Safety

Agents in tau2bench interact only with simulated customers and synthetic databases. No real customer data is used, and no real-world actions are taken.

Citations

@misc{barres2025tau2,
      title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
      author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
      year={2025},
      eprint={2506.07982},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.07982},
}