tau2-bench

Name: Sierra/tau2-bench
Author: Sierra

Telecom Dual-Control Conversational Agent Evaluation

Description

τ²-bench is a benchmark for evaluating conversational AI agents in a Telecom dual-control domain modeled as a Dec-POMDP (decentralized partially observable Markov decision process), in which both the agent and the user use tools to act in a shared, dynamic environment. This setting tests coordination, communication, and the agent's ability to guide the user's actions. The benchmark programmatically generates diverse, verifiable tasks, includes a tool-constrained user simulator that is tightly coupled to the environment, and supports fine-grained analysis that separates reasoning errors from communication and coordination failures.
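
The dual-control idea described above can be illustrated with a toy sketch. This is not the tau2-bench API: every name here (`SharedState`, `DualControlEnv`, `resume_line`, `toggle_airplane_mode`, `run_episode`) is hypothetical and chosen only for illustration. It assumes a simplified world in which the agent controls account-side tools, the user controls device-side tools, and success is checked against the final shared state.

```python
from dataclasses import dataclass


@dataclass
class SharedState:
    # Hypothetical world state; field names are illustrative, not from tau2-bench.
    airplane_mode: bool = True    # device side: only the user can change it
    line_suspended: bool = True   # account side: only the agent can change it

    def service_works(self) -> bool:
        # Verifiable goal: service works only when both sides are fixed.
        return not self.airplane_mode and not self.line_suspended


class DualControlEnv:
    """Toy environment in which each party may call only its own tools."""

    def __init__(self) -> None:
        self.state = SharedState()
        self.tools = {
            "agent": {"resume_line": self._resume_line},
            "user": {"toggle_airplane_mode": self._toggle_airplane_mode},
        }

    def _resume_line(self) -> str:
        self.state.line_suspended = False
        return "line resumed"

    def _toggle_airplane_mode(self) -> str:
        self.state.airplane_mode = not self.state.airplane_mode
        return f"airplane mode {'on' if self.state.airplane_mode else 'off'}"

    def call(self, role: str, tool: str) -> str:
        # Reject any tool that does not belong to the calling role.
        if tool not in self.tools.get(role, {}):
            raise PermissionError(f"{role} cannot call {tool}")
        return self.tools[role][tool]()


def run_episode(steps) -> bool:
    """Apply (role, tool) steps to a fresh environment and check the goal."""
    env = DualControlEnv()
    for role, tool in steps:
        env.call(role, tool)
    return env.state.service_works()
```

In this sketch the agent cannot complete the task alone: calling `run_episode([("agent", "resume_line")])` leaves the device in airplane mode, so the agent must also get the user to call `toggle_airplane_mode`. That dependency is what makes communication and guidance part of what is being evaluated, separately from the agent's own reasoning.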

arXiv

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/tau2bench	1	3 months ago