tau2-bench
Description
τ²-bench is a benchmark for evaluating conversational AI agents in a Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user use tools to act in a shared, dynamic environment that stresses coordination, communication, and guidance. It programmatically generates diverse, verifiable tasks, includes a tool-constrained user simulator tightly coupled to the environment, and enables fine-grained analysis separating reasoning errors from communication/coordination failures.
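To make the dual-control idea concrete, here is a minimal sketch of a shared environment in which the agent and the user each hold a different tool and the task is verified against the final state. It is not taken from the benchmark's code: `SharedState`, `agent_resume_line`, `user_toggle_airplane_mode`, and `task_solved` are hypothetical names chosen for illustration, and the two-flag telecom scenario is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SharedState:
    """Hypothetical shared environment: a phone plus the carrier's account record."""
    airplane_mode: bool = True    # device-side setting; only the user's tool changes it
    line_suspended: bool = True   # account-side setting; only the agent's tool changes it

    def has_service(self) -> bool:
        return not self.airplane_mode and not self.line_suspended


def agent_resume_line(state: SharedState) -> str:
    """Agent-side tool (hypothetical): acts on the account record."""
    state.line_suspended = False
    return "line resumed"


def user_toggle_airplane_mode(state: SharedState) -> str:
    """User-side tool (hypothetical): acts on the device."""
    state.airplane_mode = not state.airplane_mode
    return f"airplane mode {'on' if state.airplane_mode else 'off'}"


def task_solved(state: SharedState) -> bool:
    """Verifiable goal: checked against the final state, not the transcript."""
    return state.has_service()


state = SharedState()
agent_resume_line(state)
agent_only = task_solved(state)   # the user has not acted yet
user_toggle_airplane_mode(state)
both_acted = task_solved(state)   # both parties have now acted
print(agent_only, both_acted)
```

In this sketch the agent cannot reach the goal alone: it has to guide the user to apply the device-side tool, which is the kind of coordination the description refers to.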
Implementations (1)
| Environment | Stars | Last Updated | |
|---|---|---|---|
| | 1 | 1 month ago | |