tau2-bench

Description

τ²-bench is a benchmark for evaluating conversational AI agents in a dual-control Telecom domain modeled as a decentralized partially observable Markov decision process (Dec-POMDP): both the agent and the user use tools to act in a shared, dynamic environment, which tests the agent's coordination, communication, and ability to guide the user. The benchmark programmatically generates diverse, verifiable tasks, includes a user simulator that is constrained by tools and tightly coupled to the environment, and supports fine-grained analysis that separates reasoning errors from communication and coordination failures.
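The dual-control idea can be illustrated with a toy sketch. This is not the benchmark's code or API: the class `SharedEnv`, the tool names, and `run_episode` are all hypothetical and chosen only to show two roles with separate tool sets acting on one shared state, with a goal that can be checked programmatically.

```python
from dataclasses import dataclass


@dataclass
class SharedEnv:
    """Toy shared state (hypothetical): one device-side flag, one operator-side flag."""
    airplane_mode: bool = True    # only the user can change this
    line_suspended: bool = True   # only the agent can change this

    def has_service(self) -> bool:
        # Verifiable goal: service works only when both sides are fixed.
        return not self.airplane_mode and not self.line_suspended


def user_toggle_airplane_mode(env: SharedEnv) -> str:
    env.airplane_mode = not env.airplane_mode
    return f"airplane_mode={env.airplane_mode}"


def agent_resume_line(env: SharedEnv) -> str:
    env.line_suspended = False
    return "line resumed"


# Each role may only call the tools it owns (assumed names, for illustration).
USER_TOOLS = {"toggle_airplane_mode": user_toggle_airplane_mode}
AGENT_TOOLS = {"resume_line": agent_resume_line}


def run_episode(env: SharedEnv, turns):
    """Apply (role, tool_name) steps in order; reject tools the role does not own."""
    log = []
    for role, tool in turns:
        tools = AGENT_TOOLS if role == "agent" else USER_TOOLS
        if tool not in tools:
            log.append((role, tool, "rejected"))
            continue
        log.append((role, tool, tools[tool](env)))
    return log


env = SharedEnv()
log = run_episode(
    env,
    [
        ("agent", "resume_line"),            # agent fixes the operator side
        ("agent", "toggle_airplane_mode"),   # rejected: this is a user-only tool
        ("user", "toggle_airplane_mode"),    # user fixes the device side
    ],
)
print(log)
print("service restored:", env.has_service())
```

In this sketch the agent cannot reach the goal alone: it has to guide the user to take the device-side action, which is the kind of coordination the description refers to.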

Implementations (1)
Environment                  Stars  Last Updated
GeneralReasoning/tau2bench   1      1 month ago
Sierra/tau2-bench | OpenReward