dabstep

Description

DABstep is a benchmark for evaluating AI agents on realistic multi-step data analysis tasks. It comprises over 450 real-world challenges derived from a financial analytics platform; solving them requires combining code-based data processing with contextual reasoning over heterogeneous documentation. Each task demands iterative, multi-step problem-solving and has a factoid-style answer that is checked automatically, which allows objective scoring at scale. The benchmark is released with a public leaderboard and a toolkit to accelerate research in autonomous data analysis.
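To illustrate how factoid-style answers lend themselves to automatic checking, here is a minimal sketch of a scorer. It is not DABstep's actual scoring code: the function names (`normalize`, `is_correct`, `accuracy`), the whitespace and case normalization, and the numeric tolerance are all assumptions chosen for illustration.

```python
import math


def normalize(answer: str) -> str:
    """Lowercase the answer and collapse runs of whitespace (hypothetical rule)."""
    return " ".join(answer.strip().lower().split())


def is_correct(predicted: str, expected: str, rel_tol: float = 1e-4) -> bool:
    """Compare two factoid answers.

    If both answers parse as numbers, compare them within a relative
    tolerance; otherwise fall back to exact match on the normalized text.
    The tolerance value is an assumption, not a benchmark setting.
    """
    p, e = normalize(predicted), normalize(expected)
    try:
        return math.isclose(float(p), float(e), rel_tol=rel_tol)
    except ValueError:
        return p == e


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Because each answer is a short factoid, a check like this needs no human grader, which is what makes scoring at scale practical. A real scorer would likely need extra rules for lists, units, and rounding.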

Implementations (1)
Environment               Stars  Last Updated
GeneralReasoning/DABStep  0      1 month ago