dabstep
Description
DABstep is a benchmark for evaluating AI agents on realistic multi-step data analysis tasks, comprising over 450 real-world challenges derived from a financial analytics platform that require combining code-based data processing with contextual reasoning over heterogeneous documentation. Each task demands iterative multi-step problem-solving with factoid-style answers and automatic correctness checks for objective scoring at scale, and the benchmark is released with a public leaderboard and toolkit to accelerate research in autonomous data analysis.
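To illustrate how factoid-style answers allow automatic scoring, here is a minimal sketch of an answer checker. It is not DABstep's actual scorer: the function names (`normalize`, `is_correct`, `accuracy`), the numeric tolerance, and the normalization rules are all assumptions chosen for illustration.

```python
import math


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed rules)."""
    return " ".join(answer.strip().lower().split())


def is_correct(predicted: str, expected: str, rel_tol: float = 1e-4) -> bool:
    """Return True if a factoid answer matches the reference answer."""
    p, e = normalize(predicted), normalize(expected)
    try:
        # Numeric answers: compare with a relative tolerance (hypothetical value).
        return math.isclose(float(p), float(e), rel_tol=rel_tol)
    except ValueError:
        # Non-numeric answers: exact match after normalization.
        return p == e


def accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of reference tasks answered correctly; missing answers count as wrong."""
    if not references:
        return 0.0
    hits = sum(
        is_correct(predictions.get(task_id, ""), ref)
        for task_id, ref in references.items()
    )
    return hits / len(references)
```

Because each task has a single short reference answer, a checker of this shape can score many submissions without human review, which is what makes a public leaderboard practical.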
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 1 month ago |