dabstep

Name: arXiv/dabstep
Author: arXiv

Description

DABstep is a benchmark for evaluating AI agents on realistic multi-step data analysis tasks. It comprises over 450 real-world challenges derived from a financial analytics platform, each requiring code-based data processing combined with contextual reasoning over heterogeneous documentation. Tasks demand iterative, multi-step problem-solving and have factoid-style answers that are checked automatically, which allows objective scoring at scale. The benchmark is released with a public leaderboard and toolkit to accelerate research in autonomous data analysis.
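To make the idea of automatic correctness checks on factoid-style answers concrete, here is a minimal sketch in Python. It is not DABstep's actual scorer: the function names (`normalize`, `to_number`, `is_correct`, `accuracy`), the tolerance value, and the task IDs in the usage note are all hypothetical, chosen only to illustrate how short answers might be compared objectively.

```python
import math


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(text.strip().lower().split())


def to_number(text: str):
    """Return a finite float if the text parses as a number, else None."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def is_correct(predicted: str, expected: str, rel_tol: float = 1e-4) -> bool:
    """Hypothetical factoid check (an assumption, not the benchmark's rule):
    numeric answers match within a relative tolerance, and all other
    answers must match exactly after normalization."""
    p, e = normalize(predicted), normalize(expected)
    p_num, e_num = to_number(p), to_number(e)
    if p_num is not None and e_num is not None:
        return math.isclose(p_num, e_num, rel_tol=rel_tol)
    return p == e


def accuracy(predictions: dict, answers: dict) -> float:
    """Fraction of tasks answered correctly; missing predictions count as wrong."""
    if not answers:
        return 0.0
    hits = sum(
        is_correct(predictions.get(task_id, ""), expected)
        for task_id, expected in answers.items()
    )
    return hits / len(answers)
```

For example, under these assumptions `is_correct("  NL ", "nl")` and `is_correct("0.12340", "0.1234")` both count as matches, while `accuracy({"t1": "42"}, {"t1": "42.0", "t2": "no"})` credits one of two tasks. A real scorer would likely need more careful handling of lists, units, and rounding.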

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/DABStep	0	3 months ago