DABStep
Description
DABStep (Data Agent Benchmark for Multi-Step Reasoning) is an ORS environment for evaluating language model agents on data analysis tasks involving payment transaction data. Agents are given access to CSV files, JSON configuration data, and documentation, and must answer questions about fees, merchants, and payment processing by writing and executing Python/bash code in a sandboxed environment.
Based on the DABStep benchmark by Egg et al. (2025).
Note: the gold answers and grading implementation were not provided by the original paper and were reverse-engineered.
Capabilities
- Reading and parsing large CSV and JSON data files
- Writing Python code for data analysis with pandas
- Answering quantitative questions about payment transactions
- Multi-step reasoning with file system exploration
Compute Requirements
Agents are given a sandbox with 0.5 CPUs and 1GB RAM, with pandas pre-installed.
License
Tasks
There are 450 tasks across 3 splits:
| Split | Tasks | Description |
|---|---|---|
| easy | 72 | Simpler data lookup and aggregation questions |
| hard | 378 | Complex multi-step analysis questions |
| all | 450 | All tasks combined |
Each task provides a question about payment transaction data along with guidelines for formatting the answer. The agent must explore the provided data files, write analysis code, and submit a final answer.
Reward Structure
This environment uses binary, verifiable rewards. Answers are graded against ground truth using a comprehensive normalization and matching system that handles numbers (with tolerance), lists, categorical values, and special formats (card schemes, label-value pairs, etc.). The reward is 1.0 for correct answers and 0.0 for incorrect answers.
We do not use LLM graders for this task.
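The exact normalization rules are not spelled out in this README. As a rough illustration only, the sketch below shows how a binary grader of this kind might compare a submitted answer with the ground truth, assuming a relative tolerance for numbers and order-insensitive matching for comma-separated lists. The function names and the tolerance value are hypothetical and are not taken from the environment's actual implementation.

```python
import math

REL_TOL = 1e-2  # hypothetical tolerance; the real grader's value is not documented here


def to_number(text):
    """Return text parsed as a float, or None if it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def items_match(predicted, expected):
    """Compare two single values: numerically if both parse, else as text."""
    p, e = to_number(predicted), to_number(expected)
    if p is not None and e is not None:
        return math.isclose(p, e, rel_tol=REL_TOL, abs_tol=1e-9)
    return predicted.strip().lower() == expected.strip().lower()


def grade(predicted, expected):
    """Return 1.0 if the answer matches the ground truth, else 0.0."""
    # Treat every answer as a comma-separated list; a scalar is a list of one.
    pred_items = sorted(s.strip().lower() for s in predicted.split(","))
    exp_items = sorted(s.strip().lower() for s in expected.split(","))
    if len(pred_items) != len(exp_items):
        return 0.0
    ok = all(items_match(p, e) for p, e in zip(pred_items, exp_items))
    return 1.0 if ok else 0.0
```

Under these assumptions, `grade("0.1234", "0.1235")` and `grade("B, a", "a, b")` both yield 1.0, while a list of the wrong length yields 0.0. The special formats mentioned above (card schemes, label-value pairs) would need additional rules that this sketch does not attempt.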
Data
Agents are given access to the following files in the sandbox:
- payments.csv (~23MB) - Payment transaction records
- fees.json - Fee structure configuration
- merchant_data.json - Merchant information
- merchant_category_codes.csv - MCC lookup table
- acquirer_countries.csv - Acquirer country mappings
- manual.md - Documentation for the payment system
- payments-readme.md - Data dictionary for payments.csv
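A typical first step for an agent is to load these files and inspect their shape before writing any analysis. The sketch below is one way to do that with pandas and the standard `json` module. The helper names `load_tables` and `summarize` are hypothetical, and the directory holding the files is passed in as an argument because its location in the sandbox is not stated here. No column names or JSON layouts are assumed.

```python
import json
from pathlib import Path

import pandas as pd

CSV_FILES = ("payments.csv", "merchant_category_codes.csv", "acquirer_countries.csv")
JSON_FILES = ("fees.json", "merchant_data.json")


def load_tables(data_dir):
    """Load the CSV and JSON data files from data_dir into a dict keyed by file name."""
    root = Path(data_dir)
    tables = {}
    for name in CSV_FILES:
        tables[name] = pd.read_csv(root / name)
    for name in JSON_FILES:
        with open(root / name, encoding="utf-8") as fh:
            tables[name] = json.load(fh)
    return tables


def summarize(tables):
    """Return a one-line description per file as a first exploration step."""
    summary = {}
    for name, obj in tables.items():
        if isinstance(obj, pd.DataFrame):
            summary[name] = f"{len(obj)} rows, columns: {list(obj.columns)}"
        else:
            summary[name] = f"{type(obj).__name__} with {len(obj)} top-level entries"
    return summary
```

The two Markdown files (manual.md and payments-readme.md) are left out on purpose: they are documentation to be read rather than parsed, and they explain how to interpret the columns and fee rules that the summary reveals.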
Tools
Agents are given access to CLI tools for creating, viewing, and searching a filesystem (bash, glob, grep, ls, read, write, edit, multi_edit, todo_write). They also have one environment-specific tool:
- answer: Submit the final answer for grading against ground truth.
Time Horizon
DABStep is a medium-horizon environment. Each task typically requires several tool calls to explore data, write analysis code, and submit an answer.
Other Environment Requirements
There are no further environment requirements beyond the OpenReward platform.
Safety
Agents in DABStep interact with synthetic payment data within a sandboxed environment. No real financial data or systems are involved.
Citations
@article{Egg2025DABstep,
title={DABstep: Data Agent Benchmark for Multi-step Reasoning},
author={Egg, Alex and Goyanes, Martin Iglesias and Kingma, Friso and Mora, Andreu and von Werra, Leandro and Wolf, Thomas},
journal={arXiv preprint arXiv:2506.23719},
year={2025}
}