DABStep

Description

DABStep (Data Agent Benchmark for Multi-Step Reasoning) is an ORS environment for evaluating language model agents on data analysis tasks involving payment transaction data. Agents are given access to CSV files, JSON configuration data, and documentation, and must answer questions about fees, merchants, and payment processing by writing and executing Python/bash code in a sandboxed environment.

Based on the DABStep benchmark by Egg et al. (2025).

Note: The gold answers and the grading implementation were not provided with the original paper; both were reverse-engineered.

Capabilities

Reading and parsing large CSV and JSON data files
Writing Python code for data analysis with pandas
Answering quantitative questions about payment transactions
Multi-step reasoning with file system exploration

Compute Requirements

Each agent runs in a sandbox with 0.5 vCPUs and 1 GB of RAM, with pandas pre-installed.

License

ORLv1.

Tasks

There are 450 tasks, divided into an easy split and a hard split; a third split, all, combines the two:

Split	Tasks	Description
easy	72	Simpler data lookup and aggregation questions
hard	378	Complex multi-step analysis questions
all	450	All tasks combined

Each task provides a question about payment transaction data along with guidelines for formatting the answer. The agent must explore the provided data files, write analysis code, and submit a final answer.

Reward Structure

This environment uses binary, verifiable rewards. Each answer is normalized and then matched against the ground truth; the matcher handles numbers (within a tolerance), lists, categorical values, and special formats such as card schemes and label-value pairs. The reward is 1.0 for a correct answer and 0.0 for an incorrect one.
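
The normalize-then-match idea can be illustrated with a minimal sketch. This is not the environment's grader, which is not shown here: the function names, the 1% relative tolerance, and the choice to treat comma-separated lists as unordered are all assumptions made for illustration.

```python
import math


def normalize(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (illustrative rule)."""
    return " ".join(value.strip().lower().split())


def as_number(value: str):
    """Return a float if the text parses as a number, otherwise None."""
    try:
        return float(value.strip())
    except ValueError:
        return None


def score(predicted: str, gold: str, rel_tol: float = 1e-2) -> float:
    """Hypothetical binary reward: 1.0 on a match, 0.0 otherwise."""
    # Lists: compare as unordered collections of normalized items (assumption).
    if "," in gold:
        pred_items = sorted(normalize(p) for p in predicted.split(","))
        gold_items = sorted(normalize(g) for g in gold.split(","))
        return 1.0 if pred_items == gold_items else 0.0

    # Numbers: accept values within a relative tolerance (assumed to be 1%).
    p, g = as_number(predicted), as_number(gold)
    if p is not None and g is not None:
        return 1.0 if math.isclose(p, g, rel_tol=rel_tol, abs_tol=1e-9) else 0.0

    # Everything else: exact match after normalization.
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


print(score("12.50", "12.5"))
print(score("Visa, mastercard", "MasterCard,visa"))
print(score("13.5", "12.5"))
```

Because the reward is binary, a grader of this shape never gives partial credit: a list with one wrong item scores the same as a completely wrong answer.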

We do not use LLM graders for this task.

Data

Agents are given access to the following files in the sandbox:

payments.csv (~23MB) - Payment transaction records
fees.json - Fee structure configuration
merchant_data.json - Merchant information
merchant_category_codes.csv - MCC lookup table
acquirer_countries.csv - Acquirer country mappings
manual.md - Documentation for the payment system
payments-readme.md - Data dictionary for payments.csv
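
With only 1 GB of RAM in the sandbox, one cautious way to aggregate a file the size of payments.csv is to stream it row by row instead of loading it all at once. The sketch below uses only the Python standard library and an in-memory stand-in for the file; the column names merchant and eur_amount are hypothetical placeholders, and the real schema should be taken from payments-readme.md.

```python
import csv
import io
from collections import defaultdict


def total_by_key(rows, key_column, amount_column):
    """Sum amount_column per distinct key_column value, one row at a time."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_column]] += float(row[amount_column])
    return dict(totals)


# Stand-in for open("payments.csv", newline=""); the columns are hypothetical.
sample = io.StringIO(
    "merchant,eur_amount\n"
    "shop_a,10.50\n"
    "shop_b,4.25\n"
    "shop_a,2.00\n"
)

# csv.DictReader yields one dict per row, so memory use stays flat.
totals = total_by_key(csv.DictReader(sample), "merchant", "eur_amount")
print(totals)
```

In the sandbox itself, pandas is pre-installed and would usually be the more convenient tool; the streaming version is mainly useful when memory becomes the limiting factor.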

Tools

Agents are given access to CLI tools for creating, viewing, and searching a filesystem (bash, glob, grep, ls, read, write, edit, multi_edit, todo_write). They also have one environment-specific tool:

answer: Submit the final answer for grading against ground truth.

Time Horizon

DABStep is a medium-horizon environment. Each task typically requires several tool calls to explore data, write analysis code, and submit an answer.

Other Environment Requirements

There are no further environment requirements beyond the OpenReward platform.

Safety

Agents in DABStep interact with synthetic payment data within a sandboxed environment. No real financial data or systems are involved.

Citations

@article{Egg2025DABstep,
  title={DABstep: Data Agent Benchmark for Multi-step Reasoning},
  author={Egg, Alex and Goyanes, Martin Iglesias and Kingma, Friso and Mora, Andreu and von Werra, Leandro and Wolf, Thomas},
  journal={arXiv preprint arXiv:2506.23719},
  year={2025}
}

Repository

Source repository: EnvCommons/DABStep

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	0.5 vCPUs / 1 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000115
Total	$0.0000435

Examples

5-minute session: $0.0131
1-hour session: $0.1566
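
The example figures follow from multiplying the combined per-second rate by the session length. A minimal sketch of that arithmetic, assuming the result is rounded half-up to four decimal places (the rounding rule is an assumption; the rates are the ones listed in the table above):

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-second rates from the cost table.
ENVIRONMENT_RATE = Decimal("0.0000320")
SANDBOX_RATE = Decimal("0.0000115")


def session_cost(seconds: int) -> Decimal:
    """Cost in dollars for a session of the given length in seconds."""
    total_rate = ENVIRONMENT_RATE + SANDBOX_RATE  # 0.0000435 per second
    # Rounding to four decimals, half-up, is an assumed display rule.
    return (total_rate * seconds).quantize(
        Decimal("0.0001"), rounding=ROUND_HALF_UP
    )


print(session_cost(5 * 60))   # 5-minute session
print(session_cost(60 * 60))  # 1-hour session
```

Decimal is used instead of float so that the half-way case in the 5-minute example (0.01305) rounds predictably.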
