DABStep

OpenReward Environment

Description

DABStep (Data Agent Benchmark for Multi-Step Reasoning) is an ORS environment for evaluating language model agents on data analysis tasks involving payment transaction data. Agents are given access to CSV files, JSON configuration data, and documentation, and must answer questions about fees, merchants, and payment processing by writing and executing Python/bash code in a sandboxed environment.

Based on the DABStep benchmark by Egg et al. (2025).

Note: the original paper did not provide the gold answers or the grading implementation; both were reverse-engineered.

Capabilities

  • Reading and parsing large CSV and JSON data files
  • Writing Python code for data analysis with pandas
  • Answering quantitative questions about payment transactions
  • Multi-step reasoning with file system exploration

Compute Requirements

Agents are given a sandbox with 0.5 CPUs and 1GB RAM, with pandas pre-installed.

License

ORLv1.

Tasks

There are 450 tasks, divided into an easy split and a hard split; a third split, all, combines both:

Split   Tasks   Description
easy    72      Simpler data lookup and aggregation questions
hard    378     Complex multi-step analysis questions
all     450     All tasks combined

Each task provides a question about payment transaction data along with guidelines for formatting the answer. The agent must explore the provided data files, write analysis code, and submit a final answer.

Reward Structure

This environment uses binary, verifiable rewards. Each submitted answer is normalized and matched against the ground truth; the matcher handles numbers (within a tolerance), lists, categorical values, and special formats such as card schemes and label-value pairs. A correct answer earns a reward of 1.0 and an incorrect answer earns 0.0.

We do not use LLM graders for this task.
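
To make the idea of rule-based binary grading concrete, here is a minimal sketch of a matcher that compares numbers within a tolerance and treats comma-separated answers as unordered lists. It is not the environment's grader: the function names, the 1% relative tolerance, and the order-insensitive list rule are all assumptions chosen for illustration.

```python
import math
import re


def normalize_text(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", value.strip().lower())


def parse_number(value: str):
    """Return a float if the string parses as a number, else None."""
    try:
        return float(value.strip())
    except ValueError:
        return None


def scalar_match(pred: str, gold: str, rel_tol: float) -> bool:
    """Compare two single values: numerically if both parse, else as text."""
    p, g = parse_number(pred), parse_number(gold)
    if p is not None and g is not None:
        return math.isclose(p, g, rel_tol=rel_tol, abs_tol=1e-9)
    return normalize_text(pred) == normalize_text(gold)


def grade(pred: str, gold: str, rel_tol: float = 1e-2) -> float:
    """Hypothetical binary grader: 1.0 on a match, 0.0 otherwise.

    Assumption: a comma in the gold answer marks a list, and list order
    does not matter. The real grader may apply different rules.
    """
    if "," in gold:
        gold_items = [s for s in gold.split(",") if s.strip()]
        remaining = [s for s in pred.split(",") if s.strip()]
        if len(gold_items) != len(remaining):
            return 0.0
        for g in gold_items:
            # Pair each gold item with one not-yet-used predicted item.
            hit = next((p for p in remaining if scalar_match(p, g, rel_tol)), None)
            if hit is None:
                return 0.0
            remaining.remove(hit)
        return 1.0
    return 1.0 if scalar_match(pred, gold, rel_tol) else 0.0
```

Under these assumptions, `grade("B, A", "a,b")` is treated as a match because case, spacing, and order are ignored, while `grade("10", "12")` is not because the difference exceeds the tolerance.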

Data

Agents are given access to the following files in the sandbox:

  • payments.csv (~23MB) - Payment transaction records
  • fees.json - Fee structure configuration
  • merchant_data.json - Merchant information
  • merchant_category_codes.csv - MCC lookup table
  • acquirer_countries.csv - Acquirer country mappings
  • manual.md - Documentation for the payment system
  • payments-readme.md - Data dictionary for payments.csv
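
A reasonable first step for an agent is to inspect the shape of these files before writing any analysis code. The sketch below uses only the Python standard library so that it runs anywhere; in the sandbox, pandas would be the natural choice. The helper names `csv_overview` and `json_overview` are hypothetical, and nothing here assumes particular column names or a particular JSON layout.

```python
import csv
import json


def csv_overview(path):
    """Return (column_names, row_count), streaming the file row by row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)  # never holds the whole file in memory
    return header, row_count


def json_overview(path):
    """Return (top_level_type, size, keys_of_first_record).

    Handles a top-level list of records or a single object; the actual
    layout of the environment's JSON files is not assumed.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (list, dict)) else 1
    first = data[0] if isinstance(data, list) and data else data
    keys = sorted(first.keys()) if isinstance(first, dict) else []
    return type(data).__name__, size, keys
```

In the sandbox this might be called as `csv_overview("payments.csv")` or `json_overview("fees.json")`, assuming the files sit in the current working directory; reading manual.md and payments-readme.md first explains what the reported columns and keys mean.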

Tools

Agents are given access to CLI tools for creating, viewing, and searching a filesystem (bash, glob, grep, ls, read, write, edit, multi_edit, todo_write). They also have one environment-specific tool:

  • answer: Submit the final answer for grading against ground truth.

Time Horizon

DABStep is a medium-horizon environment. Each task typically requires several tool calls to explore data, write analysis code, and submit an answer.

Other Environment Requirements

There are no further environment requirements beyond the OpenReward platform.

Safety

Agents in DABStep interact with synthetic payment data within a sandboxed environment. No real financial data or systems are involved.

Citations

@article{Egg2025DABstep,
  title={DABstep: Data Agent Benchmark for Multi-step Reasoning},
  author={Egg, Alex and Goyanes, Martin Iglesias and Kingma, Friso and Mora, Andreu and von Werra, Leandro and Wolf, Thomas},
  journal={arXiv preprint arXiv:2506.23719},
  year={2025}
}