DSBench
Description
DSBench is an ORS environment for evaluating data science agents on realistic data analysis and data modeling tasks. It is based on the DSBench benchmark (ICLR 2025), which collects tasks from ModelOff financial modeling competitions and Kaggle machine learning challenges.
The environment provides two variants:
- DSBenchAnalysis: Single-turn data analysis questions over Excel workbooks sourced from ModelOff competitions (2012-2017). The agent is given the workbook data, background context, and a question, and must submit a final answer.
- DSBenchModeling: Multi-step machine learning tasks sourced from Kaggle competitions. The agent is given a sandbox with training data, test data, and a sample submission file, and must produce a predictions CSV.
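Because the Modeling variant is graded on a predictions CSV, an agent may want to compare its file against the sample submission before submitting. The following is a minimal sketch of such a check, assuming the predictions should share the sample file's header and row count; the function name `check_submission` is hypothetical and is not part of the environment.

```python
import csv
import io


def check_submission(sample_csv: str, predictions_csv: str) -> list[str]:
    """Compare predictions CSV text against sample-submission CSV text.

    Returns a list of human-readable problems; an empty list means the
    header and the number of rows match. (Illustrative helper only.)
    """
    sample = list(csv.reader(io.StringIO(sample_csv)))
    preds = list(csv.reader(io.StringIO(predictions_csv)))

    if not sample or not preds:
        return ["one of the files is empty"]

    problems = []
    # The first row of each file is treated as the header.
    if sample[0] != preds[0]:
        problems.append(f"header mismatch: expected {sample[0]}, got {preds[0]}")
    # Data rows are everything after the header.
    if len(sample) != len(preds):
        problems.append(
            f"row count mismatch: expected {len(sample) - 1}, got {len(preds) - 1}"
        )
    return problems
```

In a sandbox, the two arguments would typically come from reading the sample submission and the generated predictions file as text, and the agent would submit only when the returned list is empty.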
Capabilities
- Financial data analysis and reasoning over complex Excel workbooks
- End-to-end machine learning modeling (data exploration, feature engineering, model training, prediction)
- Code execution in isolated sandboxes
- Evaluation against Kaggle competition metrics
Compute Requirements
- Analysis: No sandbox required (single-turn evaluation)
- Modeling: Agents are given a sandbox with 2 CPUs and 2GB of RAM
License
MIT.
Tasks
All tasks are in a single test split:
- Analysis: 466 data analysis questions across 38 task families from ModelOff competitions
- Modeling: 74 active machine learning tasks from Kaggle competitions (18 excluded due to data issues)
Reward Structure
Analysis: Binary reward (0 or 1) determined by an LLM judge comparing the agent's answer against the ground truth.
Modeling: Continuous reward obtained by normalizing the agent's metric score between baseline performance and ground-truth performance. For metrics where lower is better (e.g., RMSE), the normalization is inverted. Rewards are clamped to [0, 1].
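The Modeling reward described above can be sketched as follows. This is a minimal illustration, assuming a linear scaling in which the baseline score maps to 0 and the ground-truth score maps to 1; the function name, the `lower_is_better` flag, and the handling of a zero denominator are assumptions, not the environment's actual code.

```python
def normalized_reward(
    score: float,
    baseline: float,
    ground_truth: float,
    lower_is_better: bool = False,
) -> float:
    """Scale `score` so baseline -> 0 and ground truth -> 1, clamped to [0, 1]."""
    if lower_is_better:
        # Inverted form for error-style metrics such as RMSE: improvement
        # is a decrease from the baseline toward the ground-truth score.
        # (Algebraically equal to the other branch; written out to mirror
        # the description.)
        numerator = baseline - score
        denominator = baseline - ground_truth
    else:
        numerator = score - baseline
        denominator = ground_truth - baseline

    if denominator == 0:
        # Assumption: if baseline and ground truth coincide, give no reward.
        return 0.0

    # Clamp so that scores worse than baseline or better than ground truth
    # stay inside [0, 1].
    return max(0.0, min(1.0, numerator / denominator))
```

Under these assumptions, an accuracy of 0.8 with a baseline of 0.5 and a ground-truth score of 1.0 yields a reward of about 0.6, and any score worse than the baseline yields 0.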
Data
- Analysis data: Excel workbooks and task descriptions from ModelOff/Eloquence financial modeling competitions
- Modeling data: Kaggle competition datasets, stored in cloud storage and mounted into sandboxes
Tools
Analysis variant (1 tool):
- answer: Submit a final answer for grading
Modeling variant (6 tools):
- bash: Execute bash commands in the sandbox
- view: View file contents with optional line ranges
- str_replace: Replace text in files
- insert: Insert content at a line number
- create: Create a file with content
- answer: Submit a predictions CSV for evaluation
Other Environment Requirements
- Analysis: Requires an openai_api_key secret for LLM-based grading
- Modeling: Requires an api_key secret (OpenReward API key) for sandbox provisioning
Citations
@inproceedings{jing2025dsbench,
title={DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?},
author={Liqiang Jing and Zhehui Huang and Xiaoyang Wang and Wenlin Yao and Wenhao Yu and Kaixin Ma and Hongming Zhang and Xinya Du and Dong Yu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=DSsSPr0RZJ}
}