DS-1000

Description

DS-1000 is an environment for evaluating Python code completion across 7 popular data science libraries. Agents receive a coding problem with context and must write the missing code to complete the solution, which is then tested against automated test cases.

Capabilities

Python code generation and completion
Data science library usage (NumPy, Pandas, SciPy, Sklearn, Matplotlib, PyTorch, TensorFlow)
Interactive code debugging and testing via sandbox

Compute Requirements

Each task runs in a sandboxed Docker environment with a pre-built instance image. The default sandbox has 1 CPU and 2 GB RAM, with network access enabled. No GPU is required.

Tasks

Test split: 1,000 tasks (single split, matching the original DS-1000 dataset)
Each task is a code completion problem for one of 7 libraries: Pandas (291), NumPy (220), Matplotlib (155), Sklearn (115), SciPy (106), PyTorch (68), TensorFlow (45)
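
Because the libraries are unevenly represented, an overall pass rate is dominated by Pandas and NumPy. The sketch below uses the counts listed above to compute each library's share and to combine per-library pass rates into a task-weighted score. The function names are illustrative and are not part of the environment.

```python
# Per-library task counts, copied from the Tasks section above.
TASK_COUNTS = {
    "Pandas": 291,
    "NumPy": 220,
    "Matplotlib": 155,
    "Sklearn": 115,
    "SciPy": 106,
    "PyTorch": 68,
    "TensorFlow": 45,
}


def library_shares(counts=TASK_COUNTS):
    """Return each library's fraction of the total number of tasks."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def overall_pass_rate(pass_rates, counts=TASK_COUNTS):
    """Combine per-library pass rates (0.0 to 1.0) into one task-weighted rate.

    `pass_rates` is a hypothetical mapping from library name to pass rate.
    """
    total = sum(counts.values())
    return sum(pass_rates[name] * n for name, n in counts.items()) / total


if __name__ == "__main__":
    for name, share in library_shares().items():
        print(f"{name}: {share:.1%}")
```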

Reward Structure

Binary reward:

1.0: Submitted code passes all automated test cases (functional correctness + optional string validation)
0.0: Code fails tests, produces errors, or times out
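
The all-or-nothing rule above can be sketched as follows. This is a minimal illustration, not the environment's actual grader: it assumes the tests are available as a standalone Python program that exits with a non-zero status on any failure, and the function name and timeout value are hypothetical.

```python
import subprocess
import sys


def binary_reward(test_program: str, timeout_s: float = 30.0) -> float:
    """Run a test program in a fresh interpreter and map the outcome to a reward.

    Returns 1.0 only when the program exits cleanly. Failed assertions,
    uncaught exceptions, and timeouts all yield 0.0.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", test_program],
            capture_output=True,  # keep test output out of the caller's stdout
            timeout=timeout_s,    # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if completed.returncode == 0 else 0.0


if __name__ == "__main__":
    print(binary_reward("assert sum([1, 2, 3]) == 6"))
    print(binary_reward("assert sum([1, 2, 3]) == 7"))
```

There is no partial credit in this scheme: a solution that passes most test cases but fails one scores the same as a solution that does not run at all.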

Data

Source: xlangai/DS-1000 on HuggingFace
Splits: The original DS-1000 has no train/test split — all 1,000 problems are in a single "test" set. We preserve this as a single test split.
Format: Single parquet file with prompt, code context, reference code, and metadata
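
One way to inspect the source data is through the Hugging Face `datasets` library, as sketched below. The layout assumed here (a `metadata` dictionary with a `library` key on each record) describes the upstream dataset as far as we know and is an assumption. The parquet file shipped with this environment may name its fields differently.

```python
from collections import Counter


def load_ds1000():
    """Load the single "test" split from the Hugging Face Hub.

    Requires network access and the `datasets` package. The import is
    deferred so that the helper below works without either.
    """
    from datasets import load_dataset

    return load_dataset("xlangai/DS-1000", split="test")


def count_by_library(records):
    """Count tasks per library.

    Assumes each record carries record["metadata"]["library"]; adjust the
    key names if the actual schema differs.
    """
    return Counter(record["metadata"]["library"] for record in records)


if __name__ == "__main__":
    dataset = load_ds1000()
    for library, count in count_by_library(dataset).most_common():
        print(library, count)
```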

Tools

bash: Execute shell commands in the sandbox (for experimentation and testing)
submit: Submit a code completion for automated evaluation (terminal action, one attempt)

Time Horizon

Multi-turn. Agents can use multiple bash calls to experiment before submitting. Typical trajectories: 1–10 tool calls.

Environment Difficulty

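Difficulty varies by library and by perturbation type (the original DS-1000 dataset contains unperturbed problems alongside surface, semantic, and difficult-rewrite perturbations). Problems range from straightforward API usage to complex multi-step data transformations.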

Safety

Code is executed in an isolated sandbox environment. No access to external systems or sensitive data.

Citations

@inproceedings{lai2023ds1000,
  title={DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation},
  author={Lai, Yuhang and Li, Chengxi and Wang, Yiming and Zhang, Tianyi and Zhong, Ruiqi and Zettlemoyer, Luke and Yih, Scott Wen-tau and Fried, Daniel and Wang, Sida and Yu, Tao},
  booktitle={International Conference on Machine Learning},
  year={2023}
}

@article{bragg2025astabench,
  title={AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite},
  author={Bragg, Jonathan and D'Arcy, Mike and Balepur, Nishant and Bareket, Dan and Dalvi, Bhavana and Feldman, Sergey and Haddad, Dany and Hwang, Jena D. and Jansen, Peter and Kishore, Varsha and Majumder, Bodhisattwa Prasad and Naik, Aakanksha and Rahamimov, Sigal and Richardson, Kyle and Singh, Amanpreet and Surana, Harshit and Tiktinsky, Aryeh and Vasu, Rosni and Wiener, Guy and Anastasiades, Chloe and Candra, Stefan and Dunkelberger, Jason and Emery, Dan and Evans, Rob and Hamada, Malachi and Huff, Regan and Kinney, Rodney and Latzke, Matt and Lochner, Jaron and Lozano-Aguilera, Ruben and Nguyen, Cecile and Rao, Smita and Tanaka, Amber and Vlahos, Brooke and Clark, Peter and Downey, Doug and Goldberg, Yoav and Sabharwal, Ashish and Weld, Daniel S.},
  journal={arXiv preprint arXiv:2510.21652},
  year={2025}
}

Repository

Source repository

EnvCommons/DS1000

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	2 vCPUs / 8 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000640
Sandbox	$0.0000230
Total	$0.0000870

Examples

5-minute session: $0.0261

1-hour session: $0.3132
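
The example costs follow directly from the per-second rates in the table above. A small sketch for estimating other session lengths, with hypothetical constant and function names:

```python
# Per-second rates in USD, copied from the cost table above.
ENVIRONMENT_COST_PER_SECOND = 0.0000640
SANDBOX_COST_PER_SECOND = 0.0000230


def session_cost(seconds: float) -> float:
    """Estimate the cost in USD of a session lasting `seconds` seconds."""
    rate = ENVIRONMENT_COST_PER_SECOND + SANDBOX_COST_PER_SECOND
    return seconds * rate


if __name__ == "__main__":
    print(f"5-minute session: ${session_cost(5 * 60):.4f}")
    print(f"1-hour session: ${session_cost(60 * 60):.4f}")
```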
