ScienceAgentBench

OpenReward Environment

Description

ScienceAgentBench is an environment for evaluating language agents on data-driven scientific discovery tasks. Agents write and execute Python programs that process real-world scientific data and produce output files, which are graded by task-specific evaluation scripts.

Capabilities

  • Writing self-contained Python programs for scientific data analysis
  • Processing diverse data types: molecular structures (SDF), geospatial data (GeoJSON, GeoTIFF), microscopy images (TIFF), physiological signals (CSV), single-cell data (H5AD)
  • Training ML models, performing statistical analysis, and generating visualizations
  • Using domain-specific scientific libraries (deepchem, geopandas, rdkit, biopsykit, etc.)

Compute Requirements

Sandbox: 2 CPU / 4 GB memory per session. Network access enabled for package installation. Custom Docker image with pre-installed scientific Python packages.

License

Creative Commons Attribution 4.0 International (CC BY 4.0). See the HuggingFace dataset card.

Tasks

  • Split: test (102 tasks)
  • Domains: Computational Chemistry, Bioinformatics, Geographical Information Science, Psychology and Cognitive Science
  • Task types: Feature Engineering, Deep Learning, Statistical Analysis, Data Visualization, Geospatial Analysis, Computational Analysis

Each task provides a natural language instruction, domain knowledge, dataset file structure and preview, and access to the data files in a sandbox environment.
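
As an illustration only, the four components listed above could be held in a small record and joined into a single prompt. The `Task` class, its field names, and `build_prompt` are hypothetical and are not taken from the environment's API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical field names mirroring the components a task provides.
    instruction: str
    domain_knowledge: str
    dataset_structure: str
    dataset_preview: str


def build_prompt(task: Task) -> str:
    """Join the task components into a single prompt string."""
    sections = [
        ("Instruction", task.instruction),
        ("Domain knowledge", task.domain_knowledge),
        ("Dataset structure", task.dataset_structure),
        ("Dataset preview", task.dataset_preview),
    ]
    # Skip empty sections so the prompt contains no dangling headings.
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections if body)
```

Skipping empty sections keeps the prompt compact when a component such as the domain knowledge is left blank.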

Reward Structure

Binary reward (0.0 or 1.0) based on two criteria:

  1. Valid Execution Rate (VER): The agent's output file exists at the expected path
  2. Success Rate (SR): The output passes a task-specific evaluation script authored by domain experts

Reward is 1.0 only if both the VER and SR checks pass; otherwise it is 0.0.
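
The two-step check can be sketched roughly as follows. This is a minimal sketch, not the environment's actual grader: the function name `compute_reward`, the timeout, and the convention that an evaluation script signals success with exit code 0 are all assumptions.

```python
import os
import subprocess
import sys


def compute_reward(output_path: str, eval_script: str, timeout_s: int = 600) -> float:
    """Return 1.0 only if the output exists and the evaluation script passes."""
    # Criterion 1: the output file must exist at the expected path.
    if not os.path.isfile(output_path):
        return 0.0
    # Criterion 2 (assumed convention): the script exits with code 0 on success.
    try:
        result = subprocess.run(
            [sys.executable, eval_script, output_path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Treat an evaluation that does not finish in time as a failure.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

Checking for the file first avoids running the evaluation script when there is nothing to grade.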

Data

  • Source: osunlp/ScienceAgentBench (HuggingFace) + benchmark data from SharePoint (password: scienceagentbench)
  • Format: Per-task data files (CSV, SDF, GeoJSON, TIFF, H5AD, etc.) mounted in sandbox
  • Size: ~102 task-specific datasets, each ranging from kilobytes to megabytes

Tools

  • bash — Execute bash commands in the sandbox (write code, install packages, run programs)
  • submit — Terminal action: verify output file exists and run the task evaluation script

Time Horizon

Multi-turn. Agents iteratively explore data, write code, test, debug, and submit. Typical tasks require 10–50 tool calls.
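
A multi-turn episode of this kind might be driven by a loop like the one below. The `Policy` interface, `run_episode`, and the 50-call budget are hypothetical choices made for illustration; only the tool names `bash` and `submit` come from this document, and the real environment runs commands in its own sandbox rather than through a local subprocess.

```python
import subprocess
from typing import Callable, List, Tuple

# A policy maps the transcript so far to the next (tool, argument) pair.
# This interface is hypothetical, not the environment's actual API.
Policy = Callable[[List[str]], Tuple[str, str]]


def run_episode(policy: Policy, max_calls: int = 50) -> Tuple[bool, List[str]]:
    """Run tool calls until the policy submits or the call budget is spent."""
    transcript: List[str] = []
    for _ in range(max_calls):
        tool, arg = policy(transcript)
        if tool == "submit":
            # Terminal action: the environment would run its evaluation here.
            return True, transcript
        if tool == "bash":
            done = subprocess.run(
                ["bash", "-c", arg], capture_output=True, text=True, timeout=120
            )
            # Feed both output streams back so the agent can debug failures.
            transcript.append(done.stdout + done.stderr)
        else:
            transcript.append(f"unknown tool: {tool}")
    return False, transcript
```

The first return value reports whether the agent submitted before exhausting its budget, which separates unfinished episodes from failed submissions.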

Environment Difficulty

Challenging. Reported success rates for the best-performing agents are only about 34–42% (direct prompting / o1-preview). Tasks span four scientific disciplines and require both domain-specific knowledge and programming skills.

Safety

Tasks involve processing scientific data only and produce no safety-critical outputs. The sandbox is isolated, apart from the network access enabled for package installation.

Citations

@article{Chen2024ScienceAgentBench,
  title={ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery},
  author={Chen, Ziru and Chen, Shijie and Ning, Yuting and Zhang, Qianheng and Wang, Boshi and Yu, Botao and Li, Yifei and Liao, Zeyi and Wei, Chen and Lu, Zitong and Dey, Vishal and Xue, Mingyi and Baker, Frazier N. and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan},
  journal={arXiv preprint arXiv:2410.05080},
  year={2024}
}