Implementation of arXiv/scienceagentbench

ScienceAgentBench

Description

ScienceAgentBench is an environment for evaluating language agents on data-driven scientific discovery tasks. Agents write and execute Python programs that process real-world scientific data and produce output files, which are graded by task-specific evaluation scripts.

Capabilities

Writing self-contained Python programs for scientific data analysis
Processing diverse data types: molecular structures (SDF), geospatial data (GeoJSON, GeoTIFF), microscopy images (TIFF), physiological signals (CSV), single-cell data (H5AD)
Training ML models, performing statistical analysis, and generating visualizations
Using domain-specific scientific libraries (deepchem, geopandas, rdkit, biopsykit, etc.)

Compute Requirements

Sandbox: 2 CPU / 4 GB memory per session. Network access enabled for package installation. Custom Docker image with pre-installed scientific Python packages.

License

Creative Commons Attribution 4.0 International (CC BY 4.0). See the HuggingFace dataset card.

Tasks

Split: test (102 tasks)
Domains: Computational Chemistry, Bioinformatics, Geographical Information Science, Psychology and Cognitive Science
Task types: Feature Engineering, Deep Learning, Statistical Analysis, Data Visualization, Geospatial Analysis, Computational Analysis

Each task provides a natural language instruction, domain knowledge, dataset file structure and preview, and access to the data files in a sandbox environment.
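To make the four task components concrete, here is a minimal sketch of how they might be assembled into a single prompt. The `TaskSpec` class, its field names, and the example values are all hypothetical; they mirror the components listed above and are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    # Hypothetical field names mirroring the four components described
    # above; the real dataset may name and structure them differently.
    instruction: str
    domain_knowledge: str
    dataset_tree: str
    dataset_preview: str


def build_prompt(task: TaskSpec) -> str:
    """Join the task components into one prompt, skipping empty ones."""
    sections = [
        ("Task instruction", task.instruction),
        ("Domain knowledge", task.domain_knowledge),
        ("Dataset file structure", task.dataset_tree),
        ("Dataset preview", task.dataset_preview),
    ]
    return "\n\n".join(
        f"{title}:\n{body.strip()}" for title, body in sections if body.strip()
    )


# Placeholder values for illustration only; not taken from the benchmark.
example = TaskSpec(
    instruction="Plot the distribution of values in data.csv and save the figure.",
    domain_knowledge="",
    dataset_tree="data/\n  data.csv",
    dataset_preview="id,value\n1,0.5\n2,0.7",
)
prompt = build_prompt(example)
print(prompt)
```

Skipping empty sections keeps the prompt usable when a component such as domain knowledge is withheld or absent.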

Reward Structure

Binary reward (0.0 or 1.0) based on two criteria:

Valid Execution Rate (VER): The agent's output file exists at the expected path
Success Rate (SR): The output passes a task-specific evaluation script authored by domain experts

Reward is 1.0 only if both VER and SR pass.
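The two checks above can be sketched as a single function. This is an illustration of the stated rule, not the environment's actual grading code: `compute_reward` and the `non_empty` evaluator are hypothetical names, and treating an evaluator exception as a failure is an assumption.

```python
import os
import tempfile
from typing import Callable


def compute_reward(output_path: str, evaluate: Callable[[str], bool]) -> float:
    """Return 1.0 only if the output file exists and the evaluator accepts it."""
    # Check 1 (valid execution): the output file exists at the expected path.
    if not os.path.isfile(output_path):
        return 0.0
    # Check 2 (success): the task-specific evaluator passes.
    # Assumption: an evaluator that raises counts as a failure.
    try:
        passed = bool(evaluate(output_path))
    except Exception:
        passed = False
    return 1.0 if passed else 0.0


def non_empty(path: str) -> bool:
    # Hypothetical stand-in for a task-specific evaluation script.
    return os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "missing.csv")
    present = os.path.join(tmp, "pred.csv")
    with open(present, "w") as f:
        f.write("a,b\n1,2\n")
    reward_missing = compute_reward(missing, non_empty)
    reward_present = compute_reward(present, non_empty)
```

In this sketch the existence check runs first, so the evaluator is never called on a missing file.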

Data

Source: osunlp/ScienceAgentBench (HuggingFace) + benchmark data from SharePoint (password: scienceagentbench)
Format: Per-task data files (CSV, SDF, GeoJSON, TIFF, H5AD, etc.) mounted in sandbox
Size: ~102 task-specific datasets, varying from KBs to MBs each

Tools

bash — Execute bash commands in the sandbox (write code, install packages, run programs)
submit — Terminal action: verify output file exists and run the task evaluation script

Time Horizon

Multi-turn. Agents iteratively explore data, write code, test, debug, and submit. Typical tasks require 10–50 tool calls.
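The multi-turn flow can be sketched as a loop with a fixed tool-call budget. Everything here is hypothetical: `run_episode`, the `"SUBMIT"` sentinel, and the callables standing in for the `bash` and `submit` tools are illustrative and do not reflect the environment's real client API. Always submitting once the budget runs out is likewise an assumption.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (command, observed output) pairs


def run_episode(
    next_command: Callable[[History], str],
    bash: Callable[[str], str],
    submit: Callable[[], float],
    max_tool_calls: int = 50,
) -> Tuple[float, int]:
    """Run bash commands until the policy submits or the budget is spent."""
    history: History = []
    calls = 0
    # Reserve the final tool call in the budget for submit.
    while calls < max_tool_calls - 1:
        command = next_command(history)
        if command == "SUBMIT":
            break
        history.append((command, bash(command)))
        calls += 1
    reward = submit()  # terminal action: ends the episode
    return reward, calls + 1


# A scripted policy standing in for a language agent.
script = iter(["ls data", "python analysis.py", "SUBMIT"])
reward, n_calls = run_episode(
    next_command=lambda history: next(script),
    bash=lambda cmd: f"(ran: {cmd})",
    submit=lambda: 1.0,
)
```

Passing the tools in as callables keeps the loop testable without a live sandbox.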

Environment Difficulty

Challenging. In the original paper, the best-performing agent solves about 34% of tasks with expert-provided knowledge, and o1-preview with direct prompting and self-debug reaches about 42%. Tasks span four scientific disciplines and require both domain-specific knowledge and programming skill.

Safety

Tasks involve processing scientific data only. No safety-critical outputs. Sandbox is isolated with network access for package installation.

Citations

@article{Chen2024ScienceAgentBench,
  title={ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery},
  author={Chen, Ziru and Chen, Shijie and Ning, Yuting and Zhang, Qianheng and Wang, Boshi and Yu, Botao and Li, Yifei and Liao, Zeyi and Wei, Chen and Lu, Zitong and Dey, Vishal and Xue, Mingyi and Baker, Frazier N. and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan},
  journal={arXiv preprint arXiv:2410.05080},
  year={2024}
}

Repository

Source repository: EnvCommons/ScienceAgentBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	2 vCPUs / 4 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000460
Total	$0.0000780

Examples

5-minute session: $0.0234

1-hour session: $0.2808
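The example figures follow directly from the per-second rates in the cost table. A minimal sketch of the arithmetic, using `Decimal` to avoid floating-point rounding; the constant and function names are illustrative only:

```python
from decimal import Decimal

# Per-second rates taken from the cost table above (USD).
ENVIRONMENT_RATE = Decimal("0.0000320")
SANDBOX_RATE = Decimal("0.0000460")


def session_cost(seconds: int) -> Decimal:
    """Cost in USD of a session lasting the given number of seconds."""
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds


five_minutes = session_cost(5 * 60)   # 300 s at $0.0000780/s
one_hour = session_cost(60 * 60)      # 3600 s at $0.0000780/s
print(f"5-minute session: ${five_minutes:.4f}")
print(f"1-hour session:   ${one_hour:.4f}")
```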
