EXP-Bench
Description
EXP-Bench evaluates whether AI agents can independently conduct complete research experiments from published AI papers. Given a research question, method description, and an incomplete codebase, agents must formulate hypotheses, design experiments, implement code, execute experiments, and produce conclusions. Tasks are sourced from 50 papers across ICLR 2024 and NeurIPS 2024.
Capabilities
- Designing experiments with correct independent, dependent, and control variables
- Implementing experimental code from method descriptions
- Executing and debugging research code end-to-end
- Analyzing experimental results and drawing conclusions
- Working within existing codebases with masked source files
Compute Requirements
Agents are given a sandbox with 8GB of memory and up to 4 CPUs. No GPU is required. The sandbox has network access to clone GitHub repositories and install dependencies.
License
CC0 1.0 (Public Domain), following the original paper.
Tasks
There are 452 tasks across 50 papers from two conferences:
| Conference | Papers | Tasks |
|---|---|---|
| ICLR 2024 | 23 | ~215 |
| NeurIPS 2024 | 27 | ~237 |
Both train and test splits currently contain the same 452 tasks.
Each task provides:
- A research question to answer
- A method description outlining the experimental approach
- Optional agent instructions with step-by-step guidance
- A GitHub repository with key source files masked (removed)
The agent must reconstruct the masked code, run the experiment, and submit a JSON response with its experiment design and conclusion.
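The submission step can be sketched in Python. The field names below are illustrative assumptions; the exact response.json schema is defined per task:

```python
import json

# Hypothetical response payload: the grader expects an experiment design and a
# conclusion, but the exact key names used here are assumptions, not the real schema.
response = {
    "design": {
        "independent_variables": ["learning rate"],
        "dependent_variables": ["validation accuracy"],
        "constant_variables": ["model architecture", "dataset", "random seed"],
    },
    "conclusion": "Accuracy peaks at a learning rate of 1e-3 and degrades above it.",
}

# In the sandbox this file would live at /workspace/response.json, where the
# answer tool reads it; a relative path is used here to keep the sketch portable.
with open("response.json", "w") as f:
    json.dump(response, f, indent=2)
```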
Reward Structure
EXP-Bench uses a three-component grading system evaluated by an LLM judge (gpt-5.4 with reasoning). Each component is scored 0-100:
- Design score -- evaluates whether the agent correctly identified constant, independent, and dependent variables
- Conclusion score -- binary (0 or 100) based on semantic correctness of the conclusion
- Setup score -- evaluates the agent's code changes (git diff) against ground-truth implementation requirements
The final reward is the harmonic mean of the three scores, normalized to [0, 1].
If any component scores 0, the final reward is 0. This prevents agents from gaming a single dimension while neglecting others.
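The reward combination can be sketched as follows. This is a minimal illustration; the function name is an assumption, not the environment's actual code:

```python
def final_reward(design: float, conclusion: float, setup: float) -> float:
    """Harmonic mean of three 0-100 scores, normalized to [0, 1]."""
    scores = [design, conclusion, setup]
    if any(s <= 0 for s in scores):
        return 0.0  # a single zero component zeroes the whole reward
    harmonic = len(scores) / sum(1.0 / s for s in scores)
    return harmonic / 100.0  # map the 0-100 scale onto [0, 1]

print(final_reward(80, 100, 60))   # balanced scores earn a solid reward
print(final_reward(100, 100, 0))   # 0.0: excelling on two dimensions does not help
```

The harmonic mean is dominated by the smallest component, which is what makes neglecting any single dimension so costly.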
Data
Tasks are derived from 50 published AI papers at ICLR 2024 and NeurIPS 2024. Each task references a public GitHub repository. During setup, the environment clones the repository, creates a private backup, and masks (removes) specified source files that the agent must re-implement.
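The setup flow reads roughly like the sketch below. The paths and masked-file list are illustrative, not the environment's actual code:

```python
import shutil
from pathlib import Path

# Build a tiny stand-in repository so the sketch is self-contained; in the real
# environment this directory comes from cloning the task's GitHub repository.
repo = Path("task_repo")
(repo / "src").mkdir(parents=True, exist_ok=True)
(repo / "src" / "train.py").write_text("# ground-truth implementation\n")

# 1. Keep a private backup of the ground truth for later grading.
backup = Path("task_repo_backup")
shutil.copytree(repo, backup, dirs_exist_ok=True)

# 2. Mask (remove) the specified source files in the agent-visible copy.
masked_files = ["src/train.py"]  # per-task list; illustrative here
for rel in masked_files:
    (repo / rel).unlink()
```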
Ground-truth data includes experiment variable specifications (design_complexity), expected conclusions (conclusion_gt), and implementation requirements (requirements) used by the LLM judge for grading.
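A ground-truth record might look like the following. Only the top-level key names (design_complexity, conclusion_gt, requirements) come from the description above; the value shapes and contents are assumptions:

```python
# Illustrative ground-truth record used by the LLM judge; all values below are
# made-up examples, not data from any actual EXP-Bench task.
ground_truth = {
    "design_complexity": {
        "constant_variables": ["backbone", "training epochs"],
        "independent_variables": ["pruning ratio"],
        "dependent_variables": ["top-1 accuracy"],
    },
    "conclusion_gt": "Accuracy is stable up to moderate pruning, then drops sharply.",
    "requirements": [
        "Implement the pruning schedule in the masked source file",
        "Report accuracy after each pruning step",
    ],
}
```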
Tools
Agents have access to 6 tools:
| Tool | Description |
|---|---|
| bash | Execute bash commands in the sandbox (working directory: /workspace) |
| view | Read file contents with optional line range |
| str_replace | Replace text in a file (shows diff) |
| insert | Insert content at a specific line number (shows diff) |
| create | Create a new file with given content |
| answer | Submit final response for grading (reads /workspace/response.json) |
Time Horizon
EXP-Bench is a long-horizon, multi-turn environment. Agents typically require many tool calls to explore the codebase, implement missing code, install dependencies, run experiments, debug failures, and produce results. The agent must also create a reproducibility script (reproduce_exp_bench.sh).
Environment Difficulty
EXP-Bench is a challenging benchmark. The original paper reports that individual grading components reach only 20-35% accuracy, and that just 0.5% of end-to-end experiments succeed across evaluated models.
Other Environment Requirements
EXP-Bench requires:
- OpenAI API key (openai_api_key) -- for LLM-based grading
Safety
Agents operate within a sandboxed environment with network access (required for cloning repositories and installing dependencies). The environment does not expose agents to sensitive data or systems beyond public GitHub repositories. The primary risk is computational resource consumption from arbitrary code execution, which is mitigated by sandbox resource limits.
Citations
@inproceedings{kon2025expbench,
title = {EXP-Bench: Can AI Conduct AI Research Experiments?},
author = {Kon, Patrick Tser Jern and Liu, Jiachen and Zhu, Xinyi and Ding, Qiuyi and Peng, Jingjia and Xing, Jiarong and Huang, Yibo and Qiu, Yiming and Srinivasa, Jayanth and Lee, Myungjin and Chowdhury, Mosharaf and Zaharia, Matei and Chen, Ang},
booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
year = {2026},
url = {https://arxiv.org/abs/2505.24785}
}