ReplicationBench
Description
ReplicationBench is an environment for evaluating whether AI agents can replicate research results from astrophysics papers. Each task requires the agent to reproduce a specific scientific result from a published paper, spanning data analysis, numerical simulation, and computational physics, and is graded by comparison against the original paper's results.
This OpenReward implementation is ported from the original Harbor Framework version by Steven Dillmann.
Capabilities
- Implementing scientific simulations and numerical methods
- Reproducing computational physics results (e.g., Ewald summation, gravitational wave analysis)
- Data analysis and statistical computation
- Working with scientific Python libraries such as NumPy, SciPy, and Astropy (see the sketch below)
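For a flavor of the computations involved, here is a purely illustrative sketch using Astropy's bundled Planck 2018 cosmology; the quantity computed is not drawn from any benchmark task.

```python
from astropy.cosmology import Planck18

# Illustrative only: the kind of library call tasks build on.
# Comoving distance to redshift z = 1, returned as a Quantity in Mpc.
d_c = Planck18.comoving_distance(1.0)
print(f"Comoving distance at z=1: {d_c:.1f}")
```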
Compute Requirements
Agents are given a sandboxed environment with bash access and file-editing tools for scientific computing. The sandbox provides 1 CPU and 2 GB of RAM.
License
MIT.
Tasks
There is one split in this environment:
- Test: 90 research replication tasks
Tasks are drawn from astrophysics papers covering topics including cosmological simulations, galaxy clustering, gravitational wave detection, Bayesian calibration, and materials science.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent develops code iteratively and, before calling submit_answer to trigger automated verification, must write /app/result.json containing {"value": <result>}. The reward is 1.0 if the value matches the paper's expected output within a task-specific absolute tolerance, and 0.0 otherwise.
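A minimal sketch of what the pass/fail check plausibly looks like is below. The function name, expected value, and tolerance are placeholders; each task ships its own verification script (see Data) and may grade differently.

```python
import json

# Hypothetical grading sketch; the actual verification scripts ship in
# each task's tests/ directory and may differ. `expected` and `tolerance`
# are per-task placeholders, not real benchmark values.
def grade(result_path="/app/result.json", expected=0.3156, tolerance=1e-3):
    with open(result_path) as f:
        value = json.load(f)["value"]
    # Binary reward: 1.0 inside the absolute tolerance, 0.0 otherwise
    return 1.0 if abs(value - expected) <= tolerance else 0.0
```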
Data
Each task directory contains an instruction.md describing the specific result to reproduce and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
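For orientation, a hypothetical task layout is sketched below; beyond instruction.md and tests/, file names and contents vary by task.

```
<task_id>/
├── instruction.md   # describes the specific result to reproduce
└── tests/           # verification scripts run after submit_answer
```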
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification against the paper's results. |
Time Horizon
ReplicationBench is a multi-turn environment. Agents read paper specifications, implement scientific code, debug and test, and submit for verification.
Environment Difficulty
ReplicationBench is challenging. The original paper reports that even frontier models struggle to replicate astrophysics research:
| Model | Score | Task Completion |
|---|---|---|
| Claude 3.7 Sonnet | 19.3% | 93% |
| Claude 4 Sonnet | 18.3% | 93% |
| OpenAI o3 | 13.6% | - |
| Gemini 2.5 Pro | 10.6% | 58% |
| OpenAI o4-mini | 8.0% | 50% |
Task completion rates vary significantly, and some agents submit prematurely without attempting all tasks. When forced to guess without doing any computation, models answered fewer than 9% of tasks correctly, suggesting the scores reflect genuine computational work rather than memorization.
Other Environment Requirements
There are no further environment requirements; ReplicationBench works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in ReplicationBench replicate published scientific results in a sandboxed environment. The environment does not present direct safety risks.
Citations
```bibtex
@article{ye2025replicationbench,
  author  = {Christine Ye and Sihan Yuan and Suchetha Cooray and Steven Dillmann and Ian L. V. Roque and Dalya Baron and Philipp Frank and Sergio Martin-Alvarez and Nolan Koblischke and Frank J. Qu and Diyi Yang and Risa Wechsler and Ioana Ciuca},
  title   = {ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?},
  journal = {arXiv preprint arXiv:2510.24591},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.24591}
}
```