ReplicationBench
Description
ReplicationBench is an environment for evaluating whether AI agents can replicate research results from astrophysics papers. Each task requires the agent to reproduce a specific scientific result from a published paper, spanning data analysis, numerical simulation, and computational physics, and is graded by comparison against the original paper's results.
This OpenReward implementation is ported from the original Harbor Framework version by Steven Dillmann.
Capabilities
- Implementing scientific simulations and numerical methods
- Reproducing computational physics results (e.g., Ewald summation, gravitational wave analysis)
- Data analysis and statistical computation
- Working with scientific Python libraries such as NumPy, SciPy, and Astropy (see the sketch below)
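For a flavor of the computations involved, here is a purely illustrative sketch using Astropy's bundled Planck 2018 cosmology; the quantity computed is not drawn from any benchmark task.

```python
from astropy.cosmology import Planck18

# Illustrative only: the kind of library call tasks build on.
# Comoving distance to redshift z = 1, returned as a Quantity in Mpc.
d_c = Planck18.comoving_distance(1.0)
print(f"Comoving distance at z=1: {d_c:.1f}")
```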
Compute Requirements
Agents are given a sandboxed environment with bash access and file-editing tools for scientific computing. The sandbox provides 1 CPU and 2 GB of RAM.
License
MIT.
Tasks
There is one split in this environment:
- Test: 90 research replication tasks
Tasks are drawn from astrophysics papers covering topics including cosmological simulations, galaxy clustering, gravitational wave detection, Bayesian calibration, and materials science.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent develops code iteratively and, before calling submit_answer to trigger automated verification, must write /app/result.json containing {"value": <result>}. The reward is 1.0 if the value matches the paper's expected output within a task-specific absolute tolerance, and 0.0 otherwise.
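A minimal sketch of what the pass/fail check plausibly looks like is below. The function name, expected value, and tolerance are placeholders; each task ships its own verification script (see Data) and may grade differently.

```python
import json

# Hypothetical grading sketch; the actual verification scripts ship in
# each task's tests/ directory and may differ. `expected` and `tolerance`
# are per-task placeholders, not real benchmark values.
def grade(result_path="/app/result.json", expected=0.3156, tolerance=1e-3):
    with open(result_path) as f:
        value = json.load(f)["value"]
    # Binary reward: 1.0 inside the absolute tolerance, 0.0 otherwise
    return 1.0 if abs(value - expected) <= tolerance else 0.0
```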
Data
Each task directory contains an instruction.md describing the specific result to reproduce and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
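For orientation, a hypothetical task layout is sketched below; beyond instruction.md and tests/, file names and contents vary by task.

```
<task_id>/
├── instruction.md   # describes the specific result to reproduce
└── tests/           # verification scripts run after submit_answer
```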
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification against the paper's results. |
Time Horizon
ReplicationBench is a multi-turn environment. Agents read paper specifications, implement scientific code, debug and test, and submit for verification.
Environment Difficulty
ReplicationBench is challenging. The original paper reports that even frontier models struggle to replicate astrophysics research:
| Model | Score | Task Completion |
|---|---|---|
| Claude 3.7 Sonnet | 19.3% | 93% |
| Claude 4 Sonnet | 18.3% | 93% |
| OpenAI o3 | 13.6% | - |
| Gemini 2.5 Pro | 10.6% | 58% |
| OpenAI o4-mini | 8.0% | 50% |
Task completion rates vary significantly, and some agents submit prematurely without attempting all tasks. When forced to guess without doing any computation, models answered fewer than 9% of tasks correctly, suggesting the scores reflect genuine computational work rather than memorization.
Other Environment Requirements
There are no further environment requirements; ReplicationBench works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in ReplicationBench replicate published scientific results in a sandboxed environment. The environment does not present direct safety risks.
Citations
```bibtex
@article{ye2025replicationbench,
  author  = {Christine Ye and Sihan Yuan and Suchetha Cooray and Steven Dillmann and Ian L. V. Roque and Dalya Baron and Philipp Frank and Sergio Martin-Alvarez and Nolan Koblischke and Frank J. Qu and Diyi Yang and Risa Wechsler and Ioana Ciuca},
  title   = {ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?},
  journal = {arXiv preprint arXiv:2510.24591},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.24591}
}
```