ReplicationBench

⭐ OpenReward Environment

Description

ReplicationBench is an environment for evaluating whether AI agents can replicate research results from astrophysics papers. Each task requires the agent to reproduce a specific scientific result from a published paper, spanning data analysis, numerical simulation, and computational physics. Submissions are graded by comparison against the original paper's results.

This OpenReward implementation is ported from the original Harbor Framework implementation by Steven Dillmann.

Capabilities

  • Implementing scientific simulations and numerical methods
  • Reproducing computational physics results (e.g., Ewald summation, gravitational wave analysis)
  • Data analysis and statistical computation
  • Working with scientific Python libraries (NumPy, SciPy, Astropy, etc.)
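As an illustration of the kind of small numerical kernel these capabilities imply, here is a minimal sketch of a direct-sum gravitational potential energy computation. This is not taken from any specific ReplicationBench task; the function name and interface are assumptions for illustration only.

```python
import math

def potential_energy(pos, mass, G=1.0):
    """Total pairwise gravitational potential energy of N point masses.

    pos  -- list of coordinate tuples/lists, one per particle
    mass -- list of particle masses
    G    -- gravitational constant (natural units by default)
    """
    U = 0.0
    n = len(mass)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(pos[i], pos[j])  # Euclidean separation
            U -= G * mass[i] * mass[j] / r
    return U
```

Real tasks layer methods such as Ewald summation or spectral analysis on top of kernels like this, typically via NumPy, SciPy, or Astropy.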

Compute Requirements

Agents are given a sandboxed environment with bash access and file-editing tools for scientific computing. Each sandbox is allocated 1 CPU and 2 GB of RAM.

License

MIT.

Tasks

There is one split in this environment:

  • Test: 90 research replication tasks

Tasks are drawn from astrophysics papers covering topics including cosmological simulations, galaxy clustering, gravitational wave detection, Bayesian calibration, and materials science.

Reward Structure

This is a multi-turn, sandbox-based environment. The agent develops code iteratively, then calls submit_answer to trigger automated verification. Before submitting, the agent must write /app/result.json containing {"value": <result>}. The reward is 1.0 if the value matches the paper's expected output within a task-specific absolute tolerance, and 0.0 otherwise.

Data

Each task directory contains an instruction.md describing the specific result to reproduce and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.

Tools

| Tool | Description |
| --- | --- |
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification against the paper's results. |
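A typical episode chains these tools together. The sketch below shows one plausible call sequence; the tool names come from the table above, but the argument schema is an assumption made for illustration.

```python
# Hypothetical shape of an agent's tool-call sequence for one task.
# Tool names match the table above; the "args" schema is assumed.
calls = [
    {"tool": "bash", "args": {"command": "cat instruction.md"}},
    {"tool": "create_file", "args": {"path": "/app/solve.py", "content": "# analysis code"}},
    {"tool": "bash", "args": {"command": "python /app/solve.py"}},
    {"tool": "view", "args": {"path": "/app/result.json"}},
    {"tool": "submit_answer", "args": {}},
]
```

The key constraint is ordering: /app/result.json must exist before submit_answer is called, since submission triggers verification immediately.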

Time Horizon

ReplicationBench is a multi-turn environment. Agents read paper specifications, implement scientific code, debug and test, and submit for verification.

Environment Difficulty

ReplicationBench is challenging. The original paper reports that even frontier models struggle to replicate astrophysics research:

| Model | Score | Task Completion |
| --- | --- | --- |
| Claude 3.7 Sonnet | 19.3% | 93% |
| Claude 4 Sonnet | 18.3% | 93% |
| OpenAI o3 | 13.6% | - |
| Gemini 2.5 Pro | 10.6% | 58% |
| OpenAI o4-mini | 8.0% | 50% |

Task completion rates vary significantly, with some agents submitting prematurely rather than attempting all tasks. When forced to guess without performing any computation, models achieved under 9% correct answers, suggesting the scores above reflect genuine computational work rather than memorization.

Other Environment Requirements

There are no further environment requirements; ReplicationBench works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in ReplicationBench replicate published scientific results in a sandboxed environment. The environment does not present direct safety risks.

Citations

@article{ye2025replicationbench,
  author    = {Christine Ye and Sihan Yuan and Suchetha Cooray and Steven Dillmann and Ian L. V. Roque and Dalya Baron and Philipp Frank and Sergio Martin-Alvarez and Nolan Koblischke and Frank J. Qu and Diyi Yang and Risa Wechsler and Ioana Ciuca},
  title     = {ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?},
  journal   = {arXiv preprint arXiv:2510.24591},
  year      = {2025},
  url       = {https://arxiv.org/abs/2510.24591}
}