PaperBench

OpenReward Environment
Description

PaperBench is an environment for evaluating language model agents on their ability to replicate machine learning research papers. Based on the PaperBench benchmark from OpenAI, it gives each agent a research paper and requires a complete reproduction: implementing the paper's methods, executing experiments, and generating results that match the original findings. Tasks are drawn from ICML 2024 papers spanning diverse ML topics, with detailed hierarchical rubrics co-developed with the original paper authors.

Capabilities

  • Reading and understanding complex ML research papers
  • Implementing algorithms and models described in papers from scratch
  • Setting up experimental pipelines and running experiments on GPU
  • Iterating on code using bash, file viewing, and editing tools
  • Producing a self-contained reproduction script (reproduce.sh) and submission repository

Compute Requirements

Agents in PaperBench are given a sandbox with an NVIDIA L4 GPU, with internet access enabled. The sandbox includes a pre-configured Python virtual environment and Docker for building custom environments if needed.

License

MIT.

Tasks

There are 2 splits with 23 tasks total, each corresponding to a distinct ICML 2024 paper:

| Split | Tasks | Type       | Description                              |
|-------|-------|------------|------------------------------------------|
| dev   | 3     | validation | Small subset for development testing     |
| main  | 20    | test       | Full evaluation set of ICML 2024 papers  |

Each task provides the agent with a research paper (in PDF and markdown format), an addendum with clarifications, and a blacklist of resources the agent must not use (e.g., the paper's original codebase). The agent must produce a git repository at /home/agent/submission/ containing source code and a reproduce.sh script.
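As a concrete illustration, an agent might run a sanity check along these lines before submitting. The helper below is hypothetical (it is not part of the environment's tooling); it simply verifies the two requirements described above:

```python
import os

def check_submission(root: str) -> list[str]:
    """Return a list of problems with a candidate submission directory.

    Checks the two stated requirements: the directory must be a git
    repository and must contain a reproduce.sh entry point.
    """
    problems = []
    if not os.path.isdir(os.path.join(root, ".git")):
        problems.append("not a git repository (missing .git/)")
    if not os.path.isfile(os.path.join(root, "reproduce.sh")):
        problems.append("missing reproduce.sh")
    return problems

# Example: check_submission("/home/agent/submission")
```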

Reward Structure

This is a sparse reward environment with continuous scoring. The answer tool triggers grading at the end of the episode, while intermediate tool calls (bash, view, str_replace, insert, create) return a reward of 0.

Grading uses an LLM-as-judge (o3-mini with high reasoning effort) that evaluates the agent's submission against a hierarchical rubric. The rubric decomposes each paper into fine-grained requirements across three categories:

| Category         | Scoring         | Description                                        |
|------------------|-----------------|----------------------------------------------------|
| Code Development | Binary (0 or 1) | Was the algorithm/method correctly implemented?    |
| Code Execution   | Binary (0 or 1) | Did the code run successfully via reproduce.sh?    |
| Result Analysis  | Binary (0 or 1) | Do reproduced results match the paper's findings?  |

Parent nodes in the rubric receive weighted averages of their children's scores, propagating up to a single root score between 0 and 1 that serves as the final reward.
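The propagation rule can be sketched as a recursive weighted average over the rubric tree. The node layout below is an assumption for illustration only; the actual rubric.json schema may differ:

```python
def rubric_score(node: dict) -> float:
    """Score a rubric node: leaves carry a binary 0/1 judgment, parents
    take the weighted average of their children's scores."""
    children = node.get("children")
    if not children:  # leaf: binary score assigned by the LLM judge
        return float(node["score"])
    total_weight = sum(c.get("weight", 1.0) for c in children)
    return sum(c.get("weight", 1.0) * rubric_score(c) for c in children) / total_weight

# A toy one-level rubric: root score is (1*1 + 2*0 + 1*1) / 4 = 0.5
rubric = {
    "children": [
        {"weight": 1, "score": 1},
        {"weight": 2, "score": 0},
        {"weight": 1, "score": 1},
    ]
}
print(rubric_score(rubric))  # 0.5
```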

Data

Each task contains a paper directory with:

  • paper.pdf and paper.md — the research paper
  • addendum.md — clarifications and scope notes
  • blacklist.txt — resources the agent must not access
  • rubric.json — hierarchical grading rubric (removed from the agent's sandbox to prevent cheating; used server-side for grading)

Paper data is stored on the OpenReward platform. The rubric is never exposed to the agent.

Tools

Agents are given six tools:

  • bash: Execute shell commands in the sandbox (with Python virtualenv auto-activated)
  • view: View file contents with optional line range
  • str_replace: Find and replace text in files
  • insert: Insert content at a specific line number
  • create: Create a new file with given content
  • answer: Submit the final reproduction for grading (terminates the episode)
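To make the editing tools concrete, here is a minimal sketch of str_replace-style semantics, assuming the common safeguard that the target string must occur exactly once (the environment's actual tool behavior is not specified here):

```python
def str_replace(path: str, old: str, new: str) -> None:
    """Replace a unique occurrence of `old` with `new` in the file at `path`.

    Refuses ambiguous edits: the target must appear exactly once.
    """
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly 1 occurrence of target, found {count}")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new))
```

Requiring uniqueness prevents a single call from silently editing the wrong occurrence in a large source file.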

Time Horizon

PaperBench is a long-horizon, multi-turn environment. Agents must read a paper, implement its methods, run experiments, and iterate on their code. The default reproduction timeout is 12 hours. There is no limit on the number of tool calls; the agent decides when to call answer.

Environment Difficulty

Results from the original PaperBench paper using BasicAgent with a 12-hour time limit (average replication score across 20 papers):

| Model                    | Score |
|--------------------------|-------|
| Claude 3.5 Sonnet (New)  | 21.0% |
| o1-high                  | 13.2% |
| DeepSeek-R1              | 6.0%  |
| GPT-4o                   | 4.1%  |
| Gemini 2.0 Flash         | 3.2%  |
| o3-mini-high             | 2.6%  |

With an iterative agent scaffold (no early exit), o1-high achieves 24.4% and reaches 26.0% with a 36-hour time limit. For comparison, ML PhDs achieved 41.4% on a 3-paper subset after 48 hours of work.

Other Environment Requirements

PaperBench requires the following secrets to be passed via the session:

  • openai_api_key — Required. Used server-side by the SimpleJudge (o3-mini) for grading, and passed into the sandbox so agents can use the OpenAI API during reproduction.
  • hf_token — Required. Passed into the sandbox so agents can download datasets and model weights from HuggingFace.

Safety

Agents in PaperBench operate within isolated sandboxes with full internet access enabled (required for downloading datasets and model weights). The sandbox is destroyed after the session ends. The rubric is removed from the agent's environment to prevent the agent from gaming the evaluation. The primary safety consideration is that agents execute arbitrary code with GPU access and network connectivity, which is contained by the sandbox.

Citations

@article{starace2025paperbench,
  title={PaperBench: Evaluating AI's Ability to Replicate AI Research},
  author={Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and Heidecke, Johannes and Glaese, Amelia and Patwardhan, Tejal},
  journal={arXiv preprint arXiv:2504.01848},
  year={2025}
}