research-code-bench

ResearchCodeBench

OpenReward Environment

Description

ResearchCodeBench is an environment for evaluating agents on implementing novel machine learning research code. It is based on the ResearchCodeBench benchmark: agents are given coding challenges derived from recent ML papers (ICLR, NeurIPS, CVPR 2024-2025) and must implement the described algorithms within a sandboxed container. Verification runs a test suite that checks the agent's implementation against expected outputs.

This OpenReward implementation is ported from the Harbor Framework implementation originally made by Qi.

Capabilities

  • Implementing code from research paper descriptions
  • Reproducing research results in sandboxed environments
  • Multi-step software engineering for scientific computing
  • Understanding and translating academic papers into working code

Compute Requirements

Agents in ResearchCodeBench are given a sandbox with 1 CPU and 2 GB RAM.

License

CC BY-SA 4.0.

Tasks

There is one split: test (212 tasks). Each task corresponds to a research paper code reproduction challenge with a dedicated Docker image containing the project setup, instruction file, and test suite.

Reward Structure

This is a sparse, verifiable reward environment. The agent implements code snippets based on research paper descriptions, then calls submit_answer to trigger verification. Verification runs the task's pytest suite, which compares the agent's implementation against a reference implementation by checking that their outputs are numerically equivalent. The reward is binary:

  • 1.0: All pytest tests pass (agent's implementation produces outputs matching the reference).
  • 0.0: Any test fails (implementation is incorrect or crashes).

We do not use LLM graders for this task.
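
The all-or-nothing scoring described above can be sketched as follows. This is a minimal illustration, not the environment's actual verifier: the function names `outputs_match` and `reward` and the tolerance values are assumptions chosen for the example, and the real test suites may compare tensors or other structures with different tolerances.

```python
import math
from typing import Sequence


def outputs_match(
    agent_out: Sequence[float],
    reference_out: Sequence[float],
    rel_tol: float = 1e-6,   # assumed tolerance, not taken from the benchmark
    abs_tol: float = 1e-8,   # assumed tolerance, not taken from the benchmark
) -> bool:
    """Return True if two flat sequences of floats agree element by element."""
    if len(agent_out) != len(reference_out):
        return False
    return all(
        math.isclose(a, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, r in zip(agent_out, reference_out)
    )


def reward(test_results: Sequence[bool]) -> float:
    """Sparse reward: 1.0 only if there is at least one test and all pass."""
    return 1.0 if test_results and all(test_results) else 0.0


# Hypothetical outputs from an agent implementation and a reference one.
results = [
    outputs_match([0.5, 1.5], [0.5, 1.5]),
    outputs_match([1.0, 2.0], [1.0, 2.5]),  # second element differs
]
score = reward(results)
```

Because a single failing test zeroes the score, partially correct implementations earn the same reward as crashes.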

Data

Tasks are derived from the ResearchCodeBench benchmark, which provides coding challenges from 20 recent ML papers at top-tier venues. Each task has a dedicated Docker image, instruction file, and tests. Data files are stored on the OpenReward platform.

Tools

Agents are given five tools:

  • bash: Run a bash command in the container.
  • str_replace: Replace a unique string in a file with another string.
  • view: View file contents or directory listings.
  • create_file: Create a new file with specified content.
  • submit_answer: Submit the final answer, triggering the test suite to run. Returns the test output and reward. This tool can only be called once per task.
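
The "unique string" requirement of str_replace can be illustrated with a short sketch. The function `str_replace_once`, its error behavior, and the file name `model.py` are hypothetical; how the real tool reports a missing or ambiguous match is not specified here.

```python
import tempfile
from pathlib import Path


def str_replace_once(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in the file, but only if `old` occurs exactly once."""
    file = Path(path)
    text = file.read_text()
    count = text.count(old)
    if count != 1:
        # Assumed behavior: refuse ambiguous or missing matches.
        raise ValueError(f"expected exactly one occurrence, found {count}")
    file.write_text(text.replace(old, new, 1))


# Demonstration on a throwaway file standing in for a task's source file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "model.py"
    target.write_text("def loss(x):\n    pass  # TODO\n")
    str_replace_once(str(target), "pass  # TODO", "return x ** 2")
    edited = target.read_text()
```

Requiring a unique match means the edit is unambiguous: the agent has to include enough surrounding context to pin down a single location.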

Time Horizon

ResearchCodeBench is a multi-turn environment. The agent iterates using bash, view, str_replace, and create_file tools to implement the research code before submitting the final answer.
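
The shape of one episode can be sketched with a toy stand-in. The class `EpisodeSketch`, its `call` method, and the file names are hypothetical and are not the OpenReward client API; the sketch only models that the working tools may be called repeatedly while submit_answer may be called once.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeSketch:
    """Toy model of one task episode (illustrative only)."""
    submitted: bool = False
    history: list = field(default_factory=list)

    def call(self, tool: str, **kwargs) -> str:
        if tool == "submit_answer":
            if self.submitted:
                raise RuntimeError("submit_answer can only be called once per task")
            self.submitted = True
            self.history.append((tool, kwargs))
            return "verification triggered"
        self.history.append((tool, kwargs))
        return f"{tool} executed"


episode = EpisodeSketch()
episode.call("view", path="instruction.md")            # read the task (hypothetical path)
episode.call("str_replace", path="model.py",
             old="pass", new="return x")               # edit the code
episode.call("bash", command="python -m pytest -q")    # check locally before submitting
final = episode.call("submit_answer")                  # one-shot verification
```

Since the submission is final, running whatever checks are available through bash before calling submit_answer is the only way to catch errors in time.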

Environment Difficulty

Model performance on ResearchCodeBench from the original paper:

Model                     Pass Rate
Gemini-2.5-Pro-Preview    37.3%
O3 (High)                 32.3%
O4-mini (High)            30.8%

Even the best models correctly implement less than 40% of the code. 43 tasks in ResearchCodeBench-HARD have a 0% pass rate across all 32 evaluated models.

Other Environment Requirements

There are no further environment requirements; ResearchCodeBench works out of the box with the OpenReward endpoint.

Safety

Agents in ResearchCodeBench implement research code inside sandboxed Docker containers. The environment does not present direct safety risks, as agents only interact with isolated containers with no access to external systems beyond the sandbox.

Citations

@inproceedings{hua2025researchcodebench,
  title={ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code},
  author={Hua, Tianyu and Hua, Harper and Xiang, Violet and Klieger, Benjamin and Truong, Sang T. and Liang, Weixin and Sun, Fan-Yun and Haber, Nick},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track},
  year={2025},
  url={https://arxiv.org/abs/2506.02314}
}