research-code-bench
ResearchCodeBench
Description
ResearchCodeBench is an environment for evaluating agents on implementing novel machine learning research code. It is based on the ResearchCodeBench benchmark: agents are given coding challenges derived from recent ML papers (ICLR, NeurIPS, CVPR 2024-2025) and must implement the described algorithms within a sandboxed container. Verification runs a test suite that checks the agent's implementation against expected outputs.
This OpenReward implementation is ported from the Harbor Framework implementation originally made by Qi.
Capabilities
- Implementing code from research paper descriptions
- Reproducing research results in sandboxed environments
- Multi-step software engineering for scientific computing
- Understanding and translating academic papers into working code
Compute Requirements
Agents in ResearchCodeBench are given a sandbox with 1 CPU and 2 GB RAM.
License
Tasks
There is one split: test (212 tasks). Each task corresponds to a research paper code reproduction challenge with a dedicated Docker image containing the project setup, instruction file, and test suite.
Reward Structure
This is a sparse, verifiable reward environment. The agent implements code snippets based on research paper descriptions, then calls submit_answer to trigger verification. The test suite runs pytest to compare the agent's implementation against a reference implementation, checking for numerical equivalence of outputs.
- 1.0: All pytest tests pass (agent's implementation produces outputs matching the reference).
- 0.0: Any test fails (implementation is incorrect or crashes).
We do not use LLM graders for this task.
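The binary reward described above can be sketched in a few lines. This is an illustration only, not the environment's actual verifier: the function names `outputs_match` and `reward_from_pytest_exit_code` and the tolerance values are assumptions introduced here. The one fact it relies on is that pytest exits with code 0 only when all collected tests pass.

```python
import math
from typing import Sequence


def outputs_match(
    actual: Sequence[float],
    expected: Sequence[float],
    rel_tol: float = 1e-5,   # assumed tolerance, not taken from the benchmark
    abs_tol: float = 1e-8,   # assumed tolerance, not taken from the benchmark
) -> bool:
    """Element-wise numerical comparison of two flat output sequences."""
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


def reward_from_pytest_exit_code(exit_code: int) -> float:
    """Map a pytest exit code to the sparse reward: 1.0 on success, else 0.0."""
    # Any non-zero code (failed tests, errors, crashes, no tests collected)
    # is treated as a failure in this sketch.
    return 1.0 if exit_code == 0 else 0.0
```

A check like `outputs_match` is the kind of assertion a test in the suite might make against the reference implementation; the reward is all-or-nothing over the whole suite rather than a fraction of passing tests.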
Data
Tasks are derived from the ResearchCodeBench benchmark, which provides coding challenges from 20 recent ML papers at top-tier venues. Each task has a dedicated Docker image, instruction file, and tests. Data files are stored on the OpenReward platform.
Tools
Agents are given five tools:
- bash: Run a bash command in the container.
- str_replace: Replace a unique string in a file with another string.
- view: View file contents or directory listings.
- create_file: Create a new file with specified content.
- submit_answer: Submit the final answer, triggering the test suite to run. Returns the test output and reward. This tool can only be called once per task.
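The str_replace tool is described as replacing a unique string in a file. A minimal sketch of that contract follows, assuming the tool rejects a target that is missing or appears more than once; the source does not state how the real tool handles those cases, and the helper names `replace_unique` and `str_replace_file` are hypothetical.

```python
from pathlib import Path


def replace_unique(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, requiring exactly one occurrence of `old`."""
    if not old:
        raise ValueError("target string must be non-empty")
    count = text.count(old)  # counts non-overlapping occurrences
    if count != 1:
        # Assumed behavior: refuse ambiguous or missing targets.
        raise ValueError(f"expected exactly one occurrence, found {count}")
    return text.replace(old, new, 1)


def str_replace_file(path: str, old: str, new: str) -> None:
    """Apply `replace_unique` to a file in place (hypothetical helper)."""
    file_path = Path(path)
    file_path.write_text(replace_unique(file_path.read_text(), old, new))
```

Requiring uniqueness means an edit cannot silently land on the wrong occurrence, so the agent has to include enough surrounding context in the target string to pin down one location.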
Time Horizon
ResearchCodeBench is a multi-turn environment. The agent iterates using bash, view, str_replace, and create_file tools to implement the research code before submitting the final answer.
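The multi-turn flow with a single final submission can be sketched as a small dispatcher. This is a hypothetical illustration, not the OpenReward client API: the class `SingleSubmitEpisode` and its tool callables are stand-ins, and ending the episode after submit_answer is an assumption that goes slightly beyond the stated once-per-task rule.

```python
from typing import Callable, Dict


class SingleSubmitEpisode:
    """Hypothetical wrapper that routes tool calls and allows one submission."""

    def __init__(self, tools: Dict[str, Callable[..., str]]) -> None:
        self.tools = tools
        self.submitted = False
        self.turns = 0

    def call(self, name: str, **kwargs) -> str:
        # Assumption: no further tool calls are accepted after submission.
        if self.submitted:
            raise RuntimeError("episode is over: submit_answer was already called")
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        self.turns += 1
        if name == "submit_answer":
            self.submitted = True
        return self.tools[name](**kwargs)


# Usage with stub tools standing in for the real sandbox.
episode = SingleSubmitEpisode({
    "bash": lambda command: f"ran: {command}",
    "submit_answer": lambda: "tests finished",
})
episode.call("bash", command="pytest -q")  # iterate as many turns as needed
episode.call("submit_answer")              # a second call would raise
```

Because the submission is final, the practical consequence for an agent is to run whatever local checks it can through bash before calling submit_answer.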
Environment Difficulty
Model performance on ResearchCodeBench from the original paper:
| Model | Pass Rate |
|---|---|
| Gemini-2.5-Pro-Preview | 37.3% |
| O3 (High) | 32.3% |
| O4-mini (High) | 30.8% |
Even the best model correctly implements less than 40% of the code. In ResearchCodeBench-HARD, 43 tasks have a 0% pass rate across all 32 evaluated models.
Other Environment Requirements
There are no further environment requirements; ResearchCodeBench works out of the box with the OpenReward endpoint.
Safety
Agents in ResearchCodeBench implement research code inside sandboxed Docker containers. The environment does not present direct safety risks, as agents only interact with isolated containers with no access to external systems beyond the sandbox.
Citations
@inproceedings{hua2025researchcodebench,
title={ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code},
author={Hua, Tianyu and Hua, Harper and Xiang, Violet and Klieger, Benjamin and Truong, Sang T. and Liang, Weixin and Sun, Fan-Yun and Haber, Nick},
booktitle={Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track},
year={2025},
url={https://arxiv.org/abs/2506.02314}
}