corebench-easy
CORE-Bench (Easy)
Description
CORE-Bench (Easy) is an environment for evaluating an agent's ability to interpret and analyze outputs from scientific code. Based on the CORE-Bench computational reproducibility benchmark, this easy variant presents tasks where the agent must answer questions about pre-computed scientific results (e.g., "Report the final AUC after training") by reading output files — without needing to execute the underlying code.
This OpenReward implementation is ported from the Harbor Framework implementation originally made by Michael Yang.
Capabilities
- Reading and interpreting scientific code outputs
- Extracting specific metrics from result files (CSVs, logs, figures)
- Understanding scientific computing workflows
- Producing structured answers in JSON format
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. The default sandbox size is 1 CPU and 2 GB of RAM.
License
MIT.
Tasks
There is one split in this environment:
- Test: 18 scientific output analysis tasks
Tasks are drawn from CodeOcean capsules across computer science, social science, and medicine. Each task asks specific questions about pre-computed results.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent reads result files, extracts the requested information, writes answers to a JSON file, and calls submit_answer for verification via pytest.
- 1.0: All extracted answers match expected values.
- 0.0: Any answer is incorrect or missing.
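The all-or-nothing rule above can be sketched as a small scoring function. This is only an illustration, not the environment's verification code: the real checks live in each task's pytest scripts, and the function name `score_submission`, the flat question-to-answer dictionary shape, and the numeric tolerance are all assumptions made for this example.

```python
import math


def score_submission(submitted: dict, expected: dict, rel_tol: float = 1e-6) -> float:
    """Return 1.0 only if every expected answer is present and matches, else 0.0.

    Hypothetical helper: the tolerance-based numeric comparison is an
    assumption, not the documented behavior of the task tests.
    """
    for key, want in expected.items():
        if key not in submitted:
            return 0.0  # a missing answer fails the whole task
        got = submitted[key]
        want_is_number = isinstance(want, (int, float)) and not isinstance(want, bool)
        if want_is_number:
            got_is_number = isinstance(got, (int, float)) and not isinstance(got, bool)
            if not got_is_number or not math.isclose(got, want, rel_tol=rel_tol):
                return 0.0  # wrong type or wrong value
        elif got != want:
            return 0.0  # non-numeric answers must match exactly
    return 1.0
```

A single wrong or missing key is enough to drop the reward to 0.0, which is why agents benefit from double-checking every question in the instruction file before submitting.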
Data
Each task directory contains an instruction.md with questions about the results, a results/ directory with pre-computed outputs, and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
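As a rough sketch of the layout described above, the following helper reports which of the three expected entries are absent from a task directory. The function name `missing_task_entries` is hypothetical, and the check assumes the entries sit directly under the task root, which the description suggests but does not spell out.

```python
from pathlib import Path

# Entries named in the task description, with the kind of path each should be.
EXPECTED_ENTRIES = [
    ("instruction.md", "file"),
    ("results", "dir"),
    ("tests", "dir"),
]


def missing_task_entries(task_dir) -> list:
    """Return the names of expected entries that are absent under task_dir."""
    root = Path(task_dir)
    missing = []
    for name, kind in EXPECTED_ENTRIES:
        path = root / name
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(name)
    return missing
```

An empty list means the directory has the expected shape; anything else names what to look for before starting on the questions.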
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification. |
Time Horizon
CORE-Bench (Easy) is a multi-turn environment. Agents read the question, explore result files, extract the requested metrics, and submit structured answers.
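The extract-and-answer part of that loop might look like the sketch below, which pulls the last value of a metric column from a CSV and serializes it as a JSON answer. The CSV contents, the column name `auc`, and the use of the question text as the JSON key are all assumptions for illustration; each task's instruction.md defines the actual questions and answer format.

```python
import csv
import io
import json


def final_metric(csv_text: str, column: str) -> float:
    """Return the value of `column` in the last data row of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV has no data rows")
    return float(rows[-1][column])


def build_answer(question: str, value: float) -> str:
    """Serialize one question/answer pair as a JSON object."""
    return json.dumps({question: value}, indent=2)


# Hypothetical training log standing in for a file under results/.
sample_log = "epoch,auc\n1,0.71\n2,0.78\n3,0.83\n"
auc = final_metric(sample_log, "auc")
print(build_answer("Report the final AUC after training", auc))
```

In the sandbox, the agent would read the real file with the view or bash tools, write the resulting JSON with create_file, and then call submit_answer.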
Environment Difficulty
The original paper reports the following accuracy for agents at each of the three difficulty levels:
| Agent | Easy | Medium | Hard |
|---|---|---|---|
| CORE-Agent (GPT-4o) | 60.0% | 57.8% | 21.5% |
| CORE-Agent (GPT-4o-mini) | 44.4% | 32.6% | 16.3% |
| AutoGPT (GPT-4o) | 35.6% | 37.8% | 6.7% |
This easy variant focuses on result interpretation without code execution. In the original paper, task-specific prompt modifications raised easy-task accuracy with GPT-4o from 35.6% (AutoGPT) to 60.0% (CORE-Agent).
Other Environment Requirements
There are no further environment requirements; CORE-Bench (Easy) works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in CORE-Bench (Easy) read and analyze scientific output files in a sandboxed environment. The environment does not present direct safety risks.
Citations
@article{siegel2024corebench,
author = {Zachary S. Siegel and Sayash Kapoor and Nitya Nagdir and Benedikt Stroebl and Arvind Narayanan},
title = {CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark},
journal = {Transactions on Machine Learning Research (TMLR)},
year = {2024},
url = {https://arxiv.org/abs/2409.11363}
}