CORE-Bench (Easy)

Description

CORE-Bench (Easy) is an environment for evaluating an agent's ability to interpret and analyze outputs from scientific code. Based on the CORE-Bench computational reproducibility benchmark, this easy variant presents tasks where the agent must answer questions about pre-computed scientific results (e.g., "Report the final AUC after training") by reading output files, without executing the underlying code.

This OpenReward implementation is ported from the Harbor Framework implementation originally made by Michael Yang.

Capabilities

Reading and interpreting scientific code outputs
Extracting specific metrics from result files (CSVs, logs, figures)
Understanding scientific computing workflows
Producing structured answers in JSON format

Compute Requirements

Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.

License

MIT.

Tasks

There is one split in this environment:

Test: 18 scientific output analysis tasks

Tasks are drawn from CodeOcean capsules across computer science, social science, and medicine. Each task asks specific questions about pre-computed results.

Reward Structure

This is a multi-turn, sandbox-based environment. The agent reads result files, extracts the requested information, writes answers to a JSON file, and calls submit_answer for verification via pytest.

1.0: All extracted answers match expected values.
0.0: Any answer is incorrect or missing.
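The all-or-nothing rule above can be illustrated with a minimal sketch. The real check lives in each task's pytest scripts, so the function name `binary_reward`, the dictionary shapes, and the numeric tolerance `rel_tol` are assumptions made for illustration only.

```python
import math


def _is_number(value) -> bool:
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def binary_reward(submitted: dict, expected: dict, rel_tol: float = 1e-6) -> float:
    """Return 1.0 only if every expected answer is present and matches, else 0.0.

    Hypothetical scoring helper: the tolerance and comparison rules are
    assumptions, not the environment's actual verification logic.
    """
    for question, target in expected.items():
        if question not in submitted:
            return 0.0  # a missing answer fails the whole task
        answer = submitted[question]
        if _is_number(target) and _is_number(answer):
            # Compare numbers with a small relative tolerance (assumed).
            if not math.isclose(answer, target, rel_tol=rel_tol):
                return 0.0
        elif answer != target:
            return 0.0  # non-numeric answers must match exactly
    return 1.0
```

Under this sketch there is no partial credit: one wrong or missing answer out of several yields 0.0 for the task.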

Data

Each task directory contains an instruction.md with questions about the results, a results/ directory with pre-computed outputs, and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
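As a rough illustration of the workflow this layout implies, the sketch below reads a metric from a file under results/ and writes it to a JSON answer file. The file names `metrics.csv` and `answers.json`, the column name `auc`, and the answer key are hypothetical; each task's instruction.md defines the real questions and the expected answer format.

```python
import csv
import json
import tempfile
from pathlib import Path


def final_metric(csv_path: Path, column: str) -> float:
    """Return the value of `column` in the last data row of a CSV file."""
    with csv_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError(f"{csv_path} has no data rows")
    return float(rows[-1][column])


def write_answers(answers: dict, out_path: Path) -> None:
    """Serialize the answers dictionary as JSON."""
    out_path.write_text(json.dumps(answers, indent=2))


# Demo on a throwaway directory that mimics the described layout.
# All file names and values here are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp)
    results_dir = task_dir / "results"
    results_dir.mkdir()
    (results_dir / "metrics.csv").write_text("epoch,auc\n1,0.81\n2,0.87\n3,0.91\n")

    auc = final_metric(results_dir / "metrics.csv", "auc")
    write_answers({"Report the final AUC after training": auc}, task_dir / "answers.json")
    saved = json.loads((task_dir / "answers.json").read_text())
```

In the sandbox, the same steps could equally be done with the `bash`, `view`, and `create_file` tools; the Python version only makes the data flow explicit.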

Tools

Tool	Description
`bash`	Execute shell commands in the sandbox.
`str_replace`	Replace a unique string in a file.
`view`	View file contents or list directory contents.
`create_file`	Create a new file with specified content.
`submit_answer`	Submit work for automated verification.

Time Horizon

CORE-Bench (Easy) is a multi-turn environment. Agents read the question, explore result files, extract the requested metrics, and submit structured answers.

Environment Difficulty

The original paper evaluates agents across difficulty levels:

Agent	Easy	Medium	Hard
CORE-Agent (GPT-4o)	60.0%	57.8%	21.5%
CORE-Agent (GPT-4o-mini)	44.4%	32.6%	16.3%
AutoGPT (GPT-4o)	35.6%	37.8%	6.7%

This easy variant focuses on result interpretation without code execution. In the paper's results above, CORE-Agent, which adds task-specific prompt modifications to AutoGPT, raised easy-task accuracy with GPT-4o from 35.6% to 60.0%.

Other Environment Requirements

There are no further environment requirements; CORE-Bench (Easy) works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in CORE-Bench (Easy) read and analyze scientific output files in a sandboxed environment. The environment does not present direct safety risks.

Citations

@article{siegel2024corebench,
  author    = {Zachary S. Siegel and Sayash Kapoor and Nitya Nadgir and Benedikt Stroebl and Arvind Narayanan},
  title     = {CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark},
  journal   = {Transactions on Machine Learning Research (TMLR)},
  year      = {2024},
  url       = {https://arxiv.org/abs/2409.11363}
}

Compute Configuration

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320
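
To put the per-second rate in more familiar terms, the sketch below scales the listed total to longer runs. It assumes cost grows linearly with wall-clock time, with no minimum charge or rounding, which the table does not confirm; `run_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Total rate from the table above (the sandbox is listed as not configured).
TOTAL_COST_PER_SECOND = Decimal("0.0000320")


def run_cost(seconds: int, rate: Decimal = TOTAL_COST_PER_SECOND) -> Decimal:
    """Estimated cost in dollars for a run of `seconds`, assuming linear billing."""
    return rate * seconds


ten_minutes = run_cost(600)   # equals Decimal("0.0192")
one_hour = run_cost(3600)     # equals Decimal("0.1152")
```

Under that assumption, an hour of environment time comes to about $0.12.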

corebench-easy

siegelz/corebench-easy
