corebench-easy

CORE-Bench (Easy)

⭐ OpenReward Environment

Description

CORE-Bench (Easy) is an environment for evaluating an agent's ability to interpret and analyze outputs from scientific code. It is based on the CORE-Bench computational reproducibility benchmark. In this easy variant, the agent answers questions about pre-computed scientific results (e.g., "Report the final AUC after training") by reading output files, without executing the underlying code.

This OpenReward implementation is a port of the Harbor Framework implementation originally written by Michael Yang.

Capabilities

  • Reading and interpreting scientific code outputs
  • Extracting specific metrics from result files (CSVs, logs, figures)
  • Understanding scientific computing workflows
  • Producing structured answers in JSON format

Compute Requirements

Agents are given a sandboxed environment with bash access and file editing tools. The default sandbox size is 1 CPU and 2 GB of RAM.

License

MIT.

Tasks

There is one split in this environment:

  • Test: 18 scientific output analysis tasks

Tasks are drawn from CodeOcean capsules across computer science, social science, and medicine. Each task asks specific questions about pre-computed results.

Reward Structure

This is a multi-turn, sandbox-based environment. The agent reads result files, extracts the requested information, writes its answers to a JSON file, and calls submit_answer, which verifies them with pytest. The reward is binary:

  • 1.0: All extracted answers match expected values.
  • 0.0: Any answer is incorrect or missing.
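
The all-or-nothing rule above can be sketched as follows. This is a minimal illustration, not the environment's actual pytest checks: the function name `binary_reward`, the flat key/value answer format, and the numeric tolerance are all assumptions made for the example.

```python
import math


def binary_reward(submitted: dict, expected: dict, rel_tol: float = 1e-6) -> float:
    """Return 1.0 only if every expected answer is present and matches; else 0.0.

    Hypothetical sketch: the real verification scripts live in each task's
    tests/ directory and may compare values differently.
    """
    for key, want in expected.items():
        if key not in submitted:
            return 0.0  # a missing answer fails the whole task
        got = submitted[key]
        both_numeric = (
            isinstance(want, (int, float)) and isinstance(got, (int, float))
            and not isinstance(want, bool) and not isinstance(got, bool)
        )
        if both_numeric:
            # Assumed tolerance for floating-point metrics.
            if not math.isclose(got, want, rel_tol=rel_tol):
                return 0.0
        elif got != want:
            return 0.0
    return 1.0
```

Under this sketch there is no partial credit: one wrong or missing answer among several gives the same reward as answering nothing.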

Data

Each task directory contains an instruction.md with questions about the results, a results/ directory with pre-computed outputs, and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
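
As a rough illustration of this layout, the sketch below checks that a directory contains the three parts named above. The helper name `missing_task_parts` is hypothetical and not part of the environment.

```python
from pathlib import Path


def missing_task_parts(task_dir) -> list[str]:
    """List which of the expected task components are absent from task_dir."""
    root = Path(task_dir)
    checks = {
        "instruction.md": (root / "instruction.md").is_file(),  # questions
        "results/": (root / "results").is_dir(),                # pre-computed outputs
        "tests/": (root / "tests").is_dir(),                    # verification scripts
    }
    return [name for name, present in checks.items() if not present]
```

An empty list means the directory matches the layout described here.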

Tools

Tool            Description
bash            Execute shell commands in the sandbox.
str_replace     Replace a unique string in a file.
view            View file contents or list directory contents.
create_file     Create a new file with specified content.
submit_answer   Submit work for automated verification.

Time Horizon

CORE-Bench (Easy) is a multi-turn environment. Agents read the question, explore result files, extract the requested metrics, and submit structured answers.
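
The extract-and-answer step might look like the sketch below. It assumes a log that reports metrics as lines such as `AUC: 0.912`. The function names, the log format, and the shape of the answer dictionary are illustrative assumptions; each task's instruction.md defines the real questions and the expected output file.

```python
import json
import re
from pathlib import Path


def extract_last_metric(log_text: str, name: str):
    """Return the last value reported for `name` (e.g. 'AUC: 0.912'), or None."""
    pattern = re.compile(rf"\b{re.escape(name)}\s*[:=]\s*(-?\d+(?:\.\d+)?)")
    matches = pattern.findall(log_text)
    return float(matches[-1]) if matches else None


def write_answers(answers: dict, path) -> None:
    """Write the answers as JSON; the output path is chosen by the caller."""
    Path(path).write_text(json.dumps(answers, indent=2))
```

Taking the last match reflects questions of the "final AUC after training" kind, where earlier epochs report the same metric name.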

Environment Difficulty

The original paper evaluates agents across difficulty levels:

Agent                      Easy    Medium   Hard
CORE-Agent (GPT-4o)        60.0%   57.8%    21.5%
CORE-Agent (GPT-4o-mini)   44.4%   32.6%    16.3%
AutoGPT (GPT-4o)           35.6%   37.8%    6.7%

This easy variant focuses on result interpretation without code execution. In the paper, CORE-Agent is AutoGPT with task-specific prompt modifications; with GPT-4o, these modifications raised easy-task accuracy from 35.6% to 60.0%.

Other Environment Requirements

There are no further environment requirements; CORE-Bench (Easy) works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in CORE-Bench (Easy) read and analyze scientific output files in a sandboxed environment. The environment does not present direct safety risks.

Citations

@article{siegel2024corebench,
  author    = {Zachary S. Siegel and Sayash Kapoor and Nitya Nadgir and Benedikt Stroebl and Arvind Narayanan},
  title     = {CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark},
  journal   = {Transactions on Machine Learning Research (TMLR)},
  year      = {2024},
  url       = {https://arxiv.org/abs/2409.11363}
}