core-bench

Name: arXiv/core-bench
Author: arXiv

Description

CORE-Bench (Computational Reproducibility Agent Benchmark) evaluates how well AI agents perform computational reproducibility: reproducing the results of scientific studies using the provided code and data. It comprises 270 tasks drawn from 90 papers across computer science, social science, and medicine. The tasks span three difficulty levels and include both language-only and vision-language variants, and the benchmark includes a fast, parallelizable evaluation system.
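
The task count in the description is consistent with each paper appearing once at every difficulty level (90 × 3 = 270). The sketch below only illustrates that arithmetic. The one-task-per-paper-per-level layout, the level names `easy`, `medium`, and `hard`, the `paper-NNN` identifiers, and the `build_task_grid` helper are all assumptions made for illustration, not the benchmark's actual data format or API.

```python
from itertools import product

# Assumption: every paper contributes exactly one task per difficulty level.
# Level names and paper identifiers are hypothetical placeholders.
DIFFICULTY_LEVELS = ("easy", "medium", "hard")
NUM_PAPERS = 90


def build_task_grid(num_papers=NUM_PAPERS, levels=DIFFICULTY_LEVELS):
    """Return one (paper_id, level) pair for each task in the grid."""
    paper_ids = [f"paper-{i:03d}" for i in range(1, num_papers + 1)]
    return list(product(paper_ids, levels))


tasks = build_task_grid()
print(len(tasks))  # 90 papers x 3 levels = 270
```

Because the pairs are independent of one another, a grid like this is also the natural unit for running evaluations in parallel, one worker per (paper, level) pair.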

Implementations (1)

Environment	Stars	Last Updated
siegelz/corebench-easy	3	3 months ago