core-bench
Description
CORE-Bench (Computational Reproducibility Agent Benchmark) is a benchmark for evaluating how well AI agents handle computational reproducibility, that is, reproducing the results of scientific studies using the provided code and data. It comprises 270 tasks drawn from 90 papers across computer science, social science, and medicine, covers three difficulty levels and both language-only and vision-language tasks, and includes a fast, parallelizable evaluation system.
Implementations (1)
| Environment | Stars | Last Updated | |
|---|---|---|---|
| | 3 | 2 months ago | |