exp-bench

Name: arXiv/exp-bench
Author: arXiv

Description

EXP-Bench (Experimental AI Research Benchmark) is a benchmark for systematically evaluating AI agents on end-to-end research experiments sourced from influential AI papers. Given incomplete starter code, an agent must formulate hypotheses, design and implement procedures, execute experiments, and analyze the results. The benchmark provides 461 curated tasks from 51 top-tier papers, assembled via a semi-autonomous pipeline that extracts experimental details. Results highlight current agent limitations: agents score 20–35% on individual subtasks, while the full-experiment success rate is only 0.5%.
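
The gap between subtask scores and the full-experiment success rate follows from simple arithmetic: a full experiment counts as a success only when every stage succeeds on the same task, so the joint rate can never exceed the weakest per-stage rate and is usually far lower. The following is a minimal sketch of that relationship, not the benchmark's actual scoring code. The stage names, the `TaskResult` type, and the toy results are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical stage names, loosely following the description above;
# these are an assumption, not the benchmark's real metric names.
STAGES = ("design", "implementation", "execution", "analysis")


@dataclass
class TaskResult:
    task_id: str
    passed: dict  # maps stage name -> bool


def stage_rates(results):
    """Fraction of tasks that pass each stage, taken independently."""
    return {s: sum(r.passed[s] for r in results) / len(results) for s in STAGES}


def full_success_rate(results):
    """Fraction of tasks that pass every stage at once."""
    return sum(all(r.passed[s] for s in STAGES) for r in results) / len(results)


# Toy, made-up outcomes (not real EXP-Bench data).
results = [
    TaskResult("t1", {"design": True, "implementation": False, "execution": False, "analysis": True}),
    TaskResult("t2", {"design": False, "implementation": True, "execution": False, "analysis": False}),
    TaskResult("t3", {"design": True, "implementation": True, "execution": True, "analysis": True}),
    TaskResult("t4", {"design": False, "implementation": False, "execution": True, "analysis": False}),
]

rates = stage_rates(results)
full = full_success_rate(results)
print(rates)
print(full)
```

In this toy set every stage passes on half of the tasks, yet only one task in four passes all stages together, which mirrors how moderate subtask scores can coexist with a very low end-to-end success rate.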

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/EXP-Bench	0	3 months ago
