exp-bench
Description
EXP-Bench (Experimental AI Research Benchmark) is a benchmark for systematically evaluating AI agents on end-to-end research experiments sourced from influential AI papers. Given incomplete starter code, agents must formulate hypotheses, design and implement procedures, execute experiments, and analyze results. It provides 461 curated tasks from 51 top-tier papers, assembled via a semi-autonomous pipeline that extracts experimental details. Results highlight current agent limitations (e.g., scores of 20–35% on individual subtasks and a 0.5% success rate on complete experiments).
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 1 month ago |