# PaperBench

## Description
PaperBench is a benchmark for evaluating AI agents' ability to replicate state-of-the-art AI research from scratch by reproducing 20 ICML 2024 Spotlight and Oral papers — including understanding each paper's contributions, developing a codebase, and executing the experiments. It comprises 8,316 individually gradable tasks defined by hierarchical rubrics co-developed with the papers' authors, and it provides an LLM-based judge, plus a separate judge-evaluation benchmark, to enable scalable automatic grading.
## Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 1 month ago |