PaperBench

Description

PaperBench is a benchmark for evaluating whether AI agents can replicate state-of-the-art AI research from scratch. Each of 20 ICML 2024 Spotlight and Oral papers must be reproduced end to end: understanding its contributions, developing a codebase, and executing its experiments. The benchmark comprises 8,316 individually gradable tasks defined by hierarchical rubrics co-developed with the papers' authors, and it provides an LLM-based judge plus a separate judge-evaluation benchmark to enable scalable automatic grading.
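The hierarchical rubrics can be pictured as trees whose leaves are the individually gradable tasks. A minimal sketch of this structure, assuming (hypothetically) that leaf scores are assigned by the judge and that parent scores are weighted averages of their children:

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    # Hypothetical sketch of a hierarchical rubric: leaf nodes are
    # individually gradable criteria (scored by a judge), and each
    # parent's score is the weighted average of its children's scores.
    # Node names and weights below are illustrative, not from PaperBench.
    name: str
    weight: float = 1.0
    score: float = 0.0  # set by the judge for leaf nodes
    children: list["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        if not self.children:
            return self.score
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total

# Example: two top-level contributions, one weighted more heavily.
root = RubricNode("paper", children=[
    RubricNode("code-development", weight=2.0, children=[
        RubricNode("model-implemented", score=1.0),
        RubricNode("training-loop-runs", score=0.0),
    ]),
    RubricNode("results-match", weight=1.0, score=1.0),
])
print(root.aggregate())  # (2.0 * 0.5 + 1.0 * 1.0) / 3.0
```

Grading then reduces to the judge scoring each leaf independently, with the paper-level score emerging from the weighted roll-up.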

Implementations (1)
Environment                    Stars   Last Updated
GeneralReasoning/PaperBench    0       1 month ago