PaperBench

Description

PaperBench is a benchmark for evaluating whether AI agents can replicate state-of-the-art AI research from scratch. Each of 20 ICML 2024 Spotlight and Oral papers must be reproduced end to end: understanding its contributions, developing a codebase, and executing its experiments. The benchmark comprises 8,316 individually gradable tasks defined by hierarchical rubrics co-developed with the papers' authors, and it provides an LLM-based judge plus a separate judge-evaluation benchmark to enable scalable automatic grading.
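The hierarchical rubrics can be pictured as trees whose leaves are the individually gradable tasks. A minimal sketch of this structure, assuming (hypothetically) that leaf scores are assigned by the judge and that parent scores are weighted averages of their children:

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    # Hypothetical sketch of a hierarchical rubric: leaf nodes are
    # individually gradable criteria (scored by a judge), and each
    # parent's score is the weighted average of its children's scores.
    # Node names and weights below are illustrative, not from PaperBench.
    name: str
    weight: float = 1.0
    score: float = 0.0  # set by the judge for leaf nodes
    children: list["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        if not self.children:
            return self.score
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total

# Example: two top-level contributions, one weighted more heavily.
root = RubricNode("paper", children=[
    RubricNode("code-development", weight=2.0, children=[
        RubricNode("model-implemented", score=1.0),
        RubricNode("training-loop-runs", score=0.0),
    ]),
    RubricNode("results-match", weight=1.0, score=1.0),
])
print(root.aggregate())  # (2.0 * 0.5 + 1.0 * 1.0) / 3.0
```

Grading then reduces to the judge scoring each leaf independently, with the paper-level score emerging from the weighted roll-up.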

Implementations (1)
Environment                    Stars   Last Updated
GeneralReasoning/PaperBench    0       1 month ago