discoverybench

Description

DiscoveryBench is the first comprehensive benchmark that formalizes and evaluates the multi-step process of data-driven discovery using large language models. It comprises 264 tasks across six domains, derived from published workflows, with each task defined by a dataset, its metadata, and a natural-language discovery goal. A further 903 synthetic tasks allow controlled evaluation at varying levels of complexity. A structured, facet-based evaluation makes it possible to analyze failure modes.

Implementations (1)
Environment                      Stars  Last Updated
GeneralReasoning/DiscoveryBench  0      2 months ago