scienceagentbench

Description

ScienceAgentBench is a benchmark for evaluating language agents on data-driven scientific discovery. It consists of 102 tasks extracted from 44 peer-reviewed publications across four disciplines, validated by nine subject-matter experts. Each task is unified into a self-contained Python program and assessed with multiple metrics covering the correctness of the generated code, its execution results, and its computational cost. Multiple rounds of manual validation and contamination-mitigation strategies are used to ensure scientific authenticity and real-world relevance.
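As a rough illustration of how per-task results along these three metric families might be aggregated, here is a minimal sketch. The `TaskResult` record, its field names, and the `summarize` helper are hypothetical and are not taken from the benchmark's own evaluation code; the sketch only assumes that each task yields an execution flag, a success flag, a code-similarity score in [0, 1], and a monetary cost.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Hypothetical per-task record; all field names are illustrative."""
    task_id: int
    executed: bool          # the generated program ran without error
    success: bool           # the program's output met the task's success criteria
    code_similarity: float  # similarity to a reference program, assumed in [0, 1]
    cost_usd: float         # assumed cost of generating the program


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Average each metric over all tasks (assumed aggregation: plain mean)."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "execution_rate": sum(r.executed for r in results) / n,
        "success_rate": sum(r.success for r in results) / n,
        "mean_code_similarity": sum(r.code_similarity for r in results) / n,
        "mean_cost_usd": sum(r.cost_usd for r in results) / n,
    }
```

Calling `summarize` on a list of `TaskResult` records returns one averaged number per metric. Reporting the metrics separately, rather than folding them into a single score, keeps trade-offs visible, for example an agent that succeeds more often but at a higher cost.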

Implementations (1)
Environment | Stars | Last Updated
GeneralReasoning/ScienceAgentBench | 0 | 2 months ago
arXiv/scienceagentbench | OpenReward