scienceagentbench

Name: arXiv/scienceagentbench
Author: arXiv

Description

ScienceAgentBench is a benchmark for evaluating language agents on data-driven scientific discovery. It consists of 102 tasks extracted from 44 peer-reviewed publications across four disciplines and validated by nine subject-matter experts. Each task's target output is unified into a self-contained Python program, which is assessed with multiple metrics covering the correctness of the generated code, its execution results, and its computational cost. Multiple rounds of manual validation and contamination-mitigation strategies help ensure scientific authenticity and real-world relevance.
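As a rough illustration of how such an evaluation might be organized, the sketch below runs a generated program file and aggregates per-task outcomes into summary metrics. It is not the benchmark's official harness: `run_program`, `TaskResult`, `summarize`, and the metric names `execution_rate`, `success_rate`, and `mean_cost` are hypothetical names chosen for this example, and how a task's success is actually judged is left outside the sketch.

```python
import subprocess
import sys
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record layout, not the official one)."""
    task_id: int
    executed: bool   # the generated program ran and exited with code 0
    success: bool    # the output met the task's success criteria (judged elsewhere)
    cost: float      # cost attributed to producing the program, in arbitrary units


def run_program(path: str, timeout: float = 60.0) -> bool:
    """Run a self-contained Python program file; return True if it exits cleanly."""
    try:
        completed = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task outcomes into benchmark-level summary numbers."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "execution_rate": sum(r.executed for r in results) / n,
        "success_rate": sum(r.success for r in results) / n,
        "mean_cost": mean(r.cost for r in results),
    }


# Made-up example values, for illustration only.
example = [
    TaskResult(task_id=1, executed=True, success=True, cost=0.5),
    TaskResult(task_id=2, executed=True, success=False, cost=0.25),
    TaskResult(task_id=3, executed=False, success=False, cost=0.25),
    TaskResult(task_id=4, executed=True, success=True, cost=1.0),
]
summary = summarize(example)
```

Keeping execution, success, and cost as separate fields reflects the idea that a program can run without producing a scientifically correct result, so the two rates are reported independently.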

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/ScienceAgentBench	0	3 months ago