scienceagentbench
Description
ScienceAgentBench is a benchmark for evaluating language agents on data-driven scientific discovery. It consists of 102 tasks extracted from 44 peer-reviewed publications across four disciplines and validated by nine subject-matter experts. The target output of every task is a self-contained Python program, which is assessed with multiple metrics covering the correctness of the generated code, its execution results, and its computational cost. Rounds of manual validation and contamination-mitigation strategies help ensure scientific authenticity and real-world relevance.
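To make the three metric families concrete, here is a minimal sketch of how per-task results might be aggregated into benchmark-level numbers. It is not the benchmark's own harness: the `TaskResult` record, its field names, and the `summarize` helper are hypothetical, and the choice of a simple mean is an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    executed_ok: bool   # assumed flag: the generated program ran without error
    code_score: float   # assumed code-correctness score in [0, 1]
    cost: float         # computational cost, in whatever unit the harness reports


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Average each metric over all tasks (a simple mean, by assumption)."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "execution_rate": mean(1.0 if r.executed_ok else 0.0 for r in results),
        "mean_code_score": mean(r.code_score for r in results),
        "mean_cost": mean(r.cost for r in results),
    }
```

Reporting the three averages side by side, rather than collapsing them into one score, keeps the trade-off between correctness and cost visible.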
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 2 months ago |