AInsteinBench
Description
AInsteinBench is a large-scale benchmark for evaluating whether LLM agents can operate as scientific computing development agents within real research software ecosystems, using end-to-end tasks grounded in production-grade repositories. It comprises maintainer-authored pull-request tasks from six scientific codebases spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. The tasks are curated through multi-stage filtering and expert review, then evaluated in executable environments with test-driven verification, so that results capture scientifically meaningful failures and core computational research competencies.
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 1 month ago |