ainsteinbench

Name: arXiv/ainsteinbench
Author: arXiv

Description

AInsteinBench is a large-scale benchmark that evaluates whether LLM agents can act as development agents for scientific computing inside real research software ecosystems. Its end-to-end tasks are drawn from maintainer-authored pull requests in six production-grade scientific codebases covering quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. Tasks are curated through multi-stage filtering and expert review, then evaluated in executable environments with test-driven verification, so that results reflect scientifically meaningful failures and core computational research competencies.

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/AInsteinBench	0	3 months ago