ml-dev-bench

Description

ML-Dev-Bench is a benchmark for testing agentic capabilities on applied machine learning development tasks. Its 30 diverse tasks assess performance across dataset handling, model training, improving existing models, debugging, and API integration with popular ML tools. The benchmark evaluates agents such as ReAct, OpenHands, and AIDE, and is open source.

Implementations (1)
Environment | Stars | Last Updated
GeneralReasoning/ml-dev-bench | 0 | 1 month ago
arXiv/ml-dev-bench | OpenReward