ML-Dev-Bench
Description
ML-Dev-Bench is an environment for evaluating agents on end-to-end machine learning development tasks. Based on the ML-Dev-Bench benchmark, agents are given ML development tasks (e.g., training models, running experiments, debugging pipelines) and must complete them within a sandboxed container. Verification runs a test suite that checks the agent's output against expected results.
This OpenReward implementation is ported from the Harbor Framework implementation originally made by Harshith Padigela.
Capabilities
- End-to-end machine learning model development
- Debugging and fixing ML training pipelines
- Running experiments and interpreting results
- Multi-step software engineering for ML workflows
Compute Requirements
Agents in ML-Dev-Bench are given a sandbox. Machine size varies per task (derived from task configuration), with a default of 4 CPUs and 16 GB RAM.
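As a rough illustration of per-task machine sizing, a task configuration might carry resource overrides along the following lines. This is a minimal sketch: the `TaskConfig` name, field names, and defaults are assumptions for illustration, not the actual OpenReward schema.

```python
from dataclasses import dataclass

# Hypothetical per-task compute configuration.
# Field names and defaults are illustrative assumptions, not the real schema.
@dataclass
class TaskConfig:
    task_id: str
    cpus: int = 4        # default machine size
    memory_gb: int = 16  # default RAM

    def machine_spec(self) -> dict:
        """Return the sandbox machine spec derived from this task's config."""
        return {"cpus": self.cpus, "memory_gb": self.memory_gb}

# A task that overrides the default for a heavier workload (illustrative only).
heavy_task = TaskConfig(task_id="train_resnet_cifar10", cpus=8, memory_gb=32)
print(heavy_task.machine_spec())  # {'cpus': 8, 'memory_gb': 32}
```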
License
MIT.
Tasks
There is one split: test (33 tasks). Each task corresponds to an ML development challenge with a dedicated Docker image containing the project setup, instruction file, and test suite.
Reward Structure
This is a sparse, verifiable reward environment. The agent works in the sandbox and calls submit_answer when finished. The environment uploads and runs the test suite inside the container, then reads the reward from the verifier output.
- Pass: Reward determined by the test verifier (0.0 to 1.0).
- Fail: Reward 0.0.
We do not use LLM graders in this environment.
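As a minimal sketch of the verification flow (assuming a hypothetical `sandbox.exec` helper and a verifier that prints a final `reward=<float>` line; both are illustrative assumptions, not the actual OpenReward interface):

```python
import re

def verify_submission(sandbox) -> float:
    """Illustrative verification flow: run the task's test suite inside the
    container and read the reward from the verifier's output.

    `sandbox.exec` and the `reward=<float>` output line are assumptions made
    for this sketch, not the real OpenReward API.
    """
    # Run the uploaded test suite inside the container.
    result = sandbox.exec("python /tests/run_tests.py")

    # The verifier is assumed to print a line like "reward=0.75".
    match = re.search(r"reward=([01](?:\.\d+)?)", result.stdout)
    if match is None:
        return 0.0  # Fail: no parseable reward -> reward 0.0
    return float(match.group(1))  # Pass: reward between 0.0 and 1.0
```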
Data
Tasks are derived from the ML-Dev-Bench benchmark, which provides end-to-end ML development challenges. Each task has a dedicated Docker image, instruction file, and tests. Data files are stored on the OpenReward platform.
Tools
Agents are given five tools:
- bash: Run a bash command in the container.
- str_replace: Replace a unique string in a file with another string.
- view: View file contents or directory listings.
- create_file: Create a new file with specified content.
- submit_answer: Submit the final answer, triggering the test suite to run. Returns the test output and reward. This tool can only be called once per task.
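For illustration, a tool-call sequence for a pipeline-debugging task might look like the following sketch. The file paths, arguments, and dict-style call format are hypothetical; the actual call format depends on the OpenReward client.

```python
# Illustrative tool-call sequence for a debugging task.
# Paths and arguments are hypothetical, not taken from a real task.
calls = [
    {"tool": "view", "args": {"path": "train.py"}},
    {"tool": "bash", "args": {"command": "python train.py 2>&1 | tail -n 20"}},
    {"tool": "str_replace", "args": {
        "path": "train.py",
        "old_str": "loss.backward(retain_graph=True)",
        "new_str": "loss.backward()",
    }},
    {"tool": "bash", "args": {"command": "python train.py"}},
    # submit_answer may only be called once; it runs the test suite and returns the reward.
    {"tool": "submit_answer", "args": {"answer": "Fixed the backward pass in train.py"}},
]
```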
Time Horizon
ML-Dev-Bench is a multi-turn environment. The agent iterates using bash, view, str_replace, and create_file tools to develop and test ML solutions before submitting the final answer.
Environment Difficulty
ML-Dev-Bench is challenging, particularly for open-ended tasks. The original paper evaluates three agent scaffolds (OpenHands, ReAct, and AIDE), each paired with different backing LLMs, across 30 tasks:
| Agent | Overall | Dataset Handling | Debugging | Model Training | Model Perf. |
|---|---|---|---|---|---|
| OpenHands-Sonnet | 50% | 100% | 57% | 83% | 0% |
| ReAct-Sonnet | 47% | 100% | 57% | 67% | 0% |
| OpenHands-Gemini | 17% | 66% | 14% | 33% | 0% |
| AIDE-4o | 17% | 33% | 29% | 33% | 0% |
| ReAct-4o | 17% | 0% | 14% | 50% | 0% |
Agents perform well on structured tasks like dataset handling but struggle with debugging and model implementation. All agents completely failed open-ended model performance improvement tasks.
Other Environment Requirements
There are no further environment requirements; ML-Dev-Bench works out of the box with the OpenReward endpoint.
Safety
Agents in ML-Dev-Bench develop machine learning models inside sandboxed Docker containers. The environment does not present direct safety risks, as agents only interact with isolated containers with no access to external systems beyond the sandbox.
Citations
@article{padigela2025mldevbench,
title={ML-Dev-Bench: Comparative Analysis of AI Agents on ML Development Workflows},
author={Padigela, Harshith and Shah, Chintan and Juyal, Dinkar},
journal={arXiv preprint arXiv:2502.00964},
year={2025}
}