ML-Dev-Bench
Description
ML-Dev-Bench is an environment for evaluating agents on end-to-end machine learning development tasks. Based on the ML-Dev-Bench benchmark, agents are given ML development tasks (e.g., training models, running experiments, debugging pipelines) and must complete them within a sandboxed container. Verification runs a test suite that checks the agent's output against expected results.
This OpenReward implementation is ported from the Harbor Framework implementation originally made by Harshith Padigela.
Capabilities
- End-to-end machine learning model development
- Debugging and fixing ML training pipelines
- Running experiments and interpreting results
- Multi-step software engineering for ML workflows
Compute Requirements
Agents in ML-Dev-Bench are given a sandbox. Machine size varies per task (derived from task configuration), with a default of 4 CPUs and 16 GB RAM.
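As a rough illustration of per-task machine sizing, a task configuration might carry resource overrides along the following lines. This is a minimal sketch: the `TaskConfig` name, field names, and defaults are assumptions for illustration, not the actual OpenReward schema.

```python
from dataclasses import dataclass

# Hypothetical per-task compute configuration.
# Field names and defaults are illustrative assumptions, not the real schema.
@dataclass
class TaskConfig:
    task_id: str
    cpus: int = 4        # default machine size
    memory_gb: int = 16  # default RAM

    def machine_spec(self) -> dict:
        """Return the sandbox machine spec derived from this task's config."""
        return {"cpus": self.cpus, "memory_gb": self.memory_gb}

# A task that overrides the default for a heavier workload (illustrative only).
heavy_task = TaskConfig(task_id="train_resnet_cifar10", cpus=8, memory_gb=32)
print(heavy_task.machine_spec())  # {'cpus': 8, 'memory_gb': 32}
```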
License
MIT.
Tasks
There is one split: test (33 tasks). Each task corresponds to an ML development challenge with a dedicated Docker image containing the project setup, instruction file, and test suite.
Reward Structure
This is a sparse, verifiable reward environment. The agent works in the sandbox and calls submit_answer when finished. The environment uploads and runs the test suite inside the container, then reads the reward from the verifier output.
- Pass: Reward determined by the test verifier (0.0 to 1.0).
- Fail: Reward 0.0.
We do not use LLM graders in this environment.
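As a minimal sketch of the verification flow (assuming a hypothetical `sandbox.exec` helper and a verifier that prints a final `reward=<float>` line; both are illustrative assumptions, not the actual OpenReward interface):

```python
import re

def verify_submission(sandbox) -> float:
    """Illustrative verification flow: run the task's test suite inside the
    container and read the reward from the verifier's output.

    `sandbox.exec` and the `reward=<float>` output line are assumptions made
    for this sketch, not the real OpenReward API.
    """
    # Run the uploaded test suite inside the container.
    result = sandbox.exec("python /tests/run_tests.py")

    # The verifier is assumed to print a line like "reward=0.75".
    match = re.search(r"reward=([01](?:\.\d+)?)", result.stdout)
    if match is None:
        return 0.0  # Fail: no parseable reward -> reward 0.0
    return float(match.group(1))  # Pass: reward between 0.0 and 1.0
```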
Data
Tasks are derived from the ML-Dev-Bench benchmark, which provides end-to-end ML development challenges. Each task has a dedicated Docker image, instruction file, and tests. Data files are stored on the OpenReward platform.
Tools
Agents are given five tools:
- bash: Run a bash command in the container.
- str_replace: Replace a unique string in a file with another string.
- view: View file contents or directory listings.
- create_file: Create a new file with specified content.
- submit_answer: Submit the final answer, triggering the test suite to run. Returns the test output and reward. This tool can only be called once per task.
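For illustration, a tool-call sequence for a pipeline-debugging task might look like the following sketch. The file paths, arguments, and dict-style call format are hypothetical; the actual call format depends on the OpenReward client.

```python
# Illustrative tool-call sequence for a debugging task.
# Paths and arguments are hypothetical, not taken from a real task.
calls = [
    {"tool": "view", "args": {"path": "train.py"}},
    {"tool": "bash", "args": {"command": "python train.py 2>&1 | tail -n 20"}},
    {"tool": "str_replace", "args": {
        "path": "train.py",
        "old_str": "loss.backward(retain_graph=True)",
        "new_str": "loss.backward()",
    }},
    {"tool": "bash", "args": {"command": "python train.py"}},
    # submit_answer may only be called once; it runs the test suite and returns the reward.
    {"tool": "submit_answer", "args": {"answer": "Fixed the backward pass in train.py"}},
]
```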
Time Horizon
ML-Dev-Bench is a multi-turn environment. The agent iterates using bash, view, str_replace, and create_file tools to develop and test ML solutions before submitting the final answer.
Environment Difficulty
ML-Dev-Bench is challenging, particularly for open-ended tasks. The original paper evaluates three agent scaffolds (OpenHands, ReAct, and AIDE), each paired with different backing LLMs, across 30 tasks:
| Agent | Overall | Dataset Handling | Debugging | Model Training | Model Perf. |
|---|---|---|---|---|---|
| OpenHands-Sonnet | 50% | 100% | 57% | 83% | 0% |
| ReAct-Sonnet | 47% | 100% | 57% | 67% | 0% |
| OpenHands-Gemini | 17% | 66% | 14% | 33% | 0% |
| AIDE-4o | 17% | 33% | 29% | 33% | 0% |
| ReAct-4o | 17% | 0% | 14% | 50% | 0% |
Agents perform well on structured tasks like dataset handling but struggle with debugging and model implementation. All agents completely failed open-ended model performance improvement tasks.
Other Environment Requirements
There are no further environment requirements; ML-Dev-Bench works out of the box with the OpenReward endpoint.
Safety
Agents in ML-Dev-Bench develop machine learning models inside sandboxed Docker containers. The environment does not present direct safety risks, as agents only interact with isolated containers with no access to external systems beyond the sandbox.
Citations
@article{padigela2025mldevbench,
title={ML-Dev-Bench: Comparative Analysis of AI Agents on ML Development Workflows},
author={Padigela, Harshith and Shah, Chintan and Juyal, Dinkar},
journal={arXiv preprint arXiv:2502.00964},
year={2025}
}