ML-Dev-Bench

OpenReward Environment

Description

ML-Dev-Bench is an environment for evaluating agents on end-to-end machine learning development tasks. Based on the ML-Dev-Bench benchmark, agents are given ML development tasks (e.g., training models, running experiments, debugging pipelines) and must complete them within a sandboxed container. Verification runs a test suite that checks the agent's output against expected results.

This OpenReward implementation is ported from the Harbor Framework implementation originally made by Harshith Padigela.

Capabilities

  • End-to-end machine learning model development
  • Debugging and fixing ML training pipelines
  • Running experiments and interpreting results
  • Multi-step software engineering for ML workflows

Compute Requirements

Agents in ML-Dev-Bench are given a sandbox. Machine size varies per task (derived from task configuration), with a default of 4 CPUs and 16 GB RAM.

License

MIT.

Tasks

There is one split: test (33 tasks). Each task corresponds to an ML development challenge with a dedicated Docker image containing the project setup, instruction file, and test suite.

Reward Structure

This is a sparse, verifiable reward environment. The agent works in the sandbox and calls submit_answer when finished. The environment uploads and runs the test suite inside the container, then reads the reward from the verifier output.

  • Pass: Reward determined by the test verifier (0.0 to 1.0).
  • Fail: Reward 0.0.

We do not use LLM graders in this environment.
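The mapping from verifier output to reward can be sketched as follows. This is a hypothetical illustration: the function name, arguments, and the fraction-of-tests-passed scheme are assumptions, since the actual verifier output format and partial-credit rules are task-specific.

```python
def reward_from_verifier(passed: int, total: int, submitted: bool) -> float:
    """Map verifier test results to a sparse reward in [0.0, 1.0].

    Hypothetical sketch: the real verifier output format and any
    partial-credit scheme are defined per task by its test suite.
    """
    if not submitted or total == 0:
        return 0.0  # no submission (or an empty test suite) counts as a fail
    return passed / total  # fraction of tests passed, between 0.0 and 1.0
```

Under this scheme a failed or missing submission yields 0.0, and a submission that passes every test yields 1.0.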

Data

Tasks are derived from the ML-Dev-Bench benchmark, which provides end-to-end ML development challenges. Each task has a dedicated Docker image, instruction file, and tests. Data files are stored on the OpenReward platform.

Tools

Agents are given five tools:

  • bash: Run a bash command in the container.
  • str_replace: Replace a unique string in a file with another string.
  • view: View file contents or directory listings.
  • create_file: Create a new file with specified content.
  • submit_answer: Submit the final answer, triggering the test suite to run. Returns the test output and reward. This tool can only be called once per task.
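The str_replace tool's uniqueness requirement can be sketched like this. This is a minimal illustration of the described semantics, not the environment's actual implementation; the function signature and error messages are assumptions.

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace a unique occurrence of `old` in the file at `path` with `new`.

    Hypothetical sketch of the str_replace tool's semantics: the target
    string must occur exactly once, so the edit is unambiguous.
    """
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError(f"string not found in {path}")
    if count > 1:
        raise ValueError(f"string occurs {count} times in {path}; must be unique")
    Path(path).write_text(text.replace(old, new))
```

Requiring a unique match avoids silently editing the wrong occurrence; an agent that gets a non-unique error can retry with a longer, more specific snippet.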

Time Horizon

ML-Dev-Bench is a multi-turn environment. The agent iterates using bash, view, str_replace, and create_file tools to develop and test ML solutions before submitting the final answer.
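The multi-turn loop can be sketched as below. The agent and environment interfaces here are hypothetical stand-ins; the point is only the control flow: iterate with editing tools until a single submit_answer call ends the episode.

```python
def run_episode(agent, env, max_turns: int = 50) -> float:
    """Minimal multi-turn loop sketch (agent/env interfaces are assumed).

    The agent issues tool calls (bash, view, str_replace, create_file)
    until it calls submit_answer, which runs the test suite and ends the
    episode; submit_answer can be used only once.
    """
    observation = env.instructions()       # initial task instructions
    for _ in range(max_turns):
        call = agent.act(observation)      # agent chooses the next tool call
        if call.tool == "submit_answer":
            return env.submit(call.args)   # runs tests, returns final reward
        observation = env.execute(call)    # bash / view / str_replace / create_file
    return 0.0  # ran out of turns without submitting
```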

Environment Difficulty

ML-Dev-Bench is challenging, particularly for open-ended tasks. The original paper evaluates five agent-LLM configurations across 30 tasks:

| Agent            | Overall | Dataset Handling | Debugging | Model Training | Model Perf. |
|------------------|---------|------------------|-----------|----------------|-------------|
| OpenHands-Sonnet | 50%     | 100%             | 57%       | 83%            | 0%          |
| ReAct-Sonnet     | 47%     | 100%             | 57%       | 67%            | 0%          |
| OpenHands-Gemini | 17%     | 66%              | 14%       | 33%            | 0%          |
| AIDE-4o          | 17%     | 33%              | 29%       | 33%            | 0%          |
| ReAct-4o         | 17%     | 0%               | 14%       | 50%            | 0%          |

Agents perform well on structured tasks like dataset handling but struggle with debugging and model implementation. All agents completely failed open-ended model performance improvement tasks.

Other Environment Requirements

There are no further environment requirements; ML-Dev-Bench works out of the box with the OpenReward endpoint.

Safety

Agents in ML-Dev-Bench develop machine learning models inside sandboxed Docker containers. The environment does not present direct safety risks, as agents only interact with isolated containers with no access to external systems beyond the sandbox.

Citations

@article{padigela2025mldevbench,
  title={ML-Dev-Bench: Comparative Analysis of AI Agents on ML Development Workflows},
  author={Padigela, Harshith and Shah, Chintan and Juyal, Dinkar},
  journal={arXiv preprint arXiv:2502.00964},
  year={2025}
}