MLE-Bench

MLEBench

OpenReward Environment

Description

MLEBench is an environment for evaluating machine learning engineering capabilities. It is based on OpenAI's MLE-bench: agents are given Kaggle competitions and must develop and train ML models, then submit their predictions. Tasks use real Kaggle competition datasets and cover diverse domains, including classification, regression, computer vision, NLP, and time series.

Capabilities

  • Machine learning model development and training
  • Data exploration and feature engineering
  • Working with Kaggle competition formats
  • Iterative model improvement and submission validation

Compute Requirements

Competitions are assigned sandbox compute based on their requirements:

  • GPU tasks (70 competitions): nvidia-l4 (NVIDIA L4 GPU)
  • CPU tasks (7 competitions): 4:8 (4 CPUs, 8 GB RAM)

CPU-only competitions: new-york-city-taxi-fare-prediction, tabular-playground-series-dec-2021, tabular-playground-series-may-2022, ventilator-pressure-prediction, playground-series-s3e18, spaceship-titanic, nomad2018-predict-transparent-conductors.
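
The assignment above can be sketched as a simple lookup. This is a minimal illustration, not the environment's actual code: the helper name `sandbox_compute` is hypothetical, and it assumes every competition outside the CPU-only list is a GPU task.

```python
# CPU-only competitions, copied from the list above.
CPU_ONLY = frozenset({
    "new-york-city-taxi-fare-prediction",
    "tabular-playground-series-dec-2021",
    "tabular-playground-series-may-2022",
    "ventilator-pressure-prediction",
    "playground-series-s3e18",
    "spaceship-titanic",
    "nomad2018-predict-transparent-conductors",
})


def sandbox_compute(competition_id: str) -> str:
    """Return the compute label for a competition (hypothetical helper).

    Assumption: any competition not in CPU_ONLY runs on the GPU sandbox.
    """
    return "4:8" if competition_id in CPU_ONLY else "nvidia-l4"
```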

License

MIT

Tasks

The original MLE-bench contains 82 Kaggle competitions. We currently use a 77-competition subset in a single split:

| Split | Type | Tasks | Description |
|---|---|---|---|
| or_initial | test | 77 | 70 GPU + 7 CPU competitions |

Reward Structure

MLEBench is a multi-turn environment. The agent develops an ML model in a sandbox, writes a submission CSV to /home/agent/submission/submission.csv, and calls the answer tool to submit it. The reward is normalized as (score - bad_baseline) / (max_baseline - bad_baseline), where both baselines are competition-specific. If grading fails, the reward is -5.
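
As a worked illustration of the normalization, here is a minimal sketch. The function name, the use of `None` to signal a grading error, and the `ValueError` for equal baselines are assumptions made for this example; whether the environment clips the result is not stated here, so the sketch does not clip.

```python
from typing import Optional

# Reward returned when grading fails, as stated above.
GRADING_ERROR_REWARD = -5.0


def normalized_reward(
    score: Optional[float],
    bad_baseline: float,
    max_baseline: float,
) -> float:
    """Normalize a raw competition score (hypothetical helper).

    Assumption: a grading error is represented here by score=None.
    """
    if score is None:
        return GRADING_ERROR_REWARD
    span = max_baseline - bad_baseline
    if span == 0:
        # Assumption: equal baselines are treated as a configuration error.
        raise ValueError("bad_baseline and max_baseline must differ")
    return (score - bad_baseline) / span


# Higher-is-better metric (e.g. accuracy): 0.75 between 0.5 and 1.0.
print(normalized_reward(0.75, bad_baseline=0.5, max_baseline=1.0))  # 0.5
# Lower-is-better metric (e.g. RMSE): 0.5 between 1.0 (bad) and 0.0 (best).
print(normalized_reward(0.5, bad_baseline=1.0, max_baseline=0.0))  # 0.5
```

The same formula handles lower-is-better metrics as long as bad_baseline is the worse value: numerator and denominator are then both negative, so better scores still map to larger rewards.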

Data

Competition datasets are mounted read-only from cloud storage. Each competition includes:

  • train.csv - training data with labels
  • test.csv - test data for predictions
  • sample_submission.csv - required submission format
  • description.md - competition objectives and evaluation metric
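
Because sample_submission.csv defines the required format, an agent can sanity-check its own file before submitting. The sketch below is one possible check, not the environment's grader: `check_submission` is a hypothetical helper, and it assumes that matching the header and row count of the sample is a necessary (not sufficient) condition for a valid submission.

```python
import csv
from typing import List, Tuple


def read_csv(path: str) -> Tuple[List[str], List[List[str]]]:
    """Return (header, data_rows) for a CSV file; both empty if the file is empty."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return [], []
    return rows[0], rows[1:]


def check_submission(sample_path: str, submission_path: str) -> List[str]:
    """Compare a submission against the sample format (hypothetical helper).

    Returns a list of human-readable problems; an empty list means the
    header and row count both match the sample.
    """
    problems = []
    sample_header, sample_rows = read_csv(sample_path)
    sub_header, sub_rows = read_csv(submission_path)
    if sub_header != sample_header:
        problems.append(f"header {sub_header} != expected {sample_header}")
    if len(sub_rows) != len(sample_rows):
        problems.append(f"{len(sub_rows)} rows != expected {len(sample_rows)}")
    return problems
```

In the sandbox, `submission_path` would be /home/agent/submission/submission.csv. The location of the mounted competition data is not given here, so `sample_path` is left as an argument.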

Tools

MLEBench provides 6 tools for agents:

  • bash - Execute shell commands in the sandbox
  • view - View file contents with optional line range
  • str_replace - Replace strings in files
  • insert - Insert content at a specified line
  • create - Create new files
  • answer - Submit the final submission CSV for grading

Time Horizon

MLEBench is a multi-turn environment. In a typical episode, the agent reads the competition description, explores the data, develops and trains models, iterates on them, validates its submission file, and submits it for final scoring.

Environment Difficulty

Competitions span a wide range of difficulty, from tabular classification (e.g., spaceship-titanic) to complex computer vision and NLP tasks (e.g., rsna-breast-cancer-detection, lmsys-chatbot-arena).

Other Environment Requirements

There are no further environment requirements; MLEBench works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in MLEBench develop ML models in a sandboxed environment. The environment does not present direct safety risks.

Citations

@article{chan2024mlebench,
  title={MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering},
  author={Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Lukasz and Patil, Tejal and Rein, David and Beutel, Alex},
  journal={arXiv preprint arXiv:2410.07095},
  year={2024}
}