MLEBench
Description
MLEBench is an environment for evaluating machine learning engineering capabilities. Based on OpenAI's MLE-bench, agents are given Kaggle competitions and must develop, train, and submit ML models. Tasks cover diverse domains including classification, regression, computer vision, NLP, and time series across real Kaggle competition datasets.
Capabilities
- Machine learning model development and training
- Data exploration and feature engineering
- Working with Kaggle competition formats
- Iterative model improvement and submission validation
Compute Requirements
Competitions are assigned sandbox compute based on their requirements:
- GPU tasks (70 competitions): nvidia-l4 (NVIDIA L4 GPU)
- CPU tasks (7 competitions): 4:8 (4 CPUs, 8 GB RAM)
CPU-only competitions: new-york-city-taxi-fare-prediction, tabular-playground-series-dec-2021, tabular-playground-series-may-2022, ventilator-pressure-prediction, playground-series-s3e18, spaceship-titanic, nomad2018-predict-transparent-conductors.
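The assignment rule above can be sketched as a simple lookup. This is a minimal illustration, assuming that any competition not in the CPU-only list gets the GPU spec; the names `CPU_ONLY_COMPETITIONS`, `GPU_SPEC`, `CPU_SPEC`, and `compute_spec_for` are hypothetical and are not part of the environment's API.

```python
# Hypothetical helper: all names here are illustrative, not the environment's API.
CPU_ONLY_COMPETITIONS = frozenset({
    "new-york-city-taxi-fare-prediction",
    "tabular-playground-series-dec-2021",
    "tabular-playground-series-may-2022",
    "ventilator-pressure-prediction",
    "playground-series-s3e18",
    "spaceship-titanic",
    "nomad2018-predict-transparent-conductors",
})

GPU_SPEC = "nvidia-l4"  # NVIDIA L4 GPU
CPU_SPEC = "4:8"        # 4 CPUs, 8 GB RAM


def compute_spec_for(competition_id: str) -> str:
    """Return the sandbox compute label for a competition.

    Assumption: every competition outside the CPU-only list runs on the GPU spec.
    """
    return CPU_SPEC if competition_id in CPU_ONLY_COMPETITIONS else GPU_SPEC
```

For example, `compute_spec_for("spaceship-titanic")` gives the CPU label, while any GPU competition such as `rsna-breast-cancer-detection` gives `nvidia-l4`.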
License
Tasks
The original MLE-bench contains 82 Kaggle competitions. We currently use a 77-competition subset in a single split:
| Split | Type | Tasks | Description |
|---|---|---|---|
| or_initial | test | 77 | 70 GPU + 7 CPU competitions |
Reward Structure
MLEBench is a multi-turn environment. The agent develops an ML model in a sandbox, writes a submission CSV to /home/agent/submission/submission.csv, and calls the answer tool to submit it. The reward is normalized as (score - bad_baseline) / (max_baseline - bad_baseline), where both baselines are competition-specific. If grading fails, the reward is -5.
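To make the normalization concrete, here is a minimal sketch of the formula. The function name `normalized_reward`, the constant `GRADING_ERROR_REWARD`, and the use of `None` to signal a grading error are assumptions for illustration; the actual grader's interface is not described here.

```python
from typing import Optional

GRADING_ERROR_REWARD = -5.0  # reward returned when grading fails


def normalized_reward(
    score: Optional[float], bad_baseline: float, max_baseline: float
) -> float:
    """Compute (score - bad_baseline) / (max_baseline - bad_baseline).

    Assumption: score is None when the submission could not be graded.
    """
    if score is None:
        return GRADING_ERROR_REWARD
    span = max_baseline - bad_baseline
    if span == 0:
        raise ValueError("bad_baseline and max_baseline must differ")
    return (score - bad_baseline) / span


# A score halfway between the two baselines maps to 0.5.
print(normalized_reward(0.75, bad_baseline=0.5, max_baseline=1.0))
```

Under this formula, a score equal to bad_baseline yields 0 and a score equal to max_baseline yields 1. If baselines for a lower-is-better metric are stored with bad_baseline greater than max_baseline, the same expression still increases as the score improves, though how the environment handles such metrics is not stated above.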
Data
Competition datasets are mounted read-only from cloud storage. Each competition includes:
- train.csv - training data with labels
- test.csv - test data for predictions
- sample_submission.csv - required submission format
- description.md - competition objectives and evaluation metric
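Since sample_submission.csv defines the required format, an agent might sanity-check its submission against it before submitting. The sketch below is one possible check, not the environment's own validation: `check_submission` is a hypothetical helper, and it assumes the sample file has the same header and row count as a valid submission, which is typical for Kaggle competitions but not guaranteed.

```python
import csv
from pathlib import Path
from typing import List


def check_submission(submission_path: str, sample_path: str) -> List[str]:
    """Return a list of format problems; an empty list means none were found."""
    if not Path(submission_path).is_file():
        return [f"missing file: {submission_path}"]

    with open(sample_path, newline="") as f:
        sample_rows = list(csv.reader(f))
    with open(submission_path, newline="") as f:
        sub_rows = list(csv.reader(f))

    if not sample_rows:
        return ["sample submission is empty"]
    if not sub_rows:
        return ["submission is empty"]

    problems = []
    # Column names and order should match the sample exactly.
    if sub_rows[0] != sample_rows[0]:
        problems.append(
            f"header mismatch: expected {sample_rows[0]}, got {sub_rows[0]}"
        )
    # Assumed: one prediction row per row in the sample file.
    if len(sub_rows) != len(sample_rows):
        problems.append(
            f"row count mismatch: expected {len(sample_rows) - 1}, "
            f"got {len(sub_rows) - 1}"
        )
    return problems
```

In a sandbox session this could be run against /home/agent/submission/submission.csv and the competition's sample_submission.csv. It checks only the file's shape, not the quality of the predictions.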
Tools
MLEBench provides 6 tools for agents:
- bash - Execute shell commands in the sandbox
- view - View file contents with optional line range
- str_replace - Replace strings in files
- insert - Insert content at a specified line
- create - Create new files
- answer - Submit the final submission CSV for grading
Time Horizon
MLEBench is a multi-turn environment. Agents read competition descriptions, explore data, develop models, train and iterate, validate submissions, and submit for final scoring.
Environment Difficulty
Competitions span a wide range of difficulty, from tabular classification (e.g., spaceship-titanic) to complex computer vision and NLP tasks (e.g., rsna-breast-cancer-detection, lmsys-chatbot-arena).
Other Environment Requirements
There are no further environment requirements; MLEBench works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in MLEBench develop ML models in a sandboxed environment. The environment does not present direct safety risks.
Citations
@article{chan2024mlebench,
title={MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering},
author={Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Leon and Patwardhan, Tejal and Weng, Lilian and Mądry, Aleksander},
journal={arXiv preprint arXiv:2410.07095},
year={2024}
}