MLEBench

Description

MLEBench is an environment for evaluating machine learning engineering capabilities, based on OpenAI's MLE-bench. Agents are given Kaggle competitions and must develop, train, and submit ML models. Tasks span classification, regression, computer vision, NLP, and time series, all drawn from real Kaggle competition datasets.

Capabilities

Machine learning model development and training
Data exploration and feature engineering
Working with Kaggle competition formats
Iterative model improvement and submission validation

Compute Requirements

Competitions are assigned sandbox compute based on their requirements:

GPU tasks (70 competitions): nvidia-l4 (NVIDIA L4 GPU)
CPU tasks (7 competitions): 4:8 (4 CPUs, 8 GB RAM)

CPU-only competitions: new-york-city-taxi-fare-prediction, tabular-playground-series-dec-2021, tabular-playground-series-may-2022, ventilator-pressure-prediction, playground-series-s3e18, spaceship-titanic, nomad2018-predict-transparent-conductors.
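
The assignment above can be expressed as a simple lookup. The following is a minimal sketch, not the environment's actual code: the helper name `sandbox_spec` and the constants are hypothetical, and it assumes every competition outside the CPU-only list runs on the GPU spec.

```python
# Competition IDs listed above as CPU-only. Assumption: every other
# competition in the 77-task subset uses the GPU spec.
CPU_ONLY = frozenset({
    "new-york-city-taxi-fare-prediction",
    "tabular-playground-series-dec-2021",
    "tabular-playground-series-may-2022",
    "ventilator-pressure-prediction",
    "playground-series-s3e18",
    "spaceship-titanic",
    "nomad2018-predict-transparent-conductors",
})

GPU_SPEC = "nvidia-l4"  # NVIDIA L4 GPU
CPU_SPEC = "4:8"        # 4 CPUs, 8 GB RAM


def sandbox_spec(competition_id: str) -> str:
    """Return the sandbox compute spec for a competition (hypothetical helper)."""
    return CPU_SPEC if competition_id in CPU_ONLY else GPU_SPEC
```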

License

MIT

Tasks

The original MLE-bench contains 82 Kaggle competitions. We currently use a 77-competition subset in a single split:

Split	Type	Tasks	Description
`or_initial`	`test`	77	70 GPU + 7 CPU competitions

Reward Structure

MLEBench is a multi-turn environment. The agent develops an ML model in a sandbox, writes a submission CSV to /home/agent/submission/submission.csv, and calls the answer tool to submit it. The reward is normalized as (score - bad_baseline) / (max_baseline - bad_baseline), where both baselines are competition-specific. If grading fails, the reward is -5.
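
The normalization can be sketched in a few lines. This is an illustration under stated assumptions, not the environment's grader: the function name `normalized_reward` is hypothetical, and using `None` to represent a grading error is an assumption made only for this example.

```python
from typing import Optional

GRADING_ERROR_REWARD = -5.0  # reward on grading error, as stated in the text


def normalized_reward(
    score: Optional[float], bad_baseline: float, max_baseline: float
) -> float:
    """Map a raw competition score onto the normalized reward scale."""
    if score is None:
        # Assumption: None stands in for "grading failed".
        return GRADING_ERROR_REWARD
    # A score equal to bad_baseline maps to 0, one equal to max_baseline to 1.
    return (score - bad_baseline) / (max_baseline - bad_baseline)
```

Because both numerator and denominator change sign together, the same expression also rewards lower scores when bad_baseline is larger than max_baseline (for example, an error metric). Whether the environment orders its baselines that way for such metrics is not stated here. The sketch does not guard against equal baselines, which would divide by zero.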

Data

Competition datasets are mounted read-only from cloud storage. Each competition includes:

train.csv - training data with labels
test.csv - test data for predictions
sample_submission.csv - required submission format
description.md - competition objectives and evaluation metric
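
Since sample_submission.csv defines the required format, an agent might compare its output against it before submitting. The sketch below is one possible sanity check, not the grader's logic: the helper `check_submission` is hypothetical, the location of the mounted sample file is not specified here and is therefore a parameter, and the two checks (matching header, matching row count) are assumptions about what a well-formed submission looks like.

```python
import csv
from pathlib import Path

# Submission location used by the environment.
SUBMISSION_PATH = Path("/home/agent/submission/submission.csv")


def check_submission(sample_path, submission_path=SUBMISSION_PATH):
    """Return a list of problems; an empty list means the basic checks passed."""
    with open(sample_path, newline="") as f:
        sample_rows = list(csv.reader(f))
    with open(submission_path, newline="") as f:
        submission_rows = list(csv.reader(f))

    if not sample_rows:
        return ["sample submission is empty"]
    if not submission_rows:
        return ["submission is empty"]

    problems = []
    # The header row should list the same columns in the same order.
    if submission_rows[0] != sample_rows[0]:
        problems.append(
            f"header mismatch: {submission_rows[0]} != {sample_rows[0]}"
        )
    # One prediction row is expected per row in the sample file.
    if len(submission_rows) != len(sample_rows):
        problems.append(
            f"row count mismatch: {len(submission_rows) - 1} data rows, "
            f"expected {len(sample_rows) - 1}"
        )
    return problems
```

Calling `check_submission(path_to_sample)` with the path to the competition's sample_submission.csv returns an empty list when both checks pass.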

Tools

MLEBench provides 6 tools for agents:

bash - Execute shell commands in the sandbox
view - View file contents with optional line range
str_replace - Replace strings in files
insert - Insert content at a specified line
create - Create new files
answer - Submit the final submission CSV for grading

Time Horizon

MLEBench is a multi-turn environment. Agents read competition descriptions, explore data, develop models, train and iterate, validate submissions, and submit for final scoring.

Environment Difficulty

Competitions span a wide range of difficulty, from tabular classification (e.g., spaceship-titanic) to complex computer vision and NLP tasks (e.g., rsna-breast-cancer-detection, lmsys-chatbot-arena).

Other Environment Requirements

There are no further environment requirements; MLEBench works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in MLEBench develop ML models in a sandboxed environment. The environment does not present direct safety risks.

Citations

@article{chan2024mlebench,
  title={MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering},
  author={Chan, Jun Shern and Chowdhury, Neil and Jaffe, Oliver and Aung, James and Sherburn, Dane and Mays, Evan and Starace, Giulio and Liu, Kevin and Maksin, Leon and Patwardhan, Tejal and Weng, Lilian and Mądry, Aleksander},
  journal={arXiv preprint arXiv:2410.07095},
  year={2024}
}

Compute Configuration

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Task-Dependent

Estimated Cost

Component	Cost / second
Environment	$0.0000320
Sandbox	Task-Dependent
Total	$0.0000320
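
As a worked example of the per-second rate, the snippet below converts it into a cost for a run of a given length. It is a sketch only: the helper `environment_cost` is hypothetical, the currency is taken to be US dollars from the `$` sign, and the sandbox cost is left out because it is task-dependent.

```python
ENVIRONMENT_COST_PER_SECOND = 0.0000320  # USD, from the table above


def environment_cost(seconds: float) -> float:
    """Environment-server cost in USD for a run; sandbox cost is not included."""
    return ENVIRONMENT_COST_PER_SECOND * seconds


# One hour of environment-server time.
print(f"${environment_cost(3600):.4f}")  # prints $0.1152
```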
