MMMU
Description
MMMU (Massive Multi-discipline Multimodal Understanding) is an environment for evaluating college-level multimodal reasoning across 6 disciplines, 30 subjects, and 183 subfields. Each question includes up to 7 heterogeneous images (charts, diagrams, tables, chemical structures, music notation, etc.) and requires understanding complex visual and textual information.
Capabilities
- College-level multimodal question answering
- Up to 7 images per question with 30+ image types
- Multiple-choice evaluation across expert-level reasoning tasks
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
This environment provides three splits:
- dev: 150 tasks
- validation: 900 tasks
- test: 10,500 tasks
Total: 11,550 college-level questions spanning Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
Reward Structure
Single-turn evaluation with deterministic grading. The agent submits a single letter answer via the submit_answer tool, and the submission is compared against the ground truth by exact match. Reward is 1.0 if correct, 0.0 otherwise.
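As a minimal sketch of this grading rule (the case and whitespace normalization shown is an assumption for illustration; the platform's exact-match comparison may be stricter):

```python
def grade(submitted: str, ground_truth: str) -> float:
    """Exact-match grading: reward 1.0 for the correct letter, else 0.0.

    Stripping whitespace and upper-casing are illustrative assumptions,
    not the platform's documented normalization.
    """
    return 1.0 if submitted.strip().upper() == ground_truth.strip().upper() else 0.0

print(grade("b", "B"))  # 1.0
print(grade("A", "B"))  # 0.0
```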
Data
Parquet files (~605 MB total) for the dev, validation, and test splits, sourced from the HuggingFace MMMU/MMMU dataset and stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit a single letter answer. Deterministic evaluation via exact match. Ends the episode. |
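A hypothetical submit_answer tool-call payload might look like the following; the field names and wire format are assumptions for illustration, not the platform's documented schema:

```python
import json

# Hypothetical submit_answer call (field names are illustrative assumptions).
call = {"tool": "submit_answer", "arguments": {"answer": "C"}}

payload = json.dumps(call)             # what the agent would send
decoded = json.loads(payload)          # what the grader would parse
print(decoded["arguments"]["answer"])  # C
```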
Time Horizon
Single-turn. The agent reads the multimodal question (text and images) and submits one answer.
Environment Difficulty
Representative accuracies on MMMU:
| Model | Accuracy |
|---|---|
| Gemini 3 Flash | 87.6% |
| Gemini 3 Pro | 87.5% |
| GPT-5.2 | 86.7% |
| Claude 4.5 Sonnet | 77.8% |
| Human Expert | 88.6% |
Models have now surpassed the performance of average human experts (76.2%) but still trail top human experts (88.6%).
Other Environment Requirements
There are no further environment requirements; MMMU works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in MMMU answer college-level multimodal questions in a standard environment. The environment does not present direct safety risks.
Citation
@article{yue2023mmmu,
  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
  author={Yue, Xiang and Ni, Yuansheng and Zhang, Kai and Zheng, Tianyu and Liu, Ruoqi and Zhang, Ge and Stevens, Samuel and Jiang, Dongfu and Ren, Weiming and Sun, Yuxuan and others},
  journal={arXiv preprint arXiv:2311.16502},
  year={2023}
}