MMLU-Pro
Description
MMLU-Pro is an environment for evaluating advanced multi-task language understanding. Based on the MMLU-Pro benchmark by TIGER-Lab, it extends the original MMLU with harder, reasoning-focused questions and expands the choice set from 4 to 10 options. The benchmark contains over 12,000 rigorously curated questions across 14 domains from academic exams and textbooks.
Capabilities
- Graduate-level academic reasoning across 14 domains
- Ten-option multiple-choice question answering (A through J)
- Chain-of-thought reasoning evaluation
- STEM, humanities, social sciences, and professional domains
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
There are two splits in this environment:
- test: 12,000 tasks
- validation: 70 tasks
Questions span 14 categories: Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, Computer Science, History, and Other.
Reward Structure
This is a single-turn environment. The agent submits a single letter (A through J) via the answer tool. The answer is compared to the correct option by exact letter match after stripping whitespace and punctuation. Reward is binary: 1.0 if the letter matches the correct answer, 0.0 otherwise. No LLM grading is used.
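The grading rule above can be sketched as a small function. This is a minimal illustration, not the environment's actual grader; the uppercase normalization is an assumption beyond the stated rule:

```python
import string

def grade(submission: str, correct_letter: str) -> float:
    """Binary reward: 1.0 on exact letter match, 0.0 otherwise.

    Strips surrounding whitespace and punctuation before comparing,
    mirroring the matching rule described above. Case folding is an
    assumption added for robustness.
    """
    cleaned = submission.strip(string.punctuation + string.whitespace).upper()
    return 1.0 if cleaned == correct_letter.strip().upper() else 0.0
```

For example, a submission of `" (c) "` against a correct answer of `"C"` would score 1.0 after stripping, while `"B"` would score 0.0.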
Data
Data is loaded from HuggingFace TIGER-Lab/MMLU-Pro at module import time using the datasets library. Each row contains a question ID, question, up to 10 options, correct answer letter, answer index, chain-of-thought content, category, and source.
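To make the row schema concrete, here is a hedged sketch that builds a ten-option prompt from one row. The field names follow the description above; the sample row content is invented for illustration and is not from the real dataset:

```python
# Hypothetical row mirroring the schema described above (invented content).
row = {
    "question_id": 101,
    "question": "Which data structure gives O(1) amortized append?",
    "options": ["Linked list", "Dynamic array", "Binary heap", "B-tree",
                "Hash map", "Skip list", "Trie", "Stack", "Queue", "Deque"],
    "answer": "B",
    "answer_index": 1,
    "cot_content": "",
    "category": "Computer Science",
    "src": "synthetic-example",
}

LETTERS = "ABCDEFGHIJ"

def format_prompt(row: dict) -> str:
    """Render a question and its (up to 10) options as a lettered prompt."""
    lines = [row["question"]]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(row["options"])]
    return "\n".join(lines)
```

In practice the rows would come from `datasets.load_dataset("TIGER-Lab/MMLU-Pro")`; questions with fewer than ten options simply produce fewer lettered lines.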
Tools
| Tool | Description |
|---|---|
| answer | Submit your answer as a single letter (A through J). Ends the episode. |
Time Horizon
Single-turn. The agent reads the question with options and submits one answer.
Environment Difficulty
MMLU-Pro causes a 16-33% accuracy drop compared to the original MMLU due to harder questions and 10 answer options, though frontier models are now approaching saturation.
| Model | MMLU-Pro Score |
|---|---|
| Gemini 3 Pro | 90.1% |
| Claude Opus 4.5 (Reasoning) | 89.5% |
| Gemini 3 Flash | 88.6% |
| Claude Opus 4.1 (Thinking) | 87.9% |
Other Environment Requirements
There are no further environment requirements.
Safety
Agents in MMLU-Pro answer multiple-choice academic questions in a standard environment. The environment does not present direct safety risks.
Citation
@inproceedings{wang2024mmlupro,
title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
author={Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2024}
}