MMLU-Pro


Description

MMLU-Pro is an environment for evaluating advanced multi-task language understanding. Based on the MMLU-Pro benchmark by TIGER-Lab, it extends the original MMLU with harder, reasoning-focused questions and expands the choice set from 4 to 10 options. The benchmark contains over 12,000 rigorously curated questions across 14 domains from academic exams and textbooks.

Capabilities

  • Graduate-level academic reasoning across 14 domains
  • Ten-option multiple-choice question answering (A through J; a prompt sketch follows this list)
  • Chain-of-thought reasoning evaluation
  • STEM, humanities, social sciences, and professional domains
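
As an illustrative sketch, a ten-option prompt could be assembled as below. Rows are assumed to follow the public TIGER-Lab/MMLU-Pro schema (a question string plus an options list); the environment's actual prompt template is not specified here.

  import string

  def format_prompt(row: dict) -> str:
      # Hypothetical prompt template: question text followed by up to ten
      # options labeled A through J.
      lines = [row["question"], ""]
      for letter, option in zip(string.ascii_uppercase[:10], row["options"]):
          lines.append(f"{letter}. {option}")
      lines.append("")
      lines.append("Respond with a single letter from A to J.")
      return "\n".join(lines)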

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There are two splits in this environment:

  • test: 12,000 tasks
  • validation: 70 tasks

Questions span 14 categories: Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, Computer Science, History, and Other.

Reward Structure

This is a single-turn environment. The agent submits a single letter (A through J) via the answer tool. The answer is compared to the correct option by exact letter match after stripping whitespace and punctuation. Reward is binary: 1.0 if the letter matches the correct answer, 0.0 otherwise. No LLM grading is used.
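
A minimal sketch of this grading rule in Python (the function name and the exact normalization are assumptions; the environment's implementation may differ in detail):

  import string

  def grade(submitted: str, correct: str) -> float:
      # Binary reward: exact letter match after stripping whitespace and
      # punctuation, as described above.
      strip_chars = string.whitespace + string.punctuation
      sub = submitted.strip(strip_chars).upper()
      return 1.0 if sub == correct.strip(strip_chars).upper() else 0.0

  assert grade(" (B) ", "B") == 1.0
  assert grade("C.", "B") == 0.0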

Data

Data is loaded from the Hugging Face dataset TIGER-Lab/MMLU-Pro at module import time using the datasets library. Each row contains a question ID, the question text, up to 10 options, the correct answer letter, an answer index, chain-of-thought content, a category, and a source.
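
As a sketch, loading and inspecting the rows with the datasets library might look like this; the column names below match the public TIGER-Lab/MMLU-Pro dataset, so treat them as assumptions if the environment remaps fields:

  from datasets import load_dataset

  # Load both splits of the public dataset; the environment does this once
  # at module import time.
  ds = load_dataset("TIGER-Lab/MMLU-Pro")

  row = ds["validation"][0]
  # Column names as published on the Hugging Face hub (an assumption here):
  # question_id, question, options (list of up to 10 strings), answer
  # (correct letter), answer_index, cot_content, category, src.
  print(row["question_id"], row["category"], row["answer"])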

Tools

Tool      Description
answer    Submit your answer as a single letter (A through J). Ends the episode.
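
For illustration, a call to this tool might be shaped as below; only the tool name and its single-letter payload come from the table above, and the argument name letter is hypothetical:

  # Hypothetical tool-call payload; the argument name "letter" is an
  # assumption, since only the tool name "answer" is specified above.
  tool_call = {"name": "answer", "arguments": {"letter": "B"}}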

Time Horizon

Single-turn. The agent reads the question with options and submits one answer.

Environment Difficulty

MMLU-Pro causes a 16-33% accuracy drop relative to the original MMLU, owing to its harder questions and expanded 10-option format, though frontier models are now approaching saturation.

Model                          MMLU-Pro Score
Gemini 3 Pro                   90.1%
Claude Opus 4.5 (Reasoning)    89.5%
Gemini 3 Flash                 88.6%
Claude Opus 4.1 (Thinking)     87.9%

Other Environment Requirements

There are no further environment requirements.

Safety

Agents in MMLU-Pro answer multiple-choice academic questions in a standard environment. The environment does not present direct safety risks.

Citation

@inproceedings{wang2024mmlupro,
  title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
  author={Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}