MMLU-Pro


Description

MMLU-Pro is an environment for evaluating advanced multi-task language understanding. Based on the MMLU-Pro benchmark by TIGER-Lab, it extends the original MMLU with harder, reasoning-focused questions and expands the choice set from 4 to 10 options. The benchmark contains over 12,000 rigorously curated questions across 14 domains from academic exams and textbooks.

Capabilities

  • Graduate-level academic reasoning across 14 domains
  • Ten-option multiple-choice question answering (A through J; a prompt sketch follows this list)
  • Chain-of-thought reasoning evaluation
  • STEM, humanities, social sciences, and professional domains
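
As an illustrative sketch, a ten-option prompt could be assembled as below. Rows are assumed to follow the public TIGER-Lab/MMLU-Pro schema (a question string plus an options list); the environment's actual prompt template is not specified here.

  import string

  def format_prompt(row: dict) -> str:
      # Hypothetical prompt template: question text followed by up to ten
      # options labeled A through J.
      lines = [row["question"], ""]
      for letter, option in zip(string.ascii_uppercase[:10], row["options"]):
          lines.append(f"{letter}. {option}")
      lines.append("")
      lines.append("Respond with a single letter from A to J.")
      return "\n".join(lines)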

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There are two splits in this environment:

  • test: 12,000 tasks
  • validation: 70 tasks

Questions span 14 categories: Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, Computer Science, History, and Other.

Reward Structure

This is a single-turn environment. The agent submits a single letter (A through J) via the answer tool. The answer is compared to the correct option by exact letter match after stripping whitespace and punctuation. Reward is binary: 1.0 if the letter matches the correct answer, 0.0 otherwise. No LLM grading is used.
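
A minimal sketch of this grading rule in Python (the function name and the exact normalization are assumptions; the environment's implementation may differ in detail):

  import string

  def grade(submitted: str, correct: str) -> float:
      # Binary reward: exact letter match after stripping whitespace and
      # punctuation, as described above.
      strip_chars = string.whitespace + string.punctuation
      sub = submitted.strip(strip_chars).upper()
      return 1.0 if sub == correct.strip(strip_chars).upper() else 0.0

  assert grade(" (B) ", "B") == 1.0
  assert grade("C.", "B") == 0.0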

Data

Data is loaded from the Hugging Face dataset TIGER-Lab/MMLU-Pro at module import time using the datasets library. Each row contains a question ID, the question text, up to 10 options, the correct answer letter, an answer index, chain-of-thought content, a category, and a source.
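
As a sketch, loading and inspecting the rows with the datasets library might look like this; the column names below match the public TIGER-Lab/MMLU-Pro dataset, so treat them as assumptions if the environment remaps fields:

  from datasets import load_dataset

  # Load both splits of the public dataset; the environment does this once
  # at module import time.
  ds = load_dataset("TIGER-Lab/MMLU-Pro")

  row = ds["validation"][0]
  # Column names as published on the Hugging Face hub (an assumption here):
  # question_id, question, options (list of up to 10 strings), answer
  # (correct letter), answer_index, cot_content, category, src.
  print(row["question_id"], row["category"], row["answer"])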

Tools

Tool      Description
answer    Submit your answer as a single letter (A through J). Ends the episode.
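
For illustration, a call to this tool might be shaped as below; only the tool name and its single-letter payload come from the table above, and the argument name letter is hypothetical:

  # Hypothetical tool-call payload; the argument name "letter" is an
  # assumption, since only the tool name "answer" is specified above.
  tool_call = {"name": "answer", "arguments": {"letter": "B"}}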

Time Horizon

Single-turn. The agent reads the question with options and submits one answer.

Environment Difficulty

MMLU-Pro causes a 16-33% accuracy drop relative to the original MMLU, owing to its harder questions and expanded 10-option format, though frontier models are now approaching saturation.

Model                          MMLU-Pro Score
Gemini 3 Pro                   90.1%
Claude Opus 4.5 (Reasoning)    89.5%
Gemini 3 Flash                 88.6%
Claude Opus 4.1 (Thinking)     87.9%

Other Environment Requirements

There are no further environment requirements.

Safety

Agents in MMLU-Pro answer multiple-choice academic questions in a standard environment. The environment does not present direct safety risks.

Citation

@inproceedings{wang2024mmlupro,
  title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
  author={Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}