# MMMLU
## Description
MMMLU is an environment for evaluating multilingual massive multitask language understanding. It is built on OpenAI's MMMLU dataset, which provides professional human translations of the MMLU test set into 14 languages. Agents answer 4-option multiple-choice questions (A/B/C/D) across 57 subject categories spanning STEM, the humanities, the social sciences, and professional domains.
## Capabilities
- Multilingual knowledge reasoning across 14 languages
- Multiple-choice question answering across 57 subject categories
- Comprehension of professional human translations (not machine-translated text)
## Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
## License
## Tasks
15 splits:
- 14 language-specific splits (ar_xy, bn_bd, de_de, es_la, fr_fr, hi_in, id_id, it_it, ja_jp, ko_kr, pt_br, sw_ke, yo_ng, zh_cn), each with 14,042 tasks
- 1 combined test split (196,588 tasks)
Total: 196,588 tasks (14 × 14,042); the combined split aggregates the 14 language-specific splits.
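
As a quick sanity check on the counts above, a minimal sketch (the split names and sizes are copied from the list above):

```python
# Language-specific split names, as listed above.
LANG_SPLITS = [
    "ar_xy", "bn_bd", "de_de", "es_la", "fr_fr", "hi_in", "id_id",
    "it_it", "ja_jp", "ko_kr", "pt_br", "sw_ke", "yo_ng", "zh_cn",
]
TASKS_PER_LANG = 14_042

assert len(LANG_SPLITS) == 14
# 14 languages x 14,042 questions = 196,588, matching the combined test split.
assert len(LANG_SPLITS) * TASKS_PER_LANG == 196_588
```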
## Reward Structure
Single-turn evaluation. The agent submits an answer (A, B, C, or D) via the submit_answer tool. The reward is deterministic exact-match grading: 1.0 if the submitted letter matches the reference answer, 0.0 otherwise.
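
A minimal sketch of the grading rule described above; the function name and the whitespace/case normalization are illustrative assumptions, not the platform's actual implementation:

```python
def grade(submitted: str, reference: str) -> float:
    """Exact-match grading: 1.0 if the submitted letter matches the
    reference answer (A/B/C/D), 0.0 otherwise. The strip/upper
    normalization is an assumption, not documented platform behavior."""
    return 1.0 if submitted.strip().upper() == reference.strip().upper() else 0.0

assert grade("B", "B") == 1.0
assert grade("C", "D") == 0.0
```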
## Data
15 Parquet files (~100 MB total), sourced from the Hugging Face dataset openai/MMMLU. The data is stored on the OpenReward platform.
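
For reference, the underlying dataset can be pulled straight from Hugging Face. A minimal sketch, assuming the subset and column names from the openai/MMMLU dataset card (this is not how OpenReward itself serves the data):

```python
# The subset name ("DE_DE") and the column names ("Question", "A".."D",
# "Answer") follow the openai/MMMLU dataset card; treat them as assumptions.
from datasets import load_dataset

ds = load_dataset("openai/MMMLU", "DE_DE", split="test")
row = ds[0]
question = row["Question"]
choices = {letter: row[letter] for letter in "ABCD"}
reference = row["Answer"]  # one of "A", "B", "C", "D"
print(question, choices, reference)
```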
## Tools
- `submit_answer`: Submit an answer choice (A, B, C, or D).
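
A hedged sketch of what a function-calling schema for this tool might look like; the exact schema exposed by the environment is not specified in this document:

```python
# Hypothetical function-calling schema for submit_answer; the actual
# schema served by OpenReward may differ.
SUBMIT_ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Submit an answer choice for the current question.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "enum": ["A", "B", "C", "D"],
                "description": "The chosen answer letter.",
            }
        },
        "required": ["answer"],
    },
}
```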
## Time Horizon
Single-turn.
## Environment Difficulty
OpenAI reports the following accuracies on MMMLU, averaged across the 14 languages:

| Model | Average Accuracy |
|---|---|
| o3-high | 88.8% |
| o1 | 87.7% |
| o4-mini-high | 85.2% |
| gpt-4.5-preview | 85.1% |
| gpt-4.1 | 83.7% |
| gpt-4o | 81.4% |
| gpt-4.1-mini | 78.5% |
| gpt-4o-mini | 70.5% |
## Other Environment Requirements
There are no further environment requirements; MMMLU works out of the box against the OpenReward endpoint and needs no external API keys.
## Safety
Agents in MMMLU answer multilingual multiple-choice questions in a standard environment. The environment does not present direct safety risks.
## Citation

```bibtex
@article{hendrycks2021measuring,
  title={Measuring Massive Multitask Language Understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
```