MMLU-ProX
Description
MMLU-ProX is an environment for evaluating agents on multilingual multiple-choice question answering. It is based on the MMLU-ProX dataset from HuggingFace (li-lab/MMLU-ProX), which extends MMLU-Pro to 29 languages. Each task presents a question with 10 answer options (A through J) across 14+ subject categories. Grading is deterministic via exact match.
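To make the task format concrete, here is a minimal sketch of how a question with 10 lettered options might be rendered into a prompt. The `format_prompt` helper is hypothetical, not the environment's actual rendering code.

```python
def format_prompt(question: str, options: list[str]) -> str:
    """Render a question and its lettered options (A-J) as a single prompt string.

    This is an illustrative format only; the environment's real prompt
    template may differ.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [question, ""] + [f"{letter}. {text}" for letter, text in zip(letters, options)]
    return "\n".join(lines)

# A task always presents exactly 10 options, labeled A through J.
prompt = format_prompt("What is 2 + 2?", [str(n) for n in range(1, 11)])
```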
Capabilities
- Multilingual multiple-choice question answering across 29 languages
- Knowledge reasoning across 14+ subject categories (mathematics, science, health, business, humanities, etc.)
- Single-turn evaluation with deterministic grading
Compute Requirements
MMLU-ProX extends Environment directly and does not require a sandbox. It has minimal compute requirements.
License
MIT.
Tasks
There are 58 splits (29 languages × 2 split types), named in the format {language}_{split}:
- Validation: 70 examples per language (2,030 total)
- Test: ~11,800 examples per language (~341,011 total)
- Total: 343,041 examples
Languages: af, ar, bn, cs, de, en, es, fr, hi, hu, id, it, ja, ko, mr, ne, pt, ru, sr, sw, te, th, uk, ur, vi, wo, yo, zh, zu.
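The full set of split names follows directly from the language codes above. A short sketch that enumerates them (the `{language}_{split}` naming comes from the format described above; the list literal is just the codes restated):

```python
# The 29 language codes supported by MMLU-ProX.
LANGUAGES = [
    "af", "ar", "bn", "cs", "de", "en", "es", "fr", "hi", "hu",
    "id", "it", "ja", "ko", "mr", "ne", "pt", "ru", "sr", "sw",
    "te", "th", "uk", "ur", "vi", "wo", "yo", "zh", "zu",
]
SPLIT_TYPES = ["validation", "test"]

# 29 languages x 2 split types = 58 splits, e.g. "en_validation", "sw_test".
splits = [f"{lang}_{split}" for lang in LANGUAGES for split in SPLIT_TYPES]
```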
Questions span 14+ subject areas including mathematics, science, health, business, humanities, computer science, law, and more.
Reward Structure
This is a sparse, verifiable reward environment with binary scoring. The agent calls submit_answer once with a letter (A-J). The answer is compared via exact match against the correct answer:
- Correct: Reward 1.0.
- Incorrect: Reward 0.0.
We do not use LLM graders for this task.
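The binary exact-match grading above can be sketched in a few lines. Note the case/whitespace normalization here is an assumption for robustness, not a documented behavior of the environment:

```python
def grade(submitted: str, correct: str) -> float:
    """Deterministic exact-match grading: reward 1.0 for the correct
    letter, 0.0 otherwise.

    Stripping whitespace and upper-casing is an illustrative assumption;
    the environment may compare letters strictly.
    """
    return 1.0 if submitted.strip().upper() == correct.strip().upper() else 0.0
```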
Data
Questions are sourced from the li-lab/MMLU-ProX HuggingFace dataset, consolidated into a single parquet file for efficient loading via predicate pushdown. Data files are stored on the OpenReward platform.
Tools
Agents are given a single tool:
submit_answer: Submit an answer letter (A through J) for the current question. Returns whether the answer is correct. This tool can only be called once per task.
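A tool definition along these lines could look as follows. This JSON-schema-style dict is a hypothetical illustration; the environment's actual tool schema may differ:

```python
# Hypothetical schema for the submit_answer tool described above.
SUBMIT_ANSWER_TOOL = {
    "name": "submit_answer",
    "description": (
        "Submit an answer letter (A through J) for the current question. "
        "Can only be called once per task."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                # Exactly the ten valid option letters A-J.
                "enum": [chr(ord("A") + i) for i in range(10)],
            },
        },
        "required": ["answer"],
    },
}
```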
Time Horizon
MMLU-ProX is a single-turn environment. The agent receives a question with 10 options and submits one answer. Each task requires exactly one tool call.
Environment Difficulty
Model performance on MMLU-ProX from the original paper (5-shot CoT):
| Model | English | Swahili |
|---|---|---|
| QwQ-32B | 70.7% | 32.8% |
| Qwen2.5-72B | 70.3% | 40.1% |
| Llama3.1-405B | 68.8% | 52.1% |
Performance degrades significantly from high-resource to low-resource languages: in the models above, the English–Swahili gap ranges from roughly 17 percentage points (Llama3.1-405B) to nearly 38 percentage points (QwQ-32B).
Other Environment Requirements
There are no further environment requirements; MMLU-ProX works out of the box with the OpenReward endpoint without any secrets.
Safety
Agents in MMLU-ProX are asked to answer multiple-choice knowledge questions. The environment does not present direct safety risks, as agents only provide letter answers with no access to external systems, tools, or the internet.
Citation
@inproceedings{xuan2025mmluprox,
  title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025},
  url={https://arxiv.org/abs/2503.10497}
}