MMLU-ProX
Description
MMLU-ProX is an environment for evaluating agents on multilingual multiple-choice question answering. It is based on the MMLU-ProX dataset from HuggingFace (li-lab/MMLU-ProX), which extends MMLU-Pro to 29 languages. Each task presents a question with 10 answer options (A through J) across 14+ subject categories. Grading is deterministic via exact match.
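To make the task format concrete, here is a minimal sketch of how a question with 10 lettered options might be rendered into a prompt. The `format_prompt` helper is hypothetical, not the environment's actual rendering code.

```python
def format_prompt(question: str, options: list[str]) -> str:
    """Render a question and its lettered options (A-J) as a single prompt string.

    This is an illustrative format only; the environment's real prompt
    template may differ.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    lines = [question, ""] + [f"{letter}. {text}" for letter, text in zip(letters, options)]
    return "\n".join(lines)

# A task always presents exactly 10 options, labeled A through J.
prompt = format_prompt("What is 2 + 2?", [str(n) for n in range(1, 11)])
```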
Capabilities
- Multilingual multiple-choice question answering across 29 languages
- Knowledge reasoning across 14+ subject categories (mathematics, science, health, business, humanities, etc.)
- Single-turn evaluation with deterministic grading
Compute Requirements
MMLU-ProX extends Environment directly and does not require a sandbox. It has minimal compute requirements.
License
MIT.
Tasks
There are 58 splits (29 languages × 2 split types), named in the format {language}_{split}:
- Validation: 70 examples per language (2,030 total)
- Test: ~11,800 examples per language (~341,011 total)
- Total: 343,041 examples
Languages: af, ar, bn, cs, de, en, es, fr, hi, hu, id, it, ja, ko, mr, ne, pt, ru, sr, sw, te, th, uk, ur, vi, wo, yo, zh, zu.
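The full set of split names follows directly from the language codes above. A short sketch that enumerates them (the `{language}_{split}` naming comes from the format described above; the list literal is just the codes restated):

```python
# The 29 language codes supported by MMLU-ProX.
LANGUAGES = [
    "af", "ar", "bn", "cs", "de", "en", "es", "fr", "hi", "hu",
    "id", "it", "ja", "ko", "mr", "ne", "pt", "ru", "sr", "sw",
    "te", "th", "uk", "ur", "vi", "wo", "yo", "zh", "zu",
]
SPLIT_TYPES = ["validation", "test"]

# 29 languages x 2 split types = 58 splits, e.g. "en_validation", "sw_test".
splits = [f"{lang}_{split}" for lang in LANGUAGES for split in SPLIT_TYPES]
```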
Questions span 14+ subject areas including mathematics, science, health, business, humanities, computer science, law, and more.
Reward Structure
This is a sparse, verifiable reward environment with binary scoring. The agent calls submit_answer once with a letter (A-J). The answer is compared via exact match against the correct answer:
- Correct: Reward 1.0.
- Incorrect: Reward 0.0.
We do not use LLM graders for this task.
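The binary exact-match grading above can be sketched in a few lines. Note the case/whitespace normalization here is an assumption for robustness, not a documented behavior of the environment:

```python
def grade(submitted: str, correct: str) -> float:
    """Deterministic exact-match grading: reward 1.0 for the correct
    letter, 0.0 otherwise.

    Stripping whitespace and upper-casing is an illustrative assumption;
    the environment may compare letters strictly.
    """
    return 1.0 if submitted.strip().upper() == correct.strip().upper() else 0.0
```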
Data
Questions are sourced from the li-lab/MMLU-ProX HuggingFace dataset, consolidated into a single parquet file for efficient loading via predicate pushdown. Data files are stored on the OpenReward platform.
Tools
Agents are given a single tool:
submit_answer: Submit an answer letter (A through J) for the current question. Returns whether the answer is correct. This tool can only be called once per task.
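A tool definition along these lines could look as follows. This JSON-schema-style dict is a hypothetical illustration; the environment's actual tool schema may differ:

```python
# Hypothetical schema for the submit_answer tool described above.
SUBMIT_ANSWER_TOOL = {
    "name": "submit_answer",
    "description": (
        "Submit an answer letter (A through J) for the current question. "
        "Can only be called once per task."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                # Exactly the ten valid option letters A-J.
                "enum": [chr(ord("A") + i) for i in range(10)],
            },
        },
        "required": ["answer"],
    },
}
```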
Time Horizon
MMLU-ProX is a single-turn environment. The agent receives a question with 10 options and submits one answer. Each task requires exactly one tool call.
Environment Difficulty
Model performance on MMLU-ProX from the original paper (5-shot CoT):
| Model | English | Swahili |
|---|---|---|
| QwQ-32B | 70.7% | 32.8% |
| Qwen2.5-72B | 70.3% | 40.1% |
| Llama3.1-405B | 68.8% | 52.1% |
Performance degrades significantly from high-resource to low-resource languages: in the models above, the English–Swahili gap ranges from roughly 17 percentage points (Llama3.1-405B) to nearly 38 percentage points (QwQ-32B).
Other Environment Requirements
There are no further environment requirements; MMLU-ProX works out of the box with the OpenReward endpoint without any secrets.
Safety
Agents in MMLU-ProX are asked to answer multiple-choice knowledge questions. The environment does not present direct safety risks, as agents only provide letter answers with no access to external systems, tools, or the internet.
Citation
@inproceedings{xuan2025mmluprox,
  title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025},
  url={https://arxiv.org/abs/2503.10497}
}