# MMMLU
## Description
MMMLU is an environment for evaluating multilingual massive multitask language understanding. It is built on OpenAI's MMMLU dataset, which provides professional human translations of the MMLU test set into 14 languages. Agents answer 4-option multiple-choice questions (A/B/C/D) across 57 subject categories spanning STEM, the humanities, the social sciences, and professional domains.
## Capabilities
- Multilingual knowledge reasoning across 14 languages
- Multiple-choice question answering across 57 subject categories
- Comprehension of professional human translations (not machine-translated text)
## Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
## License
## Tasks
15 splits:
- 14 language-specific splits (ar_xy, bn_bd, de_de, es_la, fr_fr, hi_in, id_id, it_it, ja_jp, ko_kr, pt_br, sw_ke, yo_ng, zh_cn), each with 14,042 tasks
- 1 combined test split (196,588 tasks)
Total: 196,588 tasks (14 × 14,042); the combined split aggregates the 14 language-specific splits.
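
As a quick sanity check on the counts above, a minimal sketch (the split names and sizes are copied from the list above):

```python
# Language-specific split names, as listed above.
LANG_SPLITS = [
    "ar_xy", "bn_bd", "de_de", "es_la", "fr_fr", "hi_in", "id_id",
    "it_it", "ja_jp", "ko_kr", "pt_br", "sw_ke", "yo_ng", "zh_cn",
]
TASKS_PER_LANG = 14_042

assert len(LANG_SPLITS) == 14
# 14 languages x 14,042 questions = 196,588, matching the combined test split.
assert len(LANG_SPLITS) * TASKS_PER_LANG == 196_588
```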
## Reward Structure
Single-turn evaluation. The agent submits an answer (A, B, C, or D) via the submit_answer tool. The reward is deterministic exact-match grading: 1.0 if the submitted letter matches the reference answer, 0.0 otherwise.
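
A minimal sketch of the grading rule described above; the function name and the whitespace/case normalization are illustrative assumptions, not the platform's actual implementation:

```python
def grade(submitted: str, reference: str) -> float:
    """Exact-match grading: 1.0 if the submitted letter matches the
    reference answer (A/B/C/D), 0.0 otherwise. The strip/upper
    normalization is an assumption, not documented platform behavior."""
    return 1.0 if submitted.strip().upper() == reference.strip().upper() else 0.0

assert grade("B", "B") == 1.0
assert grade("C", "D") == 0.0
```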
## Data
15 Parquet files (~100 MB total), sourced from the Hugging Face dataset openai/MMMLU. The data is stored on the OpenReward platform.
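
For reference, the underlying dataset can be pulled straight from Hugging Face. A minimal sketch, assuming the subset and column names from the openai/MMMLU dataset card (this is not how OpenReward itself serves the data):

```python
# The subset name ("DE_DE") and the column names ("Question", "A".."D",
# "Answer") follow the openai/MMMLU dataset card; treat them as assumptions.
from datasets import load_dataset

ds = load_dataset("openai/MMMLU", "DE_DE", split="test")
row = ds[0]
question = row["Question"]
choices = {letter: row[letter] for letter in "ABCD"}
reference = row["Answer"]  # one of "A", "B", "C", "D"
print(question, choices, reference)
```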
## Tools
- `submit_answer`: Submit an answer choice (A, B, C, or D).
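
A hedged sketch of what a function-calling schema for this tool might look like; the exact schema exposed by the environment is not specified in this document:

```python
# Hypothetical function-calling schema for submit_answer; the actual
# schema served by OpenReward may differ.
SUBMIT_ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Submit an answer choice for the current question.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "enum": ["A", "B", "C", "D"],
                "description": "The chosen answer letter.",
            }
        },
        "required": ["answer"],
    },
}
```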
## Time Horizon
Single-turn.
## Environment Difficulty
OpenAI reports the following accuracies on MMMLU, averaged across the 14 languages:

| Model | Average Accuracy |
|---|---|
| o3-high | 88.8% |
| o1 | 87.7% |
| o4-mini-high | 85.2% |
| gpt-4.5-preview | 85.1% |
| gpt-4.1 | 83.7% |
| gpt-4o | 81.4% |
| gpt-4.1-mini | 78.5% |
| gpt-4o-mini | 70.5% |
## Other Environment Requirements
There are no further environment requirements; MMMLU works out of the box against the OpenReward endpoint and needs no external API keys.
## Safety
Agents in MMMLU answer multilingual multiple-choice questions in a standard environment. The environment does not present direct safety risks.
## Citation

```bibtex
@article{hendrycks2021measuring,
  title={Measuring Massive Multitask Language Understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
```