EMMA
Description
EMMA (Enhanced MultiModal reAsoning) is an environment for evaluating expert-level multimodal reasoning across mathematics, physics, chemistry, and coding. Tasks require integrated visual and textual reasoning that cannot be solved by reasoning independently in each modality. Questions are multiple-choice with embedded images.
Capabilities
- Multimodal reasoning across text and images
- Expert-level problem solving in chemistry, coding, math, and physics
- Interpreting diagrams, equations, charts, and scientific figures
Compute Requirements
This is a single-turn environment with no sandbox.
Tasks
There is one split in this environment:
- Test: 2,788 tasks across four subjects:
  - Chemistry: 1,176
  - Coding: 564
  - Math: 892
  - Physics: 156
Each task presents a multiple-choice question (A–E) with one or more embedded images. The agent must select the correct answer letter.
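For illustration, a task arrives as a single multimodal user message combining the question text, the embedded image(s), and the option list. The sketch below uses OpenAI-style content parts; the field layout is an assumption, not the environment's documented schema.

```python
# Hypothetical shape of a single EMMA task prompt (field names are
# illustrative, not the environment's exact schema).
task_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Which diagram shows the correct reaction mechanism?"},
        {
            "type": "image_url",
            # Images ship as base64-encoded PNGs embedded in the prompt.
            "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."},
        },
        {"type": "text", "text": "(A) ...  (B) ...  (C) ...  (D) ...  (E) ..."},
    ],
}
```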
Reward Structure
This is a single-turn environment with binary reward:
- 1.0 — Correct answer letter selected
- 0.0 — Incorrect answer
Evaluation is deterministic exact match on the answer letter (case-insensitive). No LLM grading is used. The environment accepts flexible input formats (e.g., "A", "A)", "A.").
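A minimal sketch of this grading rule, assuming the flexible formats listed above (the environment's actual parser may differ in detail):

```python
import re

def grade(response: str, answer: str) -> float:
    """Deterministic exact-match grading on the answer letter.

    Accepts flexible formats such as "A", "A)", or "a." and compares
    case-insensitively against the gold letter. Sketch only; the
    environment's actual normalization may differ.
    """
    match = re.match(r"\s*\(?([A-Ea-e])\)?\.?\s*$", response)
    extracted = match.group(1).upper() if match else ""
    return 1.0 if extracted == answer.strip().upper() else 0.0

assert grade("A", "A") == 1.0
assert grade("a)", "A") == 1.0
assert grade("B.", "A") == 0.0
```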
Data
Data is stored as four Parquet files (one per subject: chemistry.parquet, coding.parquet, math.parquet, physics.parquet). Each file contains questions with embedded images (base64-encoded PNG), multiple-choice options, and correct answer letters. The environment uses lazy loading for memory efficiency.
Source: luckychao/EMMA
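A hedged example of reading one subject file and decoding an embedded image. The column names `question`, `answer`, and `image` are assumptions about the schema, not a documented contract:

```python
import base64
import io

import pandas as pd
from PIL import Image

# Load one subject split; column names below are assumptions about the
# schema rather than a documented interface.
df = pd.read_parquet("chemistry.parquet")
row = df.iloc[0]

print(row["question"])  # question text
print(row["answer"])    # gold answer letter, e.g. "C"

# Images are stored as base64-encoded PNG strings.
png_bytes = base64.b64decode(row["image"])
img = Image.open(io.BytesIO(png_bytes))
img.save("example.png")
```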
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit your answer letter (A, B, C, D, or E) for the multiple-choice question. |
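For illustration, a submit_answer call might carry a payload like the following; the argument name is an assumption rather than the environment's documented signature.

```python
# Hypothetical submit_answer payload; the argument name "answer" is an
# assumption, and the exact wire format depends on the model client.
tool_call = {
    "name": "submit_answer",
    "arguments": {"answer": "C"},
}
```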
Time Horizon
EMMA is a single-turn environment. The agent receives a multimodal question and responds with a single submit_answer tool call.
Environment Difficulty
The original paper (ICML 2025 Oral) evaluates state-of-the-art MLLMs on EMMA:
| Model | EMMA-mini | Full EMMA |
|---|---|---|
| o1 | 45.75% | - |
| Gemini 2.0 Flash Thinking | - | 38.06% |
| Claude 3.5 Sonnet | - | 37.23% |
| Human Expert | 77.75% | - |
Human experts outperform the best-performing model by roughly 32 percentage points on EMMA-mini. Chain-of-thought prompting has divergent effects, improving accuracy for closed-source models while reducing it for open-source ones.
Other Environment Requirements
There are no further environment requirements; EMMA works out of the box with the OpenReward endpoint without any secrets.
Safety
This environment evaluates multimodal reasoning on academic problems and does not present direct safety risks.
Citations
@article{hao2025emma,
author = {Yunzhuo Hao and Jiawei Gu and Huichen Will Wang and Linjie Li and Zhengyuan Yang and Lijuan Wang and Yu Cheng},
title = {Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark},
journal = {arXiv preprint arXiv:2501.05444},
year = {2025},
url = {https://arxiv.org/abs/2501.05444}
}