EMMA
Description
EMMA (Enhanced MultiModal reAsoning) is an environment for evaluating expert-level multimodal reasoning across mathematics, physics, chemistry, and coding. Tasks require integrated visual and textual reasoning that cannot be solved by reasoning independently in each modality. Questions are multiple-choice with embedded images.
Capabilities
- Multimodal reasoning across text and images
- Expert-level problem solving in chemistry, coding, math, and physics
- Interpreting diagrams, equations, charts, and scientific figures
Compute Requirements
This is a single-turn environment with no sandbox.
Tasks
There is one split in this environment:
- Test: 2,788 tasks across four subjects:
  - Chemistry: 1,176
  - Coding: 564
  - Math: 892
  - Physics: 156
Each task presents a multiple-choice question (A–E) with one or more embedded images. The agent must select the correct answer letter.
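For illustration, a task arrives as a single multimodal user message combining the question text, the embedded image(s), and the option list. The sketch below uses OpenAI-style content parts; the field layout is an assumption, not the environment's documented schema.

```python
# Hypothetical shape of a single EMMA task prompt (field names are
# illustrative, not the environment's exact schema).
task_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Which diagram shows the correct reaction mechanism?"},
        {
            "type": "image_url",
            # Images ship as base64-encoded PNGs embedded in the prompt.
            "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."},
        },
        {"type": "text", "text": "(A) ...  (B) ...  (C) ...  (D) ...  (E) ..."},
    ],
}
```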
Reward Structure
This is a single-turn environment with binary reward:
- 1.0 — Correct answer letter selected
- 0.0 — Incorrect answer
Evaluation is deterministic exact match on the answer letter (case-insensitive). No LLM grading is used. The environment accepts flexible input formats (e.g., "A", "A)", "A.").
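A minimal sketch of this grading rule, assuming the flexible formats listed above (the environment's actual parser may differ in detail):

```python
import re

def grade(response: str, answer: str) -> float:
    """Deterministic exact-match grading on the answer letter.

    Accepts flexible formats such as "A", "A)", or "a." and compares
    case-insensitively against the gold letter. Sketch only; the
    environment's actual normalization may differ.
    """
    match = re.match(r"\s*\(?([A-Ea-e])\)?\.?\s*$", response)
    extracted = match.group(1).upper() if match else ""
    return 1.0 if extracted == answer.strip().upper() else 0.0

assert grade("A", "A") == 1.0
assert grade("a)", "A") == 1.0
assert grade("B.", "A") == 0.0
```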
Data
Data is stored as four Parquet files (one per subject: chemistry.parquet, coding.parquet, math.parquet, physics.parquet). Each file contains questions with embedded images (base64-encoded PNG), multiple-choice options, and correct answer letters. The environment uses lazy loading for memory efficiency.
Source: luckychao/EMMA
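A hedged example of reading one subject file and decoding an embedded image. The column names `question`, `answer`, and `image` are assumptions about the schema, not a documented contract:

```python
import base64
import io

import pandas as pd
from PIL import Image

# Load one subject split; column names below are assumptions about the
# schema rather than a documented interface.
df = pd.read_parquet("chemistry.parquet")
row = df.iloc[0]

print(row["question"])  # question text
print(row["answer"])    # gold answer letter, e.g. "C"

# Images are stored as base64-encoded PNG strings.
png_bytes = base64.b64decode(row["image"])
img = Image.open(io.BytesIO(png_bytes))
img.save("example.png")
```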
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit your answer letter (A, B, C, D, or E) for the multiple-choice question. |
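For illustration, a submit_answer call might carry a payload like the following; the argument name is an assumption rather than the environment's documented signature.

```python
# Hypothetical submit_answer payload; the argument name "answer" is an
# assumption, and the exact wire format depends on the model client.
tool_call = {
    "name": "submit_answer",
    "arguments": {"answer": "C"},
}
```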
Time Horizon
EMMA is a single-turn environment. The agent receives a multimodal question and responds with a single submit_answer tool call.
Environment Difficulty
The original paper (ICML 2025 Oral) evaluates state-of-the-art MLLMs on EMMA:
| Model | EMMA-mini | Full EMMA |
|---|---|---|
| o1 | 45.75% | - |
| Gemini 2.0 Flash Thinking | - | 38.06% |
| Claude 3.5 Sonnet | - | 37.23% |
| Human Expert | 77.75% | - |
Human experts outperform the best-performing model by roughly 32 percentage points on EMMA-mini. Chain-of-thought prompting has divergent effects, improving accuracy for closed-source models while reducing it for open-source ones.
Other Environment Requirements
There are no further environment requirements; EMMA works out of the box with the OpenReward endpoint without any secrets.
Safety
This environment evaluates multimodal reasoning on academic problems and does not present direct safety risks.
Citations
@article{hao2025emma,
author = {Yunzhuo Hao and Jiawei Gu and Huichen Will Wang and Linjie Li and Zhengyuan Yang and Lijuan Wang and Yu Cheng},
title = {Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark},
journal = {arXiv preprint arXiv:2501.05444},
year = {2025},
url = {https://arxiv.org/abs/2501.05444}
}