EMMA

Description

EMMA (Enhanced MultiModal ReAsoning) is an environment for evaluating expert-level multimodal reasoning across mathematics, physics, chemistry, and coding. Tasks require integrated visual and textual reasoning and cannot be solved by reasoning over each modality independently. Questions are multiple-choice with embedded images.

Capabilities

  • Multimodal reasoning across text and images
  • Expert-level problem solving in chemistry, coding, math, and physics
  • Interpreting diagrams, equations, charts, and scientific figures

Compute Requirements

This is a single-turn environment with no sandbox; compute requirements are minimal.

Tasks

There is one split in this environment:

  • Test: 2,788 tasks across four subjects:
    • Chemistry: 1,176
    • Coding: 564
    • Math: 892
    • Physics: 156

Each task presents a multiple-choice question (A–E) with one or more embedded images. The agent must select the correct answer letter.

Reward Structure

This is a single-turn environment with binary reward:

  • 1.0 — Correct answer letter selected
  • 0.0 — Incorrect answer

Evaluation is deterministic exact match on the answer letter (case-insensitive). No LLM grading is used. The environment accepts flexible input formats (e.g., "A", "A)", "A.").
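
As a rough illustration, the grading could be implemented along the lines of the sketch below. This is a minimal sketch of the exact-match logic described above; the function name and the exact normalization regex are illustrative assumptions, not the environment's actual code.

```python
import re

def grade_answer(response: str, correct: str) -> float:
    """Binary reward: 1.0 for the correct letter, 0.0 otherwise.

    Illustrative only: accepts "A", "a", "A)", "A." and similar,
    mirroring the flexible input formats described above.
    """
    # Pull out a single letter A-E, ignoring case and trailing ")" "." ":"
    match = re.match(r"\s*([A-Ea-e])\s*[).:]?\s*$", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).upper() == correct.upper() else 0.0

assert grade_answer("a)", "A") == 1.0
assert grade_answer("B.", "A") == 0.0
```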

Data

Data is stored as four Parquet files (one per subject: chemistry.parquet, coding.parquet, math.parquet, physics.parquet). Each file contains questions with embedded images (base64-encoded PNG), multiple-choice options, and correct answer letters. The environment uses lazy loading for memory efficiency.

Source: luckychao/EMMA
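
A hedged sketch of how one subject file could be inspected locally; the column names ("question", "options", "answer", "images") are assumptions for illustration and should be checked against the actual Parquet schema:

```python
import base64
import io

import pandas as pd
from PIL import Image

# Load one subject file. pandas reads the Parquet eagerly here, whereas
# the environment itself lazy-loads rows for memory efficiency.
df = pd.read_parquet("chemistry.parquet")

row = df.iloc[0]
print(row["question"])   # question text (assumed column name)
print(row["options"])    # multiple-choice options (assumed column name)
print(row["answer"])     # correct letter, e.g. "B" (assumed column name)

# Images are stored as base64-encoded PNG; decode the first one.
img = Image.open(io.BytesIO(base64.b64decode(row["images"][0])))
print(img.size)
```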

Tools

Tool           Description
submit_answer  Submit your answer letter (A, B, C, D, or E) for the multiple-choice question.
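
For orientation, the tool might be described to a model with an OpenAI-style function-calling schema like the sketch below; the exact wire format is defined by the OpenReward endpoint, not by this README:

```python
# Hypothetical schema for the single submit_answer tool; the field
# layout follows the common OpenAI function-calling convention.
submit_answer_tool = {
    "type": "function",
    "function": {
        "name": "submit_answer",
        "description": "Submit your answer letter for the multiple-choice question.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "enum": ["A", "B", "C", "D", "E"],
                    "description": "The chosen answer letter.",
                }
            },
            "required": ["answer"],
        },
    },
}
```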

Time Horizon

EMMA is a single-turn environment. The agent receives a multimodal question and submits a single answer letter via one tool call.

Environment Difficulty

The original paper (ICML 2025 Oral) evaluates state-of-the-art MLLMs on EMMA:

Model                      EMMA-mini  Full EMMA
o1                         45.75%     -
Gemini 2.0 Flash Thinking  -          38.06%
Claude 3.5 Sonnet          -          37.23%
Human Expert               77.75%     -

Human experts outperform all models by at least 32 percentage points (77.75% vs. 45.75% for o1 on EMMA-mini). Chain-of-thought prompting has divergent effects: it improves closed-source models while reducing accuracy for open-source models.

Other Environment Requirements

There are no further environment requirements; EMMA works out of the box with the OpenReward endpoint without any secrets.

Safety

This environment evaluates multimodal reasoning on academic problems and does not present direct safety risks.

Citations

@article{hao2025emma,
  author    = {Yunzhuo Hao and Jiawei Gu and Huichen Will Wang and Linjie Li and Zhengyuan Yang and Lijuan Wang and Yu Cheng},
  title     = {Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark},
  journal   = {arXiv preprint arXiv:2501.05444},
  year      = {2025},
  url       = {https://arxiv.org/abs/2501.05444}
}