MathVision

Description

MathVision (MATH-V) is an environment for evaluating multimodal mathematical reasoning. It contains 3,040 high-quality problems drawn from real math competitions, spanning 16 mathematical disciplines and 5 difficulty levels; each problem pairs a mathematical question with a diagram, chart, or plot that must be understood visually. Problems come in both multiple-choice and open-ended formats.

Capabilities

  • Multimodal mathematical reasoning with visual contexts
  • Understanding geometry diagrams, function plots, and charts
  • Solving competition-level math problems
  • Algebraic manipulation and numerical computation

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT License.

Tasks

There are two splits in this environment:

  • test: 3,040 tasks
  • testmini: 304 tasks (quick testing subset)

Tasks span 16 mathematical disciplines (geometry, algebra, combinatorics, etc.) across 5 difficulty levels. Each task includes a visual context (diagram, chart, or plot) paired with a mathematical question.

Reward Structure

This is a single-turn environment. The agent submits an answer via the submit_answer tool. Multiple-choice questions are scored by exact letter matching after normalizing common answer formats (e.g., "A", "Option A", "The answer is A"). Open-ended questions are graded by an LLM (gpt-5-mini) that checks mathematical equivalence. Reward is binary: 1.0 if correct, 0.0 if incorrect.
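
As a concrete illustration, here is a minimal sketch of the multiple-choice scoring described above. The function name and regex are illustrative only; they are not the environment's actual grader.

import re

def score_multiple_choice(submission: str, answer: str) -> float:
    """Binary reward: extract a standalone choice letter from common
    answer formats and compare it to the gold letter."""
    match = re.search(r"\b([A-E])\b", submission.upper())
    return 1.0 if match and match.group(1) == answer.upper() else 0.0

# Each of these scores 1.0 against the gold answer "A":
for text in ["A", "Option A", "The answer is A"]:
    print(text, "->", score_multiple_choice(text, "A"))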

Data

Data consists of Parquet files (mathvision_test.parquet, mathvision_testmini.parquet) sourced from the Hugging Face dataset MathLLMs/MathVision. Each row contains a question, an image, an answer, a difficulty level, and a subject. The data is stored on the OpenReward platform.
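
For example, the splits can be inspected locally with pandas; the column names in the comments follow the row schema described above and may differ slightly in the actual files.

import pandas as pd

test = pd.read_parquet("mathvision_test.parquet")
mini = pd.read_parquet("mathvision_testmini.parquet")

print(len(test), len(mini))       # expected: 3040 and 304
print(test.columns.tolist())      # question, image, answer, level, subject (approximate)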

Tools

  • submit_answer: Submit your final answer (a letter for multiple-choice, or a value for open-ended). Ends the episode.
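
A call to this tool might look like the sketch below; the payload shape is hypothetical, and the real schema is defined by the OpenReward platform.

# Hypothetical tool-call payload, shown for orientation only.
tool_call = {
    "name": "submit_answer",
    "arguments": {"answer": "C"},  # letter for multiple-choice, value for open-ended
}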

Time Horizon

Single-turn. The agent views the image and question, then submits one answer.

Environment Difficulty

Problems span 5 difficulty levels from AMC-8 through Olympiad, requiring both visual parsing of diagrams and multi-step mathematical reasoning. Even strong multimodal models leave significant room for improvement:

  • Gemini-2.5-Pro: 67.5%
  • GPT-5.2: 63.9%
  • Claude Opus 4.5: 54.2%
  • GPT-4o: 30.4%

Other Environment Requirements

An OpenAI API key is required for LLM-based grading of open-ended questions. Pass it via secrets={"openai_api_key": "..."}.
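
For example, the key can be read from a conventional environment variable rather than hard-coded (the variable name OPENAI_API_KEY is a common convention, not mandated here):

import os

# Build the secrets mapping in the shape shown above.
secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}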

Safety

Agents in MathVision solve multimodal math problems in a standard environment. The environment does not present direct safety risks.

Citation

@article{wang2024mathvision,
  title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset},
  author={Wang, Ke and Pan, Junting and Shi, Weikang and Lu, Zimu and Zhan, Mingjie and Li, Hongsheng},
  journal={arXiv preprint arXiv:2402.14804},
  year={2024}
}