MathVision
Description
MathVision (MATH-V) is an environment for evaluating multimodal mathematical reasoning. It contains 3,040 high-quality problems drawn from real math competitions, each pairing a visual context (diagram, chart, or plot) with a question, spanning 16 mathematical disciplines and 5 difficulty levels. Problems come in both multiple-choice and open-ended formats.
Capabilities
- Multimodal mathematical reasoning with visual contexts
- Understanding geometry diagrams, function plots, and charts
- Solving competition-level math problems
- Algebraic manipulation and numerical computation
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
Tasks
There are two splits in this environment:
- test: 3,040 tasks
- testmini: 304 tasks (quick testing subset)
Tasks span 16 mathematical disciplines (geometry, algebra, combinatorics, etc.) across 5 difficulty levels. Each task includes a visual context (diagram, chart, or plot) paired with a mathematical question.
Reward Structure
This is a single-turn environment. The agent submits an answer via the submit_answer tool. Multiple-choice questions are graded by exact letter matching, tolerant of common format variations (e.g., "A", "Option A", "The answer is A"). Open-ended questions are graded by an LLM judge (gpt-5-mini) that checks mathematical equivalence. Reward is binary: 1.0 if correct, 0.0 otherwise.
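To make the multiple-choice side of this scheme concrete, below is a minimal Python sketch of letter extraction that tolerates the format variations listed above. The function names and the regular expression are illustrative assumptions, not the environment's actual grader:

```python
import re

def extract_choice_letter(answer: str) -> str | None:
    """Pull a single option letter (A-E) out of a free-form answer.

    Accepts variations such as "A", "(A)", "Option A", or
    "The answer is A". Returns None if no letter is found.
    """
    text = answer.strip()
    # Fast path: the answer is already a bare letter.
    if re.fullmatch(r"[A-Ea-e]", text):
        return text.upper()
    # Otherwise look for a trailing letter after common phrasings.
    match = re.search(
        r"(?:answer is|option|choice)?\s*\(?\b([A-Ea-e])\b\)?\s*\.?\s*$",
        text,
        flags=re.IGNORECASE,
    )
    return match.group(1).upper() if match else None

def grade_multiple_choice(submitted: str, gold: str) -> float:
    """Binary reward: 1.0 on an exact letter match, 0.0 otherwise."""
    return 1.0 if extract_choice_letter(submitted) == gold.strip().upper() else 0.0
```

Open-ended answers skip this path entirely and go to the LLM judge, since strings like "3/4" and "0.75" can be mathematically equivalent without being textually equal.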
Data
Data consists of Parquet files (mathvision_test.parquet, mathvision_testmini.parquet) sourced from the HuggingFace dataset MathLLMs/MathVision. Each row contains a question, an image, an answer, a difficulty level, and a subject. Data is stored on the OpenReward platform.
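For local inspection, assuming the Parquet files have been downloaded and that the columns mirror the fields listed above (the exact column names are an assumption), a quick look with pandas might be:

```python
import pandas as pd

# Quick-testing subset; file name taken from the description above.
df = pd.read_parquet("mathvision_testmini.parquet")

print(len(df))               # expected: 304 rows
print(df.columns.tolist())   # assumed: question, image, answer, level, subject

# Distribution of tasks across the 16 disciplines and 5 difficulty levels.
# "subject" and "level" are assumed column names for those fields.
print(df.groupby(["subject", "level"]).size())
```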
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit your final answer (letter for multiple-choice, or value for open-ended). Ends the episode. |
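As a usage sketch, an episode reduces to a single tool call. Only the tool name and the letter-vs-value convention come from the table above; the argument shape is an assumption:

```python
# Hypothetical shape of the one tool call that ends an episode.
# Only the tool name "submit_answer" is taken from the table above;
# the "answer" argument name is an assumption.
multiple_choice_call = {
    "tool": "submit_answer",
    "arguments": {"answer": "B"},    # option letter for multiple-choice
}

open_ended_call = {
    "tool": "submit_answer",
    "arguments": {"answer": "3/4"},  # value; the LLM judge checks equivalence
}
```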
Time Horizon
Single-turn. The agent views the image and question, then submits one answer.
Environment Difficulty
Problems span 5 difficulty levels from AMC-8 through Olympiad, requiring both visual parsing of diagrams and multi-step mathematical reasoning. Even strong multimodal models leave significant room for improvement:
| Model | Accuracy |
|---|---|
| Gemini-2.5-Pro | 67.5% |
| GPT-5.2 | 63.9% |
| Claude Opus 4.5 | 54.2% |
| GPT-4o | 30.4% |
Other Environment Requirements
An OpenAI API key is required for LLM-based grading of open-ended questions. Pass it via `secrets={"openai_api_key": "..."}`.
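For instance, the secrets mapping can be built from an environment variable so the key is never hard-coded (only the `secrets` dictionary shape and key name come from the requirement above):

```python
import os

# Key name taken from the requirement above; reading it from an
# environment variable avoids hard-coding the secret in source.
secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}
```

The resulting mapping is then passed to whatever loader the OpenReward platform exposes for this environment.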
Safety
Agents in MathVision solve multimodal math problems in a standard environment. The environment does not present direct safety risks.
Citation
@article{wang2024mathvision,
  title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset},
  author={Wang, Ke and Pan, Junting and Shi, Weikang and Lu, Zimu and Zhan, Mingjie and Li, Hongsheng},
  journal={arXiv preprint arXiv:2402.14804},
  year={2024}
}