MathCanvas
Description
MathCanvas is an environment for evaluating Visual Chain-of-Thought (VCoT) capabilities on multimodal mathematical reasoning tasks. It contains 3,079 problems with interleaved text and images (diagrams, graphs, geometric figures) across 8 mathematical domains. The benchmark tests models' ability to reason about visual mathematical content spanning high school to undergraduate level.
Capabilities
- Multimodal mathematical reasoning with interleaved text and images
- Visual Chain-of-Thought evaluation across geometry, calculus, algebra, and more
- GPT-based flexible answer grading for equivalent mathematical expressions
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
There is one split in this environment:
- test: 3,079 tasks
Tasks span 8 mathematical domains:
| Domain | Description |
|---|---|
| Algebra | Algebraic manipulation and equations |
| Analytic Geometry | Coordinate geometry and curves |
| Calculus & Vector | Differentiation, integration, vectors |
| Plane Geometry | 2D geometric reasoning |
| Solid Geometry | 3D spatial reasoning |
| Statistics | Probability and data analysis |
| Transformational Geometry | Geometric transformations |
| Trigonometry | Trigonometric functions and identities |
Reward Structure
Single-turn evaluation with LLM-graded rewards. The agent submits an answer via the submit_answer tool. The answer is graded by gpt-5-mini, which checks mathematical equivalence across different representations (fractions, decimals, equivalent expressions). Reward is 1.0 if the answer is correct and 0.0 otherwise.
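A minimal sketch of how an LLM-based equivalence grader could look, assuming the OpenAI Python client; the prompt wording and verdict parsing are illustrative and not the environment's actual grading code.

```python
# Illustrative sketch of an LLM-based equivalence grader (not the environment's
# actual implementation). Assumes the OpenAI Python client and an OPENAI_API_KEY
# available in the environment.
from openai import OpenAI

client = OpenAI()

def grade_answer(question: str, reference: str, submitted: str) -> float:
    """Return 1.0 if the submitted answer is mathematically equivalent to the reference."""
    prompt = (
        "You are grading a math answer. Reply with exactly YES or NO.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Submitted answer: {submitted}\n"
        "Are the two answers mathematically equivalent (e.g. 1/2 vs 0.5)?"
    )
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0
```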
Data
test.parquet (327 MB, 3,079 problems), sourced from the HuggingFace dataset shiwk24/MathCanvas-Bench and stored on the OpenReward platform.
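For local inspection, the benchmark can be pulled directly from the HuggingFace Hub; the snippet below only prints the split size and field names, since the exact schema is not documented here.

```python
# Sketch: load the benchmark from the HuggingFace Hub for local inspection.
from datasets import load_dataset

ds = load_dataset("shiwk24/MathCanvas-Bench", split="test")
print(len(ds))          # expected: 3079 problems
example = ds[0]
print(example.keys())   # inspect the actual schema before relying on field names
```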
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit a mathematical answer. LLM-graded for mathematical equivalence. Ends the episode. |
Time Horizon
Single-turn. The agent reads the multimodal problem (text and images) and submits one answer.
Environment Difficulty
MathCanvas evaluates multimodal mathematical reasoning with visual chain-of-thought:
| Model | Weighted Score |
|---|---|
| Gemini-2.5-Pro | 69.9% |
| GPT-5 | 66.5% |
| Gemini-2.5-Flash | 64.6% |
| Seed-1.6-Thinking | 60.7% |
| GLM-4.5V | 59.8% |
| Qwen3-VL-Plus | 58.9% |
| Qwen-2.5-VL-72B | 48.9% |
| Claude-Sonnet-4 | 47.6% |
Even frontier multimodal models score below 70%, underscoring the difficulty of visual mathematical reasoning.
Other Environment Requirements
An OpenAI API key is required for LLM-based grading. Pass it via secrets={"openai_api_key": "..."} when creating a session, as sketched below.
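A hedged sketch of passing the key when opening a session; the client object and create_session call are placeholders for whatever the OpenReward SDK actually exposes, and only the secrets dictionary shape comes from this document.

```python
# Hypothetical sketch: the session-creation call below is a placeholder, not a
# documented OpenReward API. The known requirement is the secrets dictionary.
import os

secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}

# e.g. session = client.create_session("MathCanvas", secrets=secrets)
```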
Safety
Agents in MathCanvas solve multimodal mathematics problems in a standard environment. The environment does not present direct safety risks.
Citation
@misc{shi2025mathcanvasintrinsicvisualchainofthought,
title={MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning},
author={Weikang Shi and Aldrich Yu and Rongyao Fang and Houxing Ren and Ke Wang and Aojun Zhou and Changyao Tian and Xinyu Fu and Yuxuan Hu and Zimu Lu and Linjiang Huang and Si Liu and Rui Liu and Hongsheng Li},
year={2025},
eprint={2510.14958},
archivePrefix={arXiv}
}