MathCanvas
Description
MathCanvas is an environment for evaluating Visual Chain-of-Thought (VCoT) capabilities on multimodal mathematical reasoning tasks. It contains 3,079 problems with interleaved text and images (diagrams, graphs, geometric figures) across 8 mathematical domains. The benchmark tests models' ability to reason about visual mathematical content spanning high school to undergraduate level.
Capabilities
- Multimodal mathematical reasoning with interleaved text and images
- Visual Chain-of-Thought evaluation across geometry, calculus, algebra, and more
- GPT-based flexible answer grading for equivalent mathematical expressions
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
Tasks
There is one split in this environment:
- test: 3,079 tasks
Tasks span 8 mathematical domains:
| Domain | Description |
|---|---|
| Algebra | Algebraic manipulation and equations |
| Analytic Geometry | Coordinate geometry and curves |
| Calculus & Vector | Differentiation, integration, vectors |
| Plane Geometry | 2D geometric reasoning |
| Solid Geometry | 3D spatial reasoning |
| Statistics | Probability and data analysis |
| Transformational Geometry | Geometric transformations |
| Trigonometry | Trigonometric functions and identities |
Reward Structure
Single-turn evaluation with LLM-graded rewards. The agent submits an answer via the submit_answer tool. The answer is graded by gpt-5-mini, which checks mathematical equivalence across different representations (fractions, decimals, equivalent expressions). Reward is 1.0 if the answer is correct and 0.0 otherwise.
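A minimal sketch of how an LLM-based equivalence grader could look, assuming the OpenAI Python client; the prompt wording and verdict parsing are illustrative and not the environment's actual grading code.

```python
# Illustrative sketch of an LLM-based equivalence grader (not the environment's
# actual implementation). Assumes the OpenAI Python client and an OPENAI_API_KEY
# available in the environment.
from openai import OpenAI

client = OpenAI()

def grade_answer(question: str, reference: str, submitted: str) -> float:
    """Return 1.0 if the submitted answer is mathematically equivalent to the reference."""
    prompt = (
        "You are grading a math answer. Reply with exactly YES or NO.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Submitted answer: {submitted}\n"
        "Are the two answers mathematically equivalent (e.g. 1/2 vs 0.5)?"
    )
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0
```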
Data
test.parquet (327 MB, 3,079 problems), sourced from the HuggingFace dataset shiwk24/MathCanvas-Bench and stored on the OpenReward platform.
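For local inspection, the benchmark can be pulled directly from the HuggingFace Hub; the snippet below only prints the split size and field names, since the exact schema is not documented here.

```python
# Sketch: load the benchmark from the HuggingFace Hub for local inspection.
from datasets import load_dataset

ds = load_dataset("shiwk24/MathCanvas-Bench", split="test")
print(len(ds))          # expected: 3079 problems
example = ds[0]
print(example.keys())   # inspect the actual schema before relying on field names
```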
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit a mathematical answer. LLM-graded for mathematical equivalence. Ends the episode. |
Time Horizon
Single-turn. The agent reads the multimodal problem (text and images) and submits one answer.
Environment Difficulty
MathCanvas evaluates multimodal mathematical reasoning with visual chain-of-thought:
| Model | Weighted Score |
|---|---|
| Gemini-2.5-Pro | 69.9% |
| GPT-5 | 66.5% |
| Gemini-2.5-Flash | 64.6% |
| Seed-1.6-Thinking | 60.7% |
| GLM-4.5V | 59.8% |
| Qwen3-VL-Plus | 58.9% |
| Qwen-2.5-VL-72B | 48.9% |
| Claude-Sonnet-4 | 47.6% |
Even frontier multimodal models score below 70%, underscoring the difficulty of visual mathematical reasoning.
Other Environment Requirements
An OpenAI API key is required for LLM-based grading. Pass it via secrets={"openai_api_key": "..."} when creating a session, as sketched below.
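A hedged sketch of passing the key when opening a session; the client object and create_session call are placeholders for whatever the OpenReward SDK actually exposes, and only the secrets dictionary shape comes from this document.

```python
# Hypothetical sketch: the session-creation call below is a placeholder, not a
# documented OpenReward API. The known requirement is the secrets dictionary.
import os

secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}

# e.g. session = client.create_session("MathCanvas", secrets=secrets)
```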
Safety
Agents in MathCanvas solve multimodal mathematics problems in a standard environment. The environment does not present direct safety risks.
Citation
@misc{shi2025mathcanvasintrinsicvisualchainofthought,
title={MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning},
author={Weikang Shi and Aldrich Yu and Rongyao Fang and Houxing Ren and Ke Wang and Aojun Zhou and Changyao Tian and Xinyu Fu and Yuxuan Hu and Zimu Lu and Linjiang Huang and Si Liu and Rui Liu and Hongsheng Li},
year={2025},
eprint={2510.14958},
archivePrefix={arXiv}
}