MATH
Description
MATH is an environment for evaluating mathematical reasoning across algebra, geometry, number theory, counting & probability, and precalculus. Based on the MATH dataset from Hendrycks et al., agents solve competition-level mathematics problems with optional Python code execution. Answer verification uses the math-verify library for semantic equivalence checking.
Capabilities
- Competition-level mathematical problem solving
- Multi-step mathematical reasoning
- Python code execution for computational assistance
- Support for LaTeX answer formatting
- Symbolic answer verification
Compute Requirements
The Math environment provides a sandbox with Python code execution (0.5 CPU, 1GB RAM). The MathNoCode variant requires no sandbox.
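The sandbox limits above can be summarized in a small configuration sketch. The key names and schema here are illustrative assumptions, not the environment's actual configuration format; only the resource values (0.5 CPU, 1 GB RAM) come from the text.

```python
# Hypothetical sandbox resource configuration for the Math environment.
# Key names are assumptions; the values reflect the documented limits.
SANDBOX_CONFIG = {
    "cpu": 0.5,         # fractional CPU cores
    "memory_mb": 1024,  # 1 GB RAM
    "network": False,   # assumption: the sandbox is isolated
}

# The MathNoCode variant requires no sandbox at all.
NO_CODE_SANDBOX = None
```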
License
Tasks
There are two splits in this environment:
- train: 7,500 training problems
- test: 5,000 test problems
Problems span seven categories: Algebra, Intermediate Algebra, Prealgebra, Geometry, Number Theory, Counting & Probability, and Precalculus. Each problem has a difficulty level from 1 (easiest) to 5 (hardest).
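Selecting a subset of problems by level or category can be sketched as follows. The field names ("level", "type") follow the MATH dataset's published schema; the helper function itself is an illustrative assumption, not part of the environment's API.

```python
# Sketch: filter problems by difficulty level and/or category, assuming
# each problem is a dict with "level" and "type" fields as in the
# MATH dataset schema.
def filter_problems(problems, level=None, category=None):
    """Return the problems matching the given level and/or category."""
    selected = []
    for p in problems:
        if level is not None and p["level"] != level:
            continue
        if category is not None and p["type"] != category:
            continue
        selected.append(p)
    return selected

sample = [
    {"problem": "...", "level": 5, "type": "Algebra"},
    {"problem": "...", "level": 1, "type": "Geometry"},
]
hardest = filter_problems(sample, level=5)  # only the Level 5 problem
```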
Reward Structure
This is a sparse, verifiable reward environment. The agent calls the answer tool to submit a solution:
- 1.0: Answer is mathematically equivalent to the reference solution
- 0.0: Answer is incorrect
Answer verification uses the math-verify library to check semantic equivalence, handling LaTeX formatting and mathematical expressions.
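The reward mapping above can be sketched as a simple function. The equivalence check here is a naive numeric stand-in for the math-verify comparison (which additionally handles LaTeX and symbolic expressions); the function names are assumptions for illustration.

```python
# Sketch of the sparse 1.0 / 0.0 reward. `is_equivalent` is a naive
# stand-in for math-verify's semantic equivalence check: it compares
# answers as exact rationals, falling back to string equality.
from fractions import Fraction

def is_equivalent(submitted: str, reference: str) -> bool:
    try:
        return Fraction(submitted) == Fraction(reference)
    except ValueError:
        return submitted.strip() == reference.strip()

def reward(submitted: str, reference: str) -> float:
    # 1.0 if the answer is equivalent to the reference, else 0.0.
    return 1.0 if is_equivalent(submitted, reference) else 0.0
```

Note that an exact-rational comparison already treats "1/2" and "0.5" as the same answer, which is the spirit of semantic (rather than string) verification.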
Data
Data is sourced from the DigitalLearningGmbH/MATH-lighteval HuggingFace dataset.
Tools
| Tool | Description |
|---|---|
| answer | Submit final answer for verification |
| execute_code | Execute Python code (Math environment only) |
Time Horizon
Single-turn for MathNoCode, multi-turn for Math (with code execution).
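The two interaction patterns can be sketched as one episode loop. The tool names match the table above; everything else (the agent interface, the ToolCall shape, the sandbox function) is a hypothetical interface assumed for illustration, not the environment's actual API.

```python
# Sketch of an episode loop: single-turn when code execution is
# disabled (MathNoCode), multi-turn tool use otherwise (Math).
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

def sandbox_execute(code: str) -> str:
    # Hypothetical stand-in for the isolated Python sandbox.
    return f"[sandbox ran {len(code)} chars of code]"

def run_episode(agent, problem, allow_code=True):
    """`agent` is assumed to expose observe() and next_tool_call()."""
    agent.observe(problem)
    while True:
        call = agent.next_tool_call()
        if call.name == "answer":
            # Submitting an answer ends the episode.
            return call.arguments["answer"]
        if allow_code and call.name == "execute_code":
            agent.observe(sandbox_execute(call.arguments["code"]))
        else:
            agent.observe("error: tool not available")
```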
Environment Difficulty
MATH Level 5 (hardest problems):
| Model | Accuracy |
|---|---|
| GPT-5 (high) | 98.1% |
| GPT-5 (medium) | 97.9% |
| o4-mini (high) | 97.8% |
| o3 (high) | 97.8% |
| Claude Sonnet 4.5 | 97.7% |
Other Environment Requirements
There are no further environment requirements; MATH works out of the box with the OpenReward endpoint.
Safety
Agents in MATH solve mathematical problems with optional code execution in an isolated sandbox. The environment does not present direct safety risks.
Citation
@article{hendrycks2021measuring,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob},
journal={NeurIPS},
year={2021}
}