# PolyMath
## Description
PolyMath is an environment for evaluating multilingual mathematical reasoning. It covers 18 languages across 4 difficulty levels (low, medium, high, top), with 125 problems per language at each level, totaling 9,000 high-quality problems. Problems range from K-12 to Olympiad and advanced frontier mathematics, with translations calibrated by language experts.
## Capabilities
- Solving mathematical problems across a broad difficulty range
- Reasoning in 18 languages: Arabic, Bengali, German, English, Spanish, French, Indonesian, Italian, Japanese, Korean, Malay, Portuguese, Russian, Swahili, Telugu, Thai, Vietnamese, Chinese
- Handling mathematical notation and expressions
## Compute Requirements
This is a single-turn environment with no sandbox.
## Tasks
There is one split in this environment:
- Test: 9,000 problems (18 languages × 4 difficulty levels × 125 problems)
Each task presents a math problem in a specific language at a specific difficulty level. The agent must provide a final answer.
## Reward Structure
This is a single-turn environment with binary reward:
- 1.0 — Correct answer
- 0.0 — Incorrect answer
Evaluation uses the math_verify library for mathematical equivalence checking, with fallback to normalized string comparison on parsing errors. No LLM grading is used — evaluation is deterministic.
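The two-stage grading scheme above can be sketched as follows. The `parse`/`verify` calls follow the public Math-Verify API, but the normalization rules in the fallback path are illustrative assumptions, not the environment's exact implementation:

```python
import re

def normalize(ans: str) -> str:
    """Fallback normalization (assumed rules): strip \\boxed{}, $ delimiters,
    surrounding whitespace, and a trailing period."""
    s = ans.strip()
    m = re.search(r"\\boxed\{(.*)\}", s)
    if m:
        s = m.group(1)
    s = s.strip().strip("$").strip().rstrip(".")
    return s.replace(" ", "")

def grade(prediction: str, gold: str) -> float:
    """Binary reward: math_verify equivalence, with a string-comparison fallback."""
    try:
        from math_verify import parse, verify  # pip install math-verify
        return 1.0 if verify(parse(gold), parse(prediction)) else 0.0
    except Exception:
        # Parsing error (or library unavailable): normalized string comparison.
        return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

Because grading is deterministic, two runs over the same answers always produce identical rewards; there is no LLM judge in the loop.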
## Data
Data is organized as 72 Parquet files (one per language-difficulty combination, e.g., en_high_0000.parquet). Each file contains 125 problems with question text, ground truth answer, language code, and difficulty level.
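The shard layout can be enumerated directly. This is a minimal sketch: the language codes are assumed to be the ISO-639-1 codes for the 18 listed languages, the filename pattern is generalized from the one example shard (`en_high_0000.parquet`), and the column names in the commented read are assumptions:

```python
# Enumerate the 72 expected Parquet shards (18 languages x 4 difficulty levels).
LANGUAGES = ["ar", "bn", "de", "en", "es", "fr", "id", "it", "ja",
             "ko", "ms", "pt", "ru", "sw", "te", "th", "vi", "zh"]
LEVELS = ["low", "medium", "high", "top"]

shards = [f"{lang}_{level}_0000.parquet" for lang in LANGUAGES for level in LEVELS]

# Reading one shard requires pandas + pyarrow; column names here are assumed:
# import pandas as pd
# df = pd.read_parquet("en_high_0000.parquet")
# assert len(df) == 125  # 125 problems per language-difficulty pair
```

Iterating over `shards` visits every language-difficulty combination exactly once, matching the 18 × 4 × 125 = 9,000 problem count.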
Source: Qwen/PolyMath
## Tools
| Tool | Description |
|---|---|
| answer | Submit your final answer to the math problem. Evaluated for mathematical equivalence using math_verify. |
## Time Horizon
PolyMath is a single-turn environment: the agent receives a math problem and submits a single answer via one call to the answer tool.
## Environment Difficulty
The original paper (NeurIPS 2025 D&B Track) evaluates 44 models using Difficulty-Weighted Accuracy (DW-ACC); selected results:
| Model | DW-ACC Score |
|---|---|
| Qwen3-235B-A22B-Thinking | 54.6 |
| Gemini-2.5-pro | 52.2 |
| Qwen3-32B-Thinking | 47.4 |
Even top models achieve only ~40% accuracy at the highest difficulty level. Performance varies by up to 10 points across languages, revealing significant challenges in multilingual mathematical reasoning.
## Other Environment Requirements
There are no further requirements; PolyMath works out of the box with the OpenReward endpoint, and no API keys or other secrets are needed.
## Safety
This environment evaluates mathematical reasoning and does not present direct safety risks. Agents only solve math problems and submit answers.
## Citations
@article{wang2025polymath,
  author  = {Yiming Wang and Pei Zhang and Jialong Tang and Haoran Wei and Baosong Yang and Rui Wang and Chenshu Sun and Feitong Sun and Jiran Zhang and Junxuan Wu and Qiqian Cang and Yichang Zhang and Fei Huang and Junyang Lin and Fei Huang and Jingren Zhou},
  title   = {PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts},
  journal = {arXiv preprint arXiv:2504.18428},
  year    = {2025},
  url     = {https://arxiv.org/abs/2504.18428}
}