# PolyMath
## Description
PolyMath is an environment for evaluating multilingual mathematical reasoning. It covers 18 languages across 4 difficulty levels (low, medium, high, top), with 125 problems per language at each level, totaling 9,000 high-quality problems. Problems range from K-12 to Olympiad and advanced frontier mathematics, with translations calibrated by language experts.
## Capabilities
- Solving mathematical problems across a broad difficulty range
- Reasoning in 18 languages: Arabic, Bengali, German, English, Spanish, French, Indonesian, Italian, Japanese, Korean, Malay, Portuguese, Russian, Swahili, Telugu, Thai, Vietnamese, Chinese
- Handling mathematical notation and expressions
## Compute Requirements
This is a single-turn environment with no sandbox.
## Tasks
There is one split in this environment:
- Test: 9,000 problems (18 languages × 4 difficulty levels × 125 problems)
Each task presents a math problem in a specific language at a specific difficulty level. The agent must provide a final answer.
## Reward Structure
This is a single-turn environment with binary reward:
- 1.0 — Correct answer
- 0.0 — Incorrect answer
Evaluation uses the math_verify library for mathematical equivalence checking, with fallback to normalized string comparison on parsing errors. No LLM grading is used — evaluation is deterministic.
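The two-stage grading scheme above can be sketched as follows. The `parse`/`verify` calls follow the public Math-Verify API, but the normalization rules in the fallback path are illustrative assumptions, not the environment's exact implementation:

```python
import re

def normalize(ans: str) -> str:
    """Fallback normalization (assumed rules): strip \\boxed{}, $ delimiters,
    surrounding whitespace, and a trailing period."""
    s = ans.strip()
    m = re.search(r"\\boxed\{(.*)\}", s)
    if m:
        s = m.group(1)
    s = s.strip().strip("$").strip().rstrip(".")
    return s.replace(" ", "")

def grade(prediction: str, gold: str) -> float:
    """Binary reward: math_verify equivalence, with a string-comparison fallback."""
    try:
        from math_verify import parse, verify  # pip install math-verify
        return 1.0 if verify(parse(gold), parse(prediction)) else 0.0
    except Exception:
        # Parsing error (or library unavailable): normalized string comparison.
        return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

Because grading is deterministic, two runs over the same answers always produce identical rewards; there is no LLM judge in the loop.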
## Data
Data is organized as 72 Parquet files (one per language-difficulty combination, e.g., en_high_0000.parquet). Each file contains 125 problems with question text, ground truth answer, language code, and difficulty level.
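The shard layout can be enumerated directly. This is a minimal sketch: the language codes are assumed to be the ISO-639-1 codes for the 18 listed languages, the filename pattern is generalized from the one example shard (`en_high_0000.parquet`), and the column names in the commented read are assumptions:

```python
# Enumerate the 72 expected Parquet shards (18 languages x 4 difficulty levels).
LANGUAGES = ["ar", "bn", "de", "en", "es", "fr", "id", "it", "ja",
             "ko", "ms", "pt", "ru", "sw", "te", "th", "vi", "zh"]
LEVELS = ["low", "medium", "high", "top"]

shards = [f"{lang}_{level}_0000.parquet" for lang in LANGUAGES for level in LEVELS]

# Reading one shard requires pandas + pyarrow; column names here are assumed:
# import pandas as pd
# df = pd.read_parquet("en_high_0000.parquet")
# assert len(df) == 125  # 125 problems per language-difficulty pair
```

Iterating over `shards` visits every language-difficulty combination exactly once, matching the 18 × 4 × 125 = 9,000 problem count.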
Source: Qwen/PolyMath
## Tools
| Tool | Description |
|---|---|
| answer | Submit your final answer to the math problem. Evaluated for mathematical equivalence using math_verify. |
## Time Horizon
PolyMath is a single-turn environment: the agent receives a math problem and submits a single answer via one call to the answer tool.
## Environment Difficulty
The original paper (NeurIPS 2025 D&B Track) evaluates 44 models using Difficulty-Weighted Accuracy (DW-ACC); selected results:
| Model | DW-ACC Score |
|---|---|
| Qwen3-235B-A22B-Thinking | 54.6 |
| Gemini-2.5-pro | 52.2 |
| Qwen3-32B-Thinking | 47.4 |
Even top models achieve only ~40% accuracy at the highest difficulty level. Performance varies by up to 10 points across languages, revealing significant challenges in multilingual mathematical reasoning.
## Other Environment Requirements
There are no further requirements; PolyMath works out of the box with the OpenReward endpoint, and no API keys or other secrets are needed.
## Safety
This environment evaluates mathematical reasoning and does not present direct safety risks. Agents only solve math problems and submit answers.
## Citations
@article{wang2025polymath,
  author  = {Yiming Wang and Pei Zhang and Jialong Tang and Haoran Wei and Baosong Yang and Rui Wang and Chenshu Sun and Feitong Sun and Jiran Zhang and Junxuan Wu and Qiqian Cang and Yichang Zhang and Fei Huang and Junyang Lin and Fei Huang and Jingren Zhou},
  title   = {PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts},
  journal = {arXiv preprint arXiv:2504.18428},
  year    = {2025},
  url     = {https://arxiv.org/abs/2504.18428}
}