IMO-Bench
Description
IMO-Bench is an environment for evaluating agents on International Mathematical Olympiad (IMO) problems. It contains three sub-environments targeting different mathematical capabilities: AnswerBench (numerical answer extraction), GradingBench (solution grading), and ProofBench (proof generation). Problems span four IMO categories: Algebra, Combinatorics, Geometry, and Number Theory.
Capabilities
- Solving mathematical olympiad problems requiring advanced reasoning
- Generating rigorous mathematical proofs
- Grading mathematical solutions for correctness
- Reasoning across Algebra, Combinatorics, Geometry, and Number Theory
Compute Requirements
IMO-Bench does not require a sandbox. It has minimal compute requirements.
License
Tasks
IMO-Bench contains three environment variants, each with 5 splits (all, Algebra, Combinatorics, Geometry, Number Theory). All splits are test-only. Total: 1,460 tasks.
- AnswerBench (400 tasks): Problems with short numerical answers (100 per category). The agent solves the problem and submits an answer verified by the math_verify library.
- ProofBench (60 tasks): Problems requiring full proof generation (30 basic + 30 advanced). The agent writes a proof that is graded on the IMO 0-7 scale by an LLM grader (gemini-2.5-pro).
- GradingBench (1,000 tasks): Problems paired with a proposed solution and a ground-truth grade. The agent analyzes the solution and assigns a grade (incorrect, partial, almost, correct).
Reward Structure
This is a sparse-reward environment: each task yields a single reward from exactly one call to the answer tool.
- AnswerBench: Binary reward. 1.0 for a correct answer (verified by math_verify), 0.0 otherwise. No LLM graders.
- GradingBench: Binary reward. 1.0 if the extracted grade matches the expected grade, 0.0 otherwise. A Gemini LLM (gemini-2.5-flash) is used as a fallback to extract the grade from the agent's response when direct parsing fails.
- ProofBench: Continuous reward on the IMO scale. The proof is graded by an LLM grader (gemini-2.5-pro) which assigns a score from {0, 1, 6, 7} out of 7. Reward is the score divided by 7 (0.0 to 1.0).
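The three reward rules above can be sketched as plain functions. This is an illustrative sketch only, not the environment's actual implementation: the function names are hypothetical, and the verification/grading steps (math_verify for AnswerBench, the Gemini graders for the other two) are assumed to have already produced the inputs shown here.

```python
def answerbench_reward(is_correct: bool) -> float:
    """Binary reward: 1.0 if math_verify accepted the submitted answer."""
    return 1.0 if is_correct else 0.0

def gradingbench_reward(predicted_grade: str, expected_grade: str) -> float:
    """Binary reward: exact match on the four-way grade label
    (incorrect, partial, almost, correct)."""
    return 1.0 if predicted_grade == expected_grade else 0.0

def proofbench_reward(imo_score: int) -> float:
    """Continuous reward: the grader's IMO score (out of 7) normalized to [0, 1]."""
    if not 0 <= imo_score <= 7:
        raise ValueError("IMO score must be in [0, 7]")
    return imo_score / 7.0
```

For example, a proof scored 6/7 by the LLM grader would receive a reward of 6/7 ≈ 0.857, while a GradingBench response labeled "partial" against an expected "correct" would receive 0.0.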
Data
Problems are sourced from International Mathematical Olympiad competitions, stored as CSV files. Data files are stored on the OpenReward platform.
Tools
Agents are given a single tool across all three sub-environments:
answer: Submit an answer (numerical answer for AnswerBench, grading analysis for GradingBench, or proof for ProofBench). Returns the grade and score. This tool can only be called once per task.
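To make the single-tool interface concrete, the payloads below sketch what a call to the answer tool might look like in each sub-environment. The dict-style schema and field names are assumptions for illustration; only the tool name (answer) and the once-per-task constraint come from the environment description.

```python
# Hypothetical answer-tool payloads for each sub-environment.
# Field names ("tool", "content") are illustrative, not a documented schema.
answerbench_call = {"tool": "answer", "content": "42"}          # short numerical answer
gradingbench_call = {"tool": "answer", "content": "partial"}    # one of four grade labels
proofbench_call = {"tool": "answer", "content": "We argue by induction on n. Base case: ..."}

VALID_GRADES = {"incorrect", "partial", "almost", "correct"}

def is_valid_grading_call(call: dict) -> bool:
    """Check that a GradingBench payload carries one of the four grade labels."""
    return call.get("tool") == "answer" and call.get("content") in VALID_GRADES
```

Because the tool can only be called once per task, an agent must commit to its final answer, grade, or complete proof in a single submission; there is no opportunity to revise after seeing the returned grade and score.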
Time Horizon
IMO-Bench consists of single-turn environments. The agent receives a math problem and submits one answer. Each task requires exactly one tool call.
Environment Difficulty
[Statistics on environment difficulty here]
Other Environment Requirements
GradingBench and ProofBench require a Google Gemini API key (GEMINI_API_KEY secret) for LLM-based grading. AnswerBench has no additional requirements.
Safety
Agents in IMO-Bench are asked to solve, grade, or prove mathematical problems. The environment does not present direct safety risks, as agents only provide text answers with no access to external systems, tools, or the internet.
Citations
@inproceedings{luong2025imobench,
title={Towards Robust Mathematical Reasoning},
author={Luong, Thang and Hwang, Dawsen and Nguyen, Hoang H. and Ghiasi, Golnaz and Chervonyi, Yuri and Seo, Insuk and Kim, Junsu and Bingham, Garrett and Lee, Jonathan and Mishra, Swaroop and Zhai, Alex and Hu, Clara Huiyi and Michalewski, Henryk and Kim, Jimin and Ahn, Jeonghyun and Bae, Junhwi and Song, Xingyou and Trinh, Trieu H. and Le, Quoc V. and Jung, Junehyuk},
booktitle={Proceedings of EMNLP},
year={2025},
url={https://arxiv.org/abs/2511.01846}
}