AMO-Bench


Description

AMO-Bench (Advanced Mathematical Olympiad Benchmark) is an environment for evaluating mathematical reasoning at or above International Mathematical Olympiad (IMO) difficulty. Based on the AMO-Bench benchmark by Meituan LongCat, it contains 50 original, expert-crafted problems designed to resist the data memorization and performance saturation seen in existing math benchmarks such as AIME. All problems are cross-validated by experts and require only a final answer (not a proof), enabling automatic grading. Problems span four answer types: numerical, set, variable (algebraic), and description.

Capabilities

  • Advanced mathematical reasoning at IMO competition level
  • Final-answer-based evaluation enabling automatic grading
  • Hybrid grading: deterministic math-verify parser for numerical/set/variable answers, LLM majority-vote grading for descriptive answers

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There is one split in this environment:

  • test: 50 tasks

Each task is one original mathematical problem. Problems span five categories: Functions & Sequences, Combinatorics, Algebraic Equations & Inequalities, Number Theory, and Geometry, and are distributed across four answer types:

  • Number: numerical answers
  • Description: descriptive/prose answers
  • Set: set-valued answers
  • Variable: algebraic/symbolic answers

Reward Structure

This is a single-turn environment. The agent submits a final answer via the answer tool. Reward is binary: 1.0 if correct, 0.0 otherwise. Grading uses a hybrid approach depending on answer type:

  • Number/Set types: Deterministic grading via the math-verify library, which parses and verifies mathematical expressions, rounding floats to 4 decimal places.
  • Variable types: Deterministic grading via math-verify combined with SymPy's solve() to verify algebraic equivalence by substituting test values.
  • Description types: LLM-based grading using o4-mini with majority vote over 5 judge responses. Each judge determines whether the submitted answer is semantically equivalent to the reference answer.

For non-description types, the environment also grades a truncated version of the extracted answer as a fallback and accepts a match on either form, as illustrated in the sketch below.
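
A minimal sketch of this hybrid flow, assuming math-verify's real parse/verify API and plain SymPy substitution. The function names, test points, and vote threshold are illustrative, not the environment's implementation (which combines math-verify with SymPy's solve() for variable types):

# Minimal sketch of the hybrid grading flow described above.
# Function names, test values, and fallback details are
# illustrative assumptions, not the environment's actual code.
from math_verify import parse, verify
import sympy

def grade_number_or_set(submitted: str, reference: str) -> bool:
    # math-verify parses both strings and checks mathematical
    # equivalence; the environment rounds floats to 4 decimal places.
    gold = parse(reference)
    pred = parse(submitted)
    return verify(gold, pred)

def grade_variable(submitted: str, reference: str) -> bool:
    # Illustrative equivalence check: substitute a few test values
    # into both expressions and compare numerically.
    pred, gold = sympy.sympify(submitted), sympy.sympify(reference)
    symbols = sorted(pred.free_symbols | gold.free_symbols, key=str)
    for value in (0.5, 1.25, 2.0):  # arbitrary test points
        subs = {s: value for s in symbols}
        if abs((pred - gold).subs(subs).evalf()) > 1e-9:
            return False
    return True

def grade_description(submitted: str, reference: str, judge) -> bool:
    # Majority vote over 5 independent judge calls (o4-mini in the
    # environment); judge returns True if the answers are equivalent.
    votes = [judge(submitted, reference) for _ in range(5)]
    return sum(votes) >= 3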

Data

Data is loaded from the Hugging Face dataset meituan-longcat/AMO-Bench at module import time using the datasets library. Each row contains a question ID, a problem statement in Markdown with LaTeX, an expert solution, a reference answer (often in \boxed{} format), and an answer type.
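
For inspection, the dataset can be loaded directly. load_dataset and the split name come from the text above; the column names are assumptions about the dataset schema:

# Inspecting the underlying dataset; column names are assumed keys.
from datasets import load_dataset

ds = load_dataset("meituan-longcat/AMO-Bench", split="test")
print(len(ds))  # 50 problems

row = ds[0]
print(row["problem"])      # problem statement, Markdown + LaTeX (assumed key)
print(row["answer"])       # reference answer, often \boxed{...} (assumed key)
print(row["answer_type"])  # number / set / variable / description (assumed key)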

Tools

Tool     Description
answer   Submit your final answer to the mathematical problem. Ends the episode.
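
For illustration, a call to answer might carry a payload shaped like the following; the argument name is an assumption, since the environment defines the tool's actual schema:

# Hypothetical shape of an answer-tool call; the "answer" argument
# name is an assumption, not the tool's documented schema.
submission = {
    "tool": "answer",
    "arguments": {"answer": r"\boxed{2025}"},
}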

Time Horizon

Single-turn. The agent receives one problem and submits one answer.

Environment Difficulty

AMO-Bench causes a significant accuracy drop relative to existing math benchmarks like AIME, with most models scoring below 40%. Reported accuracies for the strongest models:

Model                       AMO-Bench Accuracy
Qwen3-Max-Thinking          65.1%
Gemini 3 Pro                63.1%
GLM-4.7                     62.4%
Kimi-K2-Thinking            56.0%
GPT-5-Thinking (High)       52.4%
Qwen3-235B-A22B-Thinking    47.8%
DeepSeek-V3.1-Thinking      47.6%
o4-mini (High)              40.2%

Other Environment Requirements

  • OpenAI API key: Required for the o4-mini judge that grades description-type answers. Pass via secrets={"openai_api_key": "..."}.
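
A minimal sketch of shaping that secrets mapping; reading the key from a local environment variable is an assumed setup, and the client call that consumes it is not shown:

# Reading the judge key from a local environment variable (an assumed
# setup) and shaping the secrets mapping described above.
import os

secrets = {"openai_api_key": os.environ["OPENAI_API_KEY"]}
# Pass secrets=secrets when instantiating the environment; the exact
# client call depends on the OpenReward tooling and is not shown here.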

Safety

Agents in AMO-Bench answer mathematical problems in a standard environment. The environment does not present direct safety risks.

Citation

@article{an2025amobench,
  title={AMO-Bench: Large Language Models Still Struggle in High School Math Competitions},
  author={An, Shengnan and Cai, Xunliang and Cao, Xuezhi and Li, Xiaoyu and Lin, Yehao and Liu, Junlin and Lv, Xinxuan and Ma, Dan and Wang, Xuanlin and Wang, Ziwen and Zhou, Shuang},
  journal={arXiv preprint arXiv:2510.26768},
  year={2025},
  url={https://arxiv.org/abs/2510.26768}
}