AMO-Bench
Description
AMO-Bench (Advanced Mathematical Olympiad Benchmark) is an environment for evaluating mathematical reasoning at or above International Mathematical Olympiad (IMO) difficulty. Based on the benchmark released by Meituan LongCat, the environment contains 50 original, expert-crafted problems designed to resist the data memorization and performance saturation seen in existing math benchmarks such as AIME. All problems are cross-validated by experts and require only a final answer (not a proof), enabling automatic grading. Problems span four answer types: numerical, set, variable (algebraic), and description.
Capabilities
- Advanced mathematical reasoning at IMO competition level
- Final-answer-based evaluation enabling automatic grading
- Hybrid grading: deterministic `math-verify` parsing for numerical/set/variable answers, LLM majority-vote grading for descriptive answers
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
There is one split in this environment:
- test: 50 tasks
Each task is one original mathematical problem spanning five categories: Functions & Sequences, Combinatorics, Algebraic Equations & Inequalities, Number Theory, and Geometry. Problems are distributed across four answer types:
- Number: numerical answers
- Description: descriptive/prose answers
- Set: set-valued answers
- Variable: algebraic/symbolic answers
Reward Structure
This is a single-turn environment. The agent submits a final answer via the `answer` tool. The reward is binary: 1.0 if correct, 0.0 otherwise. Grading uses a hybrid approach depending on the answer type:
- Number/Set types: Deterministic grading via the `math-verify` library, which parses and verifies mathematical expressions, rounding floats to 4 decimal places.
- Variable types: Deterministic grading via `math-verify` combined with SymPy's `solve()` to verify algebraic equivalence by substituting test values.
- Description types: LLM-based grading using `o4-mini` with a majority vote over 5 judge responses. Each judge determines whether the submitted answer is semantically equivalent to the reference answer.
For non-description types, the environment also attempts a truncated version of the extracted answer as a fallback, and accepts either match.
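The three grading modes above can be sketched in a few lines. This is an illustrative sketch only: the real environment grades with the `math-verify` library and an `o4-mini` judge, whereas here SymPy stands in for the deterministic checks and the judge verdicts are assumed to already be collected as booleans.

```python
from collections import Counter

import sympy

def grade_number(submitted: str, reference: str, places: int = 4) -> bool:
    """Numeric/set-style grading: compare values after rounding to 4 decimal places."""
    try:
        a = float(sympy.sympify(submitted))
        b = float(sympy.sympify(reference))
    except (sympy.SympifyError, TypeError, ValueError):
        return False
    return round(a, places) == round(b, places)

def grade_variable(submitted: str, reference: str) -> bool:
    """Variable grading: check algebraic equivalence of two symbolic expressions."""
    try:
        diff = sympy.simplify(sympy.sympify(submitted) - sympy.sympify(reference))
    except (sympy.SympifyError, TypeError):
        return False
    return diff == 0

def grade_description(verdicts: list[bool]) -> bool:
    """Description grading: majority vote over the judges' equivalence verdicts."""
    return Counter(verdicts).most_common(1)[0][0]
```

For example, `grade_number("0.5", "1/2")` and `grade_variable("(x+1)**2", "x**2 + 2*x + 1")` both pass, while a value differing in the fourth decimal place does not.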
Data
Data is loaded from HuggingFace meituan-longcat/AMO-Bench at module import time using the datasets library. Each row contains a question ID, problem prompt in Markdown with LaTeX, expert solution, reference answer (often in \boxed{} format), and answer type.
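As a sketch of working with rows of this shape, the snippet below tallies problems by answer type. The hardcoded rows and the `answer_type` field name are illustrative stand-ins for records loaded via `datasets.load_dataset("meituan-longcat/AMO-Bench")`; the dataset's actual column names may differ.

```python
from collections import Counter

# Stand-in rows mimicking the dataset schema described above (hypothetical field names).
rows = [
    {"id": "amo-1", "answer_type": "Number"},
    {"id": "amo-2", "answer_type": "Set"},
    {"id": "amo-3", "answer_type": "Number"},
    {"id": "amo-4", "answer_type": "Description"},
]

# Count how many problems fall under each answer type.
by_type = Counter(row["answer_type"] for row in rows)
print(by_type)  # Counter({'Number': 2, 'Set': 1, 'Description': 1})
```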
Tools
| Tool | Description |
|---|---|
| `answer` | Submit your final answer to the mathematical problem. Ends the episode. |
Time Horizon
Single-turn. The agent receives one problem and submits one answer.
Environment Difficulty
AMO-Bench produces a significant accuracy drop relative to existing math benchmarks such as AIME: even the strongest models score below 70%, and several reasoning models fall below 50%.
| Model | AMO-Bench Accuracy |
|---|---|
| Qwen3-Max-Thinking | 65.1% |
| Gemini 3 Pro | 63.1% |
| GLM-4.7 | 62.4% |
| Kimi-K2-Thinking | 56.0% |
| GPT-5-Thinking (High) | 52.4% |
| Qwen3-235B-A22B-Thinking | 47.8% |
| DeepSeek-V3.1-Thinking | 47.6% |
| o4-mini (High) | 40.2% |
Other Environment Requirements
- OpenAI API key: Required for LLM-based grading of description-type answers, which uses the `o4-mini` judge. Pass via `secrets={"openai_api_key": "..."}`.
Safety
Agents in AMO-Bench answer mathematical problems in a standard environment. The environment does not present direct safety risks.
Citation
@article{an2025amobench,
title={AMO-Bench: Large Language Models Still Struggle in High School Math Competitions},
author={An, Shengnan and Cai, Xunliang and Cao, Xuezhi and Li, Xiaoyu and Lin, Yehao and Liu, Junlin and Lv, Xinxuan and Ma, Dan and Wang, Xuanlin and Wang, Ziwen and Zhou, Shuang},
journal={arXiv preprint arXiv:2510.26768},
year={2025},
url={https://arxiv.org/abs/2510.26768}
}