AMO-Bench
Description
AMO-Bench (Advanced Mathematical Olympiad Benchmark) is an environment for evaluating mathematical reasoning at or above International Mathematical Olympiad (IMO) difficulty. Based on the benchmark released by Meituan LongCat, the environment contains 50 original, expert-crafted problems designed to resist the data memorization and performance saturation seen in existing math benchmarks such as AIME. All problems are cross-validated by experts and require only a final answer (not a proof), enabling automatic grading. Problems span four answer types: numerical, set, variable (algebraic), and description.
Capabilities
- Advanced mathematical reasoning at IMO competition level
- Final-answer-based evaluation enabling automatic grading
- Hybrid grading: deterministic `math-verify` parsing for numerical/set/variable answers, LLM majority-vote grading for descriptive answers
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
There is one split in this environment:
- test: 50 tasks
Each task is one original mathematical problem spanning five categories: Functions & Sequences, Combinatorics, Algebraic Equations & Inequalities, Number Theory, and Geometry. Problems are distributed across four answer types:
- Number: numerical answers
- Description: descriptive/prose answers
- Set: set-valued answers
- Variable: algebraic/symbolic answers
Reward Structure
This is a single-turn environment. The agent submits a final answer via the `answer` tool. The reward is binary: 1.0 if correct, 0.0 otherwise. Grading uses a hybrid approach depending on the answer type:
- Number/Set types: Deterministic grading via the `math-verify` library, which parses and verifies mathematical expressions, rounding floats to 4 decimal places.
- Variable types: Deterministic grading via `math-verify` combined with SymPy's `solve()` to verify algebraic equivalence by substituting test values.
- Description types: LLM-based grading using `o4-mini` with a majority vote over 5 judge responses. Each judge determines whether the submitted answer is semantically equivalent to the reference answer.
For non-description types, the environment also attempts a truncated version of the extracted answer as a fallback, and accepts either match.
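The three grading modes above can be sketched in a few lines. This is an illustrative sketch only: the real environment grades with the `math-verify` library and an `o4-mini` judge, whereas here SymPy stands in for the deterministic checks and the judge verdicts are assumed to already be collected as booleans.

```python
from collections import Counter

import sympy

def grade_number(submitted: str, reference: str, places: int = 4) -> bool:
    """Numeric/set-style grading: compare values after rounding to 4 decimal places."""
    try:
        a = float(sympy.sympify(submitted))
        b = float(sympy.sympify(reference))
    except (sympy.SympifyError, TypeError, ValueError):
        return False
    return round(a, places) == round(b, places)

def grade_variable(submitted: str, reference: str) -> bool:
    """Variable grading: check algebraic equivalence of two symbolic expressions."""
    try:
        diff = sympy.simplify(sympy.sympify(submitted) - sympy.sympify(reference))
    except (sympy.SympifyError, TypeError):
        return False
    return diff == 0

def grade_description(verdicts: list[bool]) -> bool:
    """Description grading: majority vote over the judges' equivalence verdicts."""
    return Counter(verdicts).most_common(1)[0][0]
```

For example, `grade_number("0.5", "1/2")` and `grade_variable("(x+1)**2", "x**2 + 2*x + 1")` both pass, while a value differing in the fourth decimal place does not.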
Data
Data is loaded from HuggingFace meituan-longcat/AMO-Bench at module import time using the datasets library. Each row contains a question ID, problem prompt in Markdown with LaTeX, expert solution, reference answer (often in \boxed{} format), and answer type.
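As a sketch of working with rows of this shape, the snippet below tallies problems by answer type. The hardcoded rows and the `answer_type` field name are illustrative stand-ins for records loaded via `datasets.load_dataset("meituan-longcat/AMO-Bench")`; the dataset's actual column names may differ.

```python
from collections import Counter

# Stand-in rows mimicking the dataset schema described above (hypothetical field names).
rows = [
    {"id": "amo-1", "answer_type": "Number"},
    {"id": "amo-2", "answer_type": "Set"},
    {"id": "amo-3", "answer_type": "Number"},
    {"id": "amo-4", "answer_type": "Description"},
]

# Count how many problems fall under each answer type.
by_type = Counter(row["answer_type"] for row in rows)
print(by_type)  # Counter({'Number': 2, 'Set': 1, 'Description': 1})
```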
Tools
| Tool | Description |
|---|---|
| `answer` | Submit your final answer to the mathematical problem. Ends the episode. |
Time Horizon
Single-turn. The agent receives one problem and submits one answer.
Environment Difficulty
AMO-Bench produces a significant accuracy drop relative to existing math benchmarks such as AIME: even the strongest models score below 70%, and several reasoning models fall below 50%.
| Model | AMO-Bench Accuracy |
|---|---|
| Qwen3-Max-Thinking | 65.1% |
| Gemini 3 Pro | 63.1% |
| GLM-4.7 | 62.4% |
| Kimi-K2-Thinking | 56.0% |
| GPT-5-Thinking (High) | 52.4% |
| Qwen3-235B-A22B-Thinking | 47.8% |
| DeepSeek-V3.1-Thinking | 47.6% |
| o4-mini (High) | 40.2% |
Other Environment Requirements
- OpenAI API key: Required for LLM-based grading of description-type answers, which uses the `o4-mini` judge. Pass via `secrets={"openai_api_key": "..."}`.
Safety
Agents in AMO-Bench answer mathematical problems in a standard environment. The environment does not present direct safety risks.
Citation
@article{an2025amobench,
title={AMO-Bench: Large Language Models Still Struggle in High School Math Competitions},
author={An, Shengnan and Cai, Xunliang and Cao, Xuezhi and Li, Xiaoyu and Lin, Yehao and Liu, Junlin and Lv, Xinxuan and Ma, Dan and Wang, Xuanlin and Wang, Ziwen and Zhou, Shuang},
journal={arXiv preprint arXiv:2510.26768},
year={2025},
url={https://arxiv.org/abs/2510.26768}
}