AIME2025
Description
AIME2025 is an environment for evaluating mathematical reasoning on problems from the 2025 American Invitational Mathematics Examination (AIME). The agent receives an AIME competition problem and must solve it, then submit a final integer answer. Answers are graded deterministically using the math-verify library.
Capabilities
- Advanced mathematical reasoning and problem solving
- Competition-level number theory, algebra, combinatorics, and geometry
- Producing and verifying exact numerical answers
Compute Requirements
AIME2025 is a single-turn environment with no sandbox or file system requirements. The agent only needs to reason about the problem and submit an answer.
Tasks
There are 30 tasks in a single test split, consisting of:
- 15 problems from AIME I 2025
- 15 problems from AIME II 2025
Each task presents one AIME problem. The agent receives the problem statement and must submit a single integer answer (AIME answers are always integers between 000 and 999).
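The sketch below illustrates the shape of one task record. The field names (problem, answer) are assumptions for illustration and are not necessarily the environment's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AimeTask:
    """One AIME 2025 problem presented to the agent (illustrative field names)."""
    problem: str  # full problem statement shown to the agent
    answer: int   # gold answer, an integer in the range 000-999

# Hypothetical record shape (not an actual 2025 problem statement):
example = AimeTask(
    problem="Find the remainder when ... is divided by 1000.",
    answer=123,
)
```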
Reward Structure
This is a sparse, verifiable reward environment. The agent receives a single reward at the end of the episode when it calls the answer tool:
- 1.0 if the submitted answer is correct.
- 0.0 if the submitted answer is incorrect.
Grading is deterministic. The submitted answer is parsed and compared against the gold answer using the math-verify library. No LLM grader is used.
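A minimal sketch of this kind of deterministic check using math-verify is shown below. The exact parsing configuration and error handling used by AIME2025 may differ; the grade function name here is illustrative.

```python
from math_verify import parse, verify


def grade(submitted: str, gold: str) -> float:
    """Return 1.0 if the submitted answer matches the gold answer, else 0.0."""
    try:
        # verify() takes the gold expression first, then the candidate answer.
        return 1.0 if verify(parse(gold), parse(submitted)) else 0.0
    except Exception:
        # Unparseable submissions are treated as incorrect.
        return 0.0


print(grade("70", "70"))  # -> 1.0
```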
Data
The 30 AIME 2025 problems are sourced from the HuggingFace dataset yentinglin/aime_2025. Each record contains a problem statement and its corresponding integer answer. The data is stored as a Parquet file (aime_2025_problems.parquet) downloaded at build time.
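As a rough sketch, the problems can be inspected either from the local Parquet file or directly from the Hub. The column names and split name below are assumptions; verify them against the actual file or dataset card.

```python
import pandas as pd

# Read the Parquet file bundled at build time.
df = pd.read_parquet("aime_2025_problems.parquet")
print(len(df))              # expected: 30 rows (AIME I + AIME II)
print(df.columns.tolist())  # e.g. ["problem", "answer"] (assumed column names)

# Or pull directly from the HuggingFace Hub (split name assumed to be "train"):
# from datasets import load_dataset
# ds = load_dataset("yentinglin/aime_2025", split="train")
```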
Tools
AIME2025 provides a single tool:
| Tool | Description |
|---|---|
| answer | Submit a final integer answer to the problem. Accepts a string answer parameter. This call ends the episode and returns the reward. |
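A JSON-schema-style declaration of the answer tool might look like the sketch below. The actual schema exposed by AIME2025 may differ in naming and detail.

```python
# Illustrative tool declaration; field wording is an assumption, not the environment's exact schema.
ANSWER_TOOL = {
    "name": "answer",
    "description": "Submit a final integer answer to the problem. Ends the episode.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "description": "The final answer, an integer from 000 to 999.",
            }
        },
        "required": ["answer"],
    },
}
```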
Time Horizon
AIME2025 is a single-turn environment. The agent reads the problem, reasons about it, and submits one answer via the answer tool. There is exactly one tool call per episode.
Environment Difficulty
| Model | Accuracy |
|---|---|
| Claude Sonnet 4.5 (python) | 100% |
| Gemini 3 Pro (with code execution) | 100% |
| GPT-5.2 Thinking (no tools) | 100% |
| Grok 4 Heavy (with python) | 100% |
| Step 3.5 Flash (parallel thinking) | 99.9% |
| Claude Opus 4.6 | 99.79% |
| Grok 4 (with python) | 98.8% |
Other Environment Requirements
There are no further environment requirements; AIME2025 works out of the box with the OpenReward endpoint without any external API keys.
Safety
AIME2025 poses no direct safety risks. The agent interacts only with mathematical problem statements and submits numerical answers. There is no access to external systems, no file system interaction, and no opportunity for harmful actions.