AIME2024
Description
AIME2024 is an environment for evaluating mathematical reasoning on problems from the 2024 American Invitational Mathematics Examination (AIME). The AIME is a prestigious high school mathematics competition administered by the Mathematical Association of America (MAA), serving as the second stage of the AMC pathway toward selection for the International Mathematical Olympiad (IMO). Problems require deep mathematical reasoning across algebra, combinatorics, geometry, and number theory, with answers that are integers in the range 000--999.
Capabilities
- Solving competition-level mathematics problems
- Multi-step mathematical reasoning
- Algebraic manipulation and computation
- Combinatorial and geometric reasoning
Compute Requirements
AIME2024 is a lightweight, single-turn environment. The agent receives a problem, reasons about it, and submits a single integer answer. No sandbox or significant compute resources are required beyond the agent's own inference.
Tasks
There are 30 tasks in a single test split, consisting of:
- AIME I 2024: 15 problems (problems 0--14)
- AIME II 2024: 15 problems (problems 15--29)
Each task presents the agent with a single AIME problem statement. The agent must solve the problem and submit an integer answer using the answer tool.
Reward Structure
AIME2024 uses a binary, deterministic reward:
- 1.0 if the submitted answer is correct
- 0.0 if the submitted answer is incorrect
Grading is performed using the math-verify library, which parses and verifies mathematical expressions. No LLM grader is used; the reward is fully deterministic.
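For concreteness, here is a minimal sketch of the grading step, assuming the parse/verify API documented by the math-verify library; the exact call sites inside AIME2024 are an assumption:

```python
# Minimal sketch of the deterministic grading step. Assumes the
# parse/verify API from https://github.com/huggingface/math-verify;
# how AIME2024 actually wires this up internally is an assumption.
from math_verify import parse, verify

def grade(submitted: str, ground_truth: str) -> float:
    """Return 1.0 iff the submitted answer matches the ground truth."""
    gold = parse(ground_truth)   # e.g. "204"
    answer = parse(submitted)    # tolerant of formats like "$204$"
    return 1.0 if verify(gold, answer) else 0.0

assert grade("204", "204") == 1.0
assert grade("205", "204") == 0.0
```

Because no LLM grader is involved, repeated runs on the same submission always yield the same reward.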
Data
The 30 AIME 2024 problems are sourced from the Maxwell-Jia/AIME_2024 dataset on Hugging Face. Each record contains the problem statement and the ground-truth integer answer. The data is stored as a Parquet file (aime_2024_problems.parquet) and downloaded at build time from HuggingFace.
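The records can be inspected directly with the standard datasets library; the split and column names below are assumptions based on the Maxwell-Jia/AIME_2024 dataset card and may differ:

```python
# Sketch of loading the source data with the `datasets` library.
# The split name and column names ("Problem", "Answer") are
# assumptions taken from the Maxwell-Jia/AIME_2024 dataset card.
from datasets import load_dataset

ds = load_dataset("Maxwell-Jia/AIME_2024", split="train")
print(len(ds))          # 30 problems (AIME I + AIME II, 2024)
row = ds[0]
print(row["Problem"])   # problem statement (assumed column name)
print(row["Answer"])    # ground-truth integer answer (assumed column name)
```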
Tools
AIME2024 exposes a single tool:
| Tool | Parameters | Description |
|---|---|---|
| answer | answer: str | Submits the agent's final answer. The answer is parsed and verified against the ground truth using math-verify. This call ends the episode. |
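Rendered as a JSON schema, the tool surface might look like the sketch below; the exact schema emitted by the environment is an assumption:

```python
# Hypothetical JSON-schema rendering of the single `answer` tool.
# The exact schema the environment emits is an assumption.
ANSWER_TOOL = {
    "name": "answer",
    "description": "Submit the final answer to the AIME problem; ends the episode.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "description": 'The final integer answer, e.g. "204".',
            }
        },
        "required": ["answer"],
    },
}
```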
Time Horizon
AIME2024 is a single-turn environment. The agent receives the problem in the prompt and is expected to make exactly one tool call (answer) to submit its solution. There is no multi-turn interaction.
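A full episode therefore reduces to a single call-and-grade step, sketched here with a hypothetical `solve` stand-in for the agent's inference:

```python
# Sketch of one single-turn episode. `solve` is a hypothetical
# placeholder for the agent's inference; grading reuses math-verify.
from math_verify import parse, verify

def solve(problem: str) -> str:
    """Placeholder for the agent's reasoning and answer tool call."""
    return "0"

def run_episode(problem: str, ground_truth: str) -> float:
    submitted = solve(problem)  # exactly one turn: one answer submission
    return 1.0 if verify(parse(ground_truth), parse(submitted)) else 0.0
```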
Environment Difficulty
AIME problems are challenging competition mathematics requiring creative multi-step reasoning. Selected model scores on AIME 2024:
| Model | Score (%) |
|---|---|
| GPT-5 pro (python) | 100 |
| o1 | 96 |
| o3 (no tools) | 91.6 |
| Qwen 3 Coder Next | 89.01 |
| o3-mini (high) | 87.3 |
| DeepSeek-R1 | 79.8 |
| o3-mini (medium) | 79.6 |
| OpenAI-o1-0912 | 74.4 |
| Magistral Medium | 73.6 |
| DeepSeek-R1-Zero | 71 |
Other Environment Requirements
There are no further environment requirements; AIME2024 works out of the box with the OpenReward endpoint and requires no external API keys.
Safety
AIME2024 is a purely mathematical evaluation environment. The agent solves well-defined competition problems with known correct answers and does not interact with external systems, APIs, or other agents. The environment presents no direct or indirect safety risks.
Citation
```bibtex
@dataset{GRAIME2024,
  author    = {General Reasoning Inc. Team},
  title     = {AIME2024},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/AIME2024}
}
```