AIME2026
Description
AIME2026 is an environment for evaluating mathematical reasoning on 30 problems from the 2026 American Invitational Mathematics Examination (AIME). AIME is an invitational competition for high school students who qualify through the AMC 10/12, historically about the top 2.5% of AMC 10 scorers and the top 5% of AMC 12 scorers. Problems cover algebra, geometry, number theory, combinatorics, and probability. Every answer is an integer, validated via symbolic mathematical equivalence.
Capabilities
- High school competition mathematics at AIME difficulty
- Integer answer validation with symbolic equivalence checking
- Coverage of algebra, geometry, number theory, combinatorics, and probability
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
Tasks
There is one split in this environment:
- test: 30 tasks
Each problem requires a single integer answer in the range 0 to 999, inclusive.
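The answer format can be checked before submission. The following is a minimal sketch, assuming the agent holds its final answer as a string; `parse_aime_answer` is a hypothetical helper, not part of the environment.

```python
def parse_aime_answer(text: str) -> int:
    """Parse a candidate answer into an integer in the AIME range 0..999.

    Hypothetical helper for illustration; the environment does not ship it.
    Leading zeros (for example "042") are accepted, as on the AIME answer form.
    """
    cleaned = text.strip()
    # Accept ASCII digits only, so signs, decimals, and words are rejected.
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError(f"not a non-negative integer: {text!r}")
    value = int(cleaned)
    if not 0 <= value <= 999:
        raise ValueError(f"outside the AIME range 0-999: {value}")
    return value
```

A call such as `parse_aime_answer("042")` yields the integer 42, while inputs like `"1000"` or `"-3"` raise `ValueError`.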
Reward Structure
Single-turn evaluation with deterministic grading. The agent submits an answer via the answer tool. The answer is verified using the math_verify library for symbolic mathematical equivalence. Reward is 1.0 if correct, 0.0 if incorrect.
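The binary reward can be illustrated with a simplified stand-in. This sketch does not call math_verify. It uses exact rational comparison from the Python standard library to show the idea that equivalent forms of the same value (for example `42`, `42.0`, and `84/2`) earn the same reward. The function names `to_number` and `reward` are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def to_number(text: str) -> Optional[Fraction]:
    """Convert a numeric string to an exact Fraction, or None if it is not numeric."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def reward(submitted: str, reference: str) -> float:
    """Return 1.0 if the two strings denote the same number, else 0.0.

    Simplified stand-in for symbolic equivalence checking; it handles
    only integers, decimals, and simple fractions.
    """
    got = to_number(submitted)
    want = to_number(reference)
    if got is None or want is None:
        return 0.0
    return 1.0 if got == want else 0.0
```

Because the comparison is exact and has no randomness, the same submission always receives the same reward, which matches the deterministic grading described above.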
Data
The problems are stored in aime_2026_problems.parquet (30 problems), hosted on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| answer | Submit an integer answer. Evaluated via symbolic equivalence checking. Ends the episode. |
Time Horizon
Single-turn. The agent reads the problem and submits one answer.
Environment Difficulty
AIME 2026 represents standard AIME difficulty. MathArena reports the following accuracies for frontier models:
| Model | Accuracy |
|---|---|
| Step 3.5 Flash | 96.7% |
| Kimi K2.5 | 95.8% |
| GLM 5 | 95.8% |
| DeepSeek-V3.2 | 94.2% |
| Qwen3.5-397B-A17B | 93.3% |
Top reasoning models now achieve near-perfect scores on AIME-level problems.
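To read the table in terms of problems rather than percentages, each accuracy can be scaled by the 30-problem test split. The sketch below does only that arithmetic; `problems_solved` is a hypothetical helper. Some results are not whole numbers (95.8% corresponds to about 28.74 problems), which would be consistent with accuracies averaged over several runs, though that is an inference and not something stated here.

```python
NUM_PROBLEMS = 30  # size of the test split

# Accuracies (percent) copied from the table above.
ACCURACY_PCT = {
    "Step 3.5 Flash": 96.7,
    "Kimi K2.5": 95.8,
    "GLM 5": 95.8,
    "DeepSeek-V3.2": 94.2,
    "Qwen3.5-397B-A17B": 93.3,
}


def problems_solved(accuracy_pct: float, num_problems: int = NUM_PROBLEMS) -> float:
    """Convert a percentage accuracy into an average number of problems solved."""
    return accuracy_pct / 100.0 * num_problems


if __name__ == "__main__":
    for model, pct in ACCURACY_PCT.items():
        print(f"{model}: {problems_solved(pct):.2f} of {NUM_PROBLEMS}")
```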
Other Environment Requirements
There are no further environment requirements; AIME2026 works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in AIME2026 solve competition mathematics problems in a standard environment. The environment does not present direct safety risks.