AIME2025
Description
AIME2025 is an environment for evaluating mathematical reasoning on problems from the 2025 American Invitational Mathematics Examination (AIME). The agent receives an AIME competition problem and must solve it, then submit a final integer answer. Answers are graded deterministically using the math-verify library.
Capabilities
- Advanced mathematical reasoning and problem solving
- Competition-level number theory, algebra, combinatorics, and geometry
- Producing and verifying exact numerical answers
Compute Requirements
AIME2025 is a single-turn environment with no sandbox or file system requirements. The agent only needs to reason about the problem and submit an answer.
Tasks
There are 30 tasks in a single test split, consisting of:
- 15 problems from AIME I 2025
- 15 problems from AIME II 2025
Each task presents one AIME problem. The agent receives the problem statement and must submit a single integer answer (AIME answers are always integers between 000 and 999).
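The sketch below illustrates the shape of one task record. The field names (problem, answer) are assumptions for illustration and are not necessarily the environment's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AimeTask:
    """One AIME 2025 problem presented to the agent (illustrative field names)."""
    problem: str  # full problem statement shown to the agent
    answer: int   # gold answer, an integer in the range 000-999

# Hypothetical record shape (not an actual 2025 problem statement):
example = AimeTask(
    problem="Find the remainder when ... is divided by 1000.",
    answer=123,
)
```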
Reward Structure
This is a sparse, verifiable reward environment. The agent receives a single reward at the end of the episode when it calls the answer tool:
- 1.0 if the submitted answer is correct.
- 0.0 if the submitted answer is incorrect.
Grading is deterministic. The submitted answer is parsed and compared against the gold answer using the math-verify library. No LLM grader is used.
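A minimal sketch of this kind of deterministic check using math-verify is shown below. The exact parsing configuration and error handling used by AIME2025 may differ; the grade function name here is illustrative.

```python
from math_verify import parse, verify


def grade(submitted: str, gold: str) -> float:
    """Return 1.0 if the submitted answer matches the gold answer, else 0.0."""
    try:
        # verify() takes the gold expression first, then the candidate answer.
        return 1.0 if verify(parse(gold), parse(submitted)) else 0.0
    except Exception:
        # Unparseable submissions are treated as incorrect.
        return 0.0


print(grade("70", "70"))  # -> 1.0
```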
Data
The 30 AIME 2025 problems are sourced from the HuggingFace dataset yentinglin/aime_2025. Each record contains a problem statement and its corresponding integer answer. The data is stored as a Parquet file (aime_2025_problems.parquet) downloaded at build time.
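As a rough sketch, the problems can be inspected either from the local Parquet file or directly from the Hub. The column names and split name below are assumptions; verify them against the actual file or dataset card.

```python
import pandas as pd

# Read the Parquet file bundled at build time.
df = pd.read_parquet("aime_2025_problems.parquet")
print(len(df))              # expected: 30 rows (AIME I + AIME II)
print(df.columns.tolist())  # e.g. ["problem", "answer"] (assumed column names)

# Or pull directly from the HuggingFace Hub (split name assumed to be "train"):
# from datasets import load_dataset
# ds = load_dataset("yentinglin/aime_2025", split="train")
```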
Tools
AIME2025 provides a single tool:
| Tool | Description |
|---|---|
| answer | Submit a final integer answer to the problem. Accepts a string answer parameter. This call ends the episode and returns the reward. |
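A JSON-schema-style declaration of the answer tool might look like the sketch below. The actual schema exposed by AIME2025 may differ in naming and detail.

```python
# Illustrative tool declaration; field wording is an assumption, not the environment's exact schema.
ANSWER_TOOL = {
    "name": "answer",
    "description": "Submit a final integer answer to the problem. Ends the episode.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "description": "The final answer, an integer from 000 to 999.",
            }
        },
        "required": ["answer"],
    },
}
```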
Time Horizon
AIME2025 is a single-turn environment. The agent reads the problem, reasons about it, and submits one answer via the answer tool. There is exactly one tool call per episode.
Environment Difficulty
| Model | Accuracy |
|---|---|
| Claude Sonnet 4.5 (python) | 100% |
| Gemini 3 Pro (with code execution) | 100% |
| GPT-5.2 Thinking (no tools) | 100% |
| Grok 4 Heavy (with python) | 100% |
| Step 3.5 Flash (parallel thinking) | 99.9% |
| Claude Opus 4.6 | 99.79% |
| Grok 4 (with python) | 98.8% |
Other Environment Requirements
There are no further environment requirements; AIME2025 works out of the box with the OpenReward endpoint without any external API keys.
Safety
AIME2025 poses no direct safety risks. The agent interacts only with mathematical problem statements and submits numerical answers. There is no access to external systems, no file system interaction, and no opportunity for harmful actions.