IneqMath
Description
IneqMath is an environment for evaluating an agent's ability to prove mathematical inequalities. Built on an expert-curated dataset of Olympiad-level inequality problems, it tests advanced reasoning skills including discovering tight bounds, applying theorems strategically, and constructing rigorous proofs. The benchmark recasts inequality proving as two verifiable subtasks: bound estimation and relation prediction.
This OpenReward implementation is ported from the Harbor Framework implementation originally written by 1171-jpg.
Capabilities
- Proving mathematical inequalities at the Olympiad level
- Discovering tight bounds for expressions
- Applying mathematical theorems strategically (AM-GM, Cauchy-Schwarz, etc.; see the worked example after this list)
- Constructing step-by-step rigorous proofs
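As a concrete illustration of the bound-estimation style of reasoning, consider a classic warm-up (illustrative only, not drawn from the dataset): find the largest constant C such that a/b + b/a ≥ C for all positive a, b.

```latex
% Illustrative bound-estimation problem (not from the IneqMath dataset):
% find the largest C with  a/b + b/a >= C  for all a, b > 0.
% AM-GM on the two terms gives
\[
  \frac{a}{b} + \frac{b}{a}
    \;\ge\; 2\sqrt{\frac{a}{b} \cdot \frac{b}{a}}
    \;=\; 2,
\]
% with equality iff a = b, so the tight constant is C = 2.
```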
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.
Tasks
There is one split in this environment:
- Test: 100 inequality-proving tasks
Problems are commissioned from IMO-level medalists to ensure novelty and minimize overlap with LLM training data.
Reward Structure
This is a multi-turn, sandbox-based environment: the agent writes its final answer to /app/answer.txt and calls submit_answer. Each task is one of two types:
- Bound estimation: Find the largest/smallest constant C. An LLM judge (gpt-4o-mini) verifies mathematical equivalence with the ground truth.
- Relation prediction: Identify the correct relation (≤, ≥, =, <, >) between expressions. Verified by exact match.
Reward is 1.0 for correct answers, 0.0 otherwise.
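The two verification paths can be summarized with a minimal sketch, assuming hypothetical helper names (the actual OpenReward grader is not part of this README):

```python
from pathlib import Path

ANSWER_PATH = Path("/app/answer.txt")  # where the agent writes its answer

def grade_relation(answer: str, ground_truth: str) -> float:
    """Relation prediction: exact match against the gold relation."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def grade_bound(answer: str, ground_truth: str, judge) -> float:
    """Bound estimation: an LLM judge (gpt-4o-mini) checks mathematical
    equivalence, so e.g. '2*sqrt(2)' and 'sqrt(8)' should both pass.
    `judge` is a hypothetical callable wrapping the judge model."""
    verdict = judge(
        "Are these two constants mathematically equal? Reply YES or NO.\n"
        f"Submitted: {answer}\nReference: {ground_truth}"
    )
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0

def grade(task_type: str, ground_truth: str, judge=None) -> float:
    answer = ANSWER_PATH.read_text()
    if task_type == "relation":
        return grade_relation(answer, ground_truth)
    assert judge is not None, "bound estimation requires an LLM judge"
    return grade_bound(answer, ground_truth, judge)
```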
Data
Each task directory contains an instruction.md with the inequality problem and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
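A typical task directory might look like this (file names per the description above; the task name and the exact contents of tests/ are hypothetical):

```
task_0001/               # hypothetical task name
├── instruction.md       # the inequality problem statement
└── tests/               # verification scripts run after submit_answer
```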
Source: AI4Math/IneqMath
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification. |
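To make the workflow concrete, here is an illustrative solve-and-submit sequence. The `call` helper, argument names, and the `check_bound.py` script are hypothetical; the actual tool-invocation format depends on the agent harness.

```python
# Illustrative tool-call sequence; `call` is a hypothetical stand-in
# for the harness's real tool dispatcher.
def call(tool: str, **kwargs) -> None:
    print(f"-> {tool}({kwargs})")

call("view", path="instruction.md")                # read the problem
call("bash", command="python3 check_bound.py")     # numerically probe a candidate bound (hypothetical script)
call("create_file", path="/app/answer.txt", content="C = 2")  # write the final answer
call("submit_answer")                              # trigger automated verification
```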
Time Horizon
IneqMath is a multi-turn environment. Agents read the problem, develop a proof strategy, implement and test their solution, and submit for verification.
Environment Difficulty
IneqMath is challenging. The original paper evaluates 29 LLMs and finds a dramatic gap between answer accuracy and reasoning soundness:
| Model | Overall Accuracy | Answer-Only Accuracy |
|---|---|---|
| o3 | 21.0% | 93.5% |
| o4-mini | 15.5% | 62.0% |
| o3-mini | 9.5% | - |
| o1 | 8.0% | 62.5% |
The drop of up to 72.5 percentage points from answer-only to overall accuracy (o3) reveals that while LLMs often find correct final answers, their reasoning chains remain fragile under step-wise scrutiny.
Other Environment Requirements
There are no further environment requirements; IneqMath works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in IneqMath solve mathematical inequality problems in a sandboxed environment. The environment does not present direct safety risks.
Citations
```bibtex
@inproceedings{sheng2025ineqmath,
  author    = {Jiayi Sheng and Luna Lyu and Jikai Jin and Tony Xia and Alex Gu and James Zou and Pan Lu},
  title     = {Solving Inequality Proofs with Large Language Models},
  booktitle = {NeurIPS 2025 Spotlight},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.07927}
}
```