DAPO-Math


Description

DAPO-Math is an environment for evaluating mathematical reasoning on competition-level problems from the DAPO-Math-17k dataset. The dataset was curated by ByteDance Seed and Tsinghua AIR as the training set for DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), an open-source reinforcement learning system for large language models. Problems span algebra, geometry, number theory, and combinatorics, with integer ground-truth answers verified via rule-based matching.

Capabilities

  • Solving competition-level mathematics problems
  • Step-by-step mathematical reasoning
  • Producing precise numerical answers

Compute Requirements

Minimal. No sandbox or code execution is used. The environment runs rule-based answer verification only.

License

Apache 2.0, matching the original dataset license.

Tasks

There are approximately 14,100 tasks in a single train split (English subset of the deduplicated DAPO-Math-17k dataset). Each task is a competition-level math problem with an integer answer.

Reward Structure

The reward is binary, determined by rule-based answer verification with the math_verify library (verification style: rule-lighteval/MATH_v2):

  • 1.0 if the submitted answer is mathematically equivalent to the ground truth
  • 0.0 otherwise

No LLM grader is used.
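The binary scheme above can be sketched in a few lines. This is a simplified illustration, not the actual math_verify logic: since all ground-truth answers in this dataset are integers, the sketch normalizes both strings and compares the parsed values. Function and helper names here are hypothetical.

```python
def binary_reward(submitted: str, ground_truth: str) -> float:
    """Return 1.0 if the submitted answer matches the ground truth, else 0.0.

    Simplified stand-in for rule-based verification: real verifiers such as
    math_verify also handle LaTeX, fractions, and symbolic equivalence.
    """
    def normalize(s: str):
        # Strip whitespace, thousands separators, and a leading '+' sign.
        s = s.strip().replace(",", "").lstrip("+")
        try:
            return int(s)          # compare as integers when possible
        except ValueError:
            return s               # fall back to exact string comparison

    return 1.0 if normalize(submitted) == normalize(ground_truth) else 0.0
```

For example, `binary_reward("  1,024", "1024")` yields 1.0, while `binary_reward("1023", "1024")` yields 0.0; there is no partial credit.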

Data

Problems are sourced from the DAPO-Math-17k dataset, using the deduplicated English subset provided by open-r1/DAPO-Math-17k-Processed. The dataset is loaded from Hugging Face at server startup.

Tools

  • answer: Submit a final answer for evaluation against the ground truth.
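A single-tool interface like this is typically exposed to the model as a function-calling schema. The README does not specify the exact wire format, so the following JSON-schema-style definition is a hypothetical sketch of what the `answer` tool might look like:

```python
# Hypothetical schema for the environment's single tool; the actual
# OpenReward tool-call format is not documented in this README.
ANSWER_TOOL = {
    "name": "answer",
    "description": "Submit a final answer for evaluation against the ground truth.",
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "description": "The final integer answer, as a string.",
            },
        },
        "required": ["answer"],
    },
}
```

Because the episode is single-turn, the first call to this tool ends the task and triggers verification.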

Time Horizon

Single-turn. The agent receives a math problem and submits one answer.

Environment Difficulty

Problems are competition-level (olympiad-style). As a reference point, the DAPO system trained on this dataset reached 50 points on AIME 2024 with a Qwen2.5-32B base model.

Other Environment Requirements

No external API keys or secrets are required. Grading is entirely rule-based.

Safety

This environment poses minimal safety risk. Agents solve self-contained math problems with no access to external systems, file systems, or network resources.

Citations

@article{yu2025dapo,
  title={DAPO: An Open-Source LLM Reinforcement Learning System at Scale},
  author={Yu, Qiying and Zhang, Zheng and Zhu, Ruofei and Yuan, Yufeng and Zuo, Xiaochen and Yue, Yu and Fan, Tiantian and Liu, Gaohong and Liu, Lingjun and Liu, Xin and Lin, Haibin and Lin, Zhiqi and Ma, Bole and Sheng, Guangming and Tong, Yuxuan and Zhang, Chi and Zhang, Mofan and Zhang, Wang and Zhu, Hang and Zhu, Jinhua and Chen, Jiaze and Chen, Jiangjie and Wang, Chengyi and Yu, Hongli and Dai, Weinan and Song, Yuxuan and Wei, Xiangpeng and Zhou, Hao and Liu, Jingjing and Ma, Wei-Ying and Zhang, Ya-Qin and Yan, Lin and Qiao, Mu and Wu, Yonghui and Wang, Mingxuan},
  journal={arXiv preprint arXiv:2503.14476},
  year={2025}
}