GSM8K
Description
GSM8K is an environment for evaluating grade school math word problem solving. It is based on OpenAI's GSM8K benchmark: agents are given math word problems that require 2-8 steps of basic arithmetic and must provide the final numerical answer.
Capabilities
- Multi-step arithmetic reasoning
- Grade school math problem solving
- Numerical answer extraction
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
There are two splits in this environment:
- train: 7,473 grade school math word problems.
- test: 1,319 grade school math word problems.
Total: 8,792 tasks. Each task presents a math word problem requiring 2-8 steps of basic arithmetic (addition, subtraction, multiplication, division) to solve.
Reward Structure
This is a single-turn, verifiable reward environment. The agent submits its answer via the answer tool. The answer is verified using the math_verify library for mathematical equivalence against the gold answer. The reward is binary: 1.0 if the answer is correct, 0.0 if incorrect.
We do not use LLM graders for this task.
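The binary reward described above can be illustrated with a minimal sketch. It does not call math_verify; as a stand-in, it compares exact rational values with Python's standard `fractions` module, so it only handles plain numeric strings. The helper names `to_number` and `binary_reward` are hypothetical and are not part of the environment's API.

```python
from fractions import Fraction
from typing import Optional


def to_number(text: str) -> Optional[Fraction]:
    """Parse a plain numeric string such as '1,234', '$18', '0.5' or '3/4'.

    Hypothetical helper: a simplified stand-in for the parsing that
    math_verify performs in the real environment.
    """
    cleaned = text.strip().replace(",", "").replace("$", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        # Not a number this sketch understands (or a zero denominator).
        return None


def binary_reward(submitted: str, gold: str) -> float:
    """Return 1.0 if both strings parse to the same exact value, else 0.0."""
    a, b = to_number(submitted), to_number(gold)
    return 1.0 if a is not None and a == b else 0.0
```

Under these assumptions, `binary_reward("18.0", "18")` and `binary_reward("0.5", "1/2")` both score as correct, while an unparseable submission such as `"eighteen"` scores 0.0. The real verifier may accept a wider range of expressions than this sketch does.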
Data
Data consists of Parquet files (train-00000-of-00001.parquet, test-00000-of-00001.parquet) sourced from the openai/gsm8k HuggingFace dataset. Each record contains a question (the math word problem) and an answer (the gold solution with a final numerical answer). Data files are stored on the OpenReward platform.
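As a rough illustration of the record layout, the sketch below pulls the final numerical answer out of a gold solution. It assumes the upstream openai/gsm8k convention in which the solution text ends with a `####` marker followed by the answer. The sample record is made up for illustration and is not taken from the dataset, and `extract_gold` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed convention: the gold solution ends with "#### <final answer>".
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_gold(answer: str) -> Optional[str]:
    """Return the final answer after the '####' marker, without thousands
    separators, or None if the marker is missing."""
    match = FINAL_ANSWER.search(answer)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Made-up record with the same two fields as the Parquet files.
record = {
    "question": "Tom has 3 boxes of 4 pencils and gives away 2. "
                "How many pencils does he have left?",
    "answer": "Tom starts with 3 * 4 = 12 pencils.\n"
              "After giving away 2 he has 12 - 2 = 10.\n"
              "#### 10",
}

gold = extract_gold(record["answer"])  # "10"
```

Returning `None` instead of raising when the marker is missing lets a caller decide how to handle a malformed record.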
Tools
Agents have access to a single tool:
- answer -- Submit a final numerical answer. The answer is checked for mathematical equivalence against the gold answer using the math_verify library. This tool finishes the episode.
Time Horizon
Single-turn. The agent reads the math problem and submits one answer.
Environment Difficulty
| Model | Accuracy |
|---|---|
| GPT-4 (DUP) | 97.1% |
| Llama 3 405B | 96.8% |
| Claude 3.5 Sonnet | 96.4% |
| GPT-4o | 96.1% |
| Llama 3 70B | 95.1% |
Other Environment Requirements
There are no further environment requirements; GSM8K works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in GSM8K solve grade school math problems in a standard environment. The environment does not present direct safety risks.
Citation
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}