GSM8K

Description

GSM8K is an environment for evaluating how well agents solve grade school math word problems. Based on OpenAI's GSM8K benchmark, it gives agents math word problems that require 2-8 steps of basic arithmetic and asks for the final numerical answer.

Capabilities

Multi-step arithmetic reasoning
Grade school math problem solving
Numerical answer extraction

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There are two splits in this environment:

train: 7,473 grade school math word problems.
test: 1,319 grade school math word problems.

Total: 8,792 tasks. Each task presents a math word problem requiring 2-8 steps of basic arithmetic (addition, subtraction, multiplication, division) to solve.

Reward Structure

This is a single-turn, verifiable reward environment. The agent submits its answer via the answer tool. The answer is verified using the math_verify library for mathematical equivalence against the gold answer. The reward is binary: 1.0 if the answer is correct, 0.0 if incorrect.

We do not use LLM graders for this task.
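The binary reward can be illustrated with a minimal sketch. This is not the environment's grader: the real check uses the math_verify library, whereas the stand-in below compares plain numeric strings with Python's standard `fractions` module. The helper names `to_number` and `binary_reward` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction
from typing import Optional


def to_number(text: str) -> Optional[Fraction]:
    """Parse a plain numeric string ('18', '18.0', '1,000', '$5', '1/2') exactly.

    Stand-in for math_verify parsing; returns None if the text is not a number.
    """
    cleaned = text.strip().replace(",", "").replace("$", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def binary_reward(submitted: str, gold: str) -> float:
    """Return 1.0 if the submitted answer equals the gold answer, else 0.0."""
    submitted_value = to_number(submitted)
    gold_value = to_number(gold)
    if submitted_value is None or gold_value is None:
        return 0.0
    return 1.0 if submitted_value == gold_value else 0.0


if __name__ == "__main__":
    print(binary_reward("18.0", "18"))  # numerically equivalent forms
    print(binary_reward("17", "18"))    # wrong answer
```

Comparing exact fractions rather than floats avoids rounding surprises for answers such as `0.1` versus `1/10`. A real equivalence checker handles far more input forms than this sketch does.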

Data

Data consists of Parquet files (train-00000-of-00001.parquet, test-00000-of-00001.parquet) sourced from the openai/gsm8k HuggingFace dataset. Each record contains a question (the math word problem) and an answer (the gold solution with a final numerical answer). Data files are stored on the OpenReward platform.
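As a sketch of how the gold answer might be read from a record, the snippet below assumes the files keep the upstream openai/gsm8k convention, in which each solution ends with a line of the form `#### <number>`. The function name `extract_gold_answer` and the sample record are made up for illustration and are not taken from the dataset.

```python
import re
from typing import Optional

# Assumption: the final answer follows a '####' marker, as in upstream openai/gsm8k.
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_gold_answer(solution: str) -> Optional[str]:
    """Return the number after '####' with thousands separators removed, or None."""
    match = _FINAL_ANSWER.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


if __name__ == "__main__":
    # Hypothetical record in the question/answer shape described above.
    record = {
        "question": "A box holds 12 pencils. How many pencils are in 4 boxes?",
        "answer": "Each box holds 12 pencils, so 4 boxes hold 12 * 4 = 48 pencils.\n#### 48",
    }
    print(extract_gold_answer(record["answer"]))
```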

Tools

Agents have access to a single tool:

answer -- Submit a final numerical answer. The answer is checked for mathematical equivalence against the gold answer using the math_verify library. This tool finishes the episode.

Time Horizon

Single-turn. The agent reads the math problem and submits one answer.

Environment Difficulty

Model	Accuracy
GPT-4 (DUP)	97.1%
Llama 3 405B	96.8%
Claude 3.5 Sonnet	96.4%
GPT-4o	96.1%
Llama 3 70B	95.1%

Other Environment Requirements

There are no further environment requirements; GSM8K works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in GSM8K solve grade school math problems in a standard environment. The environment does not present direct safety risks.

Citation

@article{cobbe2021gsm8k,
  title={Training Verifiers to Solve Math Word Problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096

1-hour session: $0.1152
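The example figures follow from multiplying the per-second rate by the session length in seconds. A small sketch of that arithmetic, using the rate from the cost table; the function name `session_cost` is hypothetical:

```python
from decimal import Decimal

# Per-second environment rate from the cost table (no sandbox is configured).
ENVIRONMENT_COST_PER_SECOND = Decimal("0.0000320")


def session_cost(seconds: int) -> Decimal:
    """Cost in dollars of an active session lasting the given number of seconds."""
    return ENVIRONMENT_COST_PER_SECOND * seconds


if __name__ == "__main__":
    print(session_cost(5 * 60))   # 5-minute session
    print(session_cost(60 * 60))  # 1-hour session
```

`Decimal` is used instead of floats so that very small per-second amounts multiply without binary rounding error.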
