ProcBench
Description
ProcBench is a multi-step procedural reasoning benchmark that evaluates whether language models can follow explicit step-by-step procedures to produce correct output. The benchmark covers 23 procedural task types -- including string manipulation, list operations, and arithmetic -- with 240 examples each, for a total of 5,520 tasks.
Capabilities
- Multi-step procedural reasoning
- Instruction following
- Algorithmic execution
Compute Requirements
No sandbox is needed. ProcBench uses standard defaults.
License
Tasks
There are 5,520 tasks in a single test split, spanning 23 procedural task types (task01 through task23) with 240 examples each. Each task provides detailed procedural instructions, an initial input state, expected intermediate states, and a final expected answer.
Reward Structure
Binary reward: 1.0 for a correct answer, 0.0 otherwise. Answers are graded by an LLM (gpt-5-mini) that compares the submitted answer to the expected answer, tolerating minor formatting differences such as spacing, capitalization, and equivalent numeric representations. If the grader fails, the environment falls back to exact string matching. No temperature parameter is passed to the grader.
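The grading flow above (LLM judgment with an exact-match fallback) can be sketched as follows. This is an illustrative sketch, not the environment's actual implementation; the `llm_grader` callable and function names are hypothetical.

```python
def grade(submitted: str, expected: str, llm_grader=None) -> float:
    """Return a binary reward: 1.0 for correct, 0.0 for incorrect.

    If an LLM grader is available, use its judgment; if the grader
    raises, fall back to exact string match (as described above).
    """
    if llm_grader is not None:
        try:
            # llm_grader is a hypothetical callable returning True/False
            return 1.0 if llm_grader(submitted, expected) else 0.0
        except Exception:
            pass  # grader failure: fall back to exact match
    return 1.0 if submitted == expected else 0.0
```

The fallback is deliberately strict: without the LLM, formatting differences that the grader would forgive (extra spacing, capitalization) count as incorrect.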
Data
The dataset (procbench_data.parquet, approximately 1.7 MB) is sourced from HuggingFace ifujisawa/procbench and stored on the OpenReward platform. Each record contains the fields: prompt, task_name, example_name, problem_name, init, final, and intermediate.
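For illustration, one record's layout can be modeled with the fields listed above. The types here are assumptions; the parquet file's actual storage types may differ.

```python
from dataclasses import dataclass

@dataclass
class ProcBenchRecord:
    """Illustrative layout of one ProcBench example (field names from the dataset)."""
    prompt: str        # full procedural instructions shown to the agent
    task_name: str     # one of task01 through task23
    example_name: str  # identifier of the example within the task
    problem_name: str  # identifier of the underlying problem
    init: str          # initial input state
    final: str         # expected final answer
    intermediate: list # expected intermediate states, step by step
```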
Tools
| Tool | Parameters | Description |
|---|---|---|
| submit_answer | answer: str | Submit final answer after completing all procedural steps. Ends the episode. |
Time Horizon
Single-turn. The agent receives the procedural prompt and makes exactly one tool call, submit_answer, which ends the episode.
Environment Difficulty
The original paper evaluates frontier models on ProcBench, reporting Prefix Accuracy (PA) and Sequential Match (SM):
| Model | PA | SM |
|---|---|---|
| o1-preview | 69.8% | 49.6% |
| o1-mini | 64.1% | 43.2% |
| GPT-4o | 44.3% | 27.8% |
| Claude-3.5-Sonnet | 37.8% | 23.0% |
| Mistral-Large | 36.2% | 22.9% |
| GPT-4o-mini | 23.0% | 9.3% |
| Gemini-1.5-Pro | 22.4% | 11.0% |
Performance degrades significantly with task complexity. On longer tasks (17-25 steps), even o1-preview drops to 59.9% PA.
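A minimal sketch of the two metrics, assuming the common reading of the paper's definitions: PA rewards the longest matching prefix of the predicted step sequence against the reference (normalized by the longer sequence), and SM is 1 only when the sequences match exactly. Consult the paper for the precise formulations.

```python
def prefix_accuracy(pred: list, gold: list) -> float:
    """Length of the longest matching prefix, normalized by the longer sequence.

    Assumed definition for illustration; see the ProcBench paper for the
    exact formula.
    """
    match = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        match += 1
    denom = max(len(pred), len(gold))
    return match / denom if denom else 1.0

def sequential_match(pred: list, gold: list) -> int:
    """1 only if every step (and the overall length) matches exactly."""
    return int(pred == gold)
```

Under these definitions SM is strictly harder than PA, which matches the table above: every model's SM score is below its PA score.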
Other Environment Requirements
Requires an openai_api_key secret for gpt-5-mini answer evaluation.
Safety
No safety concerns. The agent processes algorithmic instructions and text. There is no access to external systems, sandboxes, or sensitive data.
Citations
```bibtex
@misc{fujisawa2024procbench,
  title={ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure},
  author={Ippei Fujisawa and Sensho Nobe and Hiroki Seto and Rina Onda and Yoshiaki Uchida and Hiroki Ikoma and Pei-Chun Chien and Ryota Kanai},
  year={2024},
  eprint={2410.03117},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
```