ProcBench


Description

ProcBench is a multi-step procedural reasoning benchmark that evaluates whether language models can follow explicit step-by-step procedures to produce correct output. The benchmark covers 23 procedural task types -- including string manipulation, list operations, and arithmetic -- with 240 examples each, for a total of 5,520 tasks.

Capabilities

  • Multi-step procedural reasoning
  • Instruction following
  • Algorithmic execution

Compute Requirements

No sandbox is needed. ProcBench uses standard defaults.

License

CC-BY-4.0.

Tasks

There are 5,520 tasks in a single test split, spanning 23 procedural task types (task01 through task23) with 240 examples each. Each task provides detailed procedural instructions, an initial input state, expected intermediate states, and a final expected answer.

Reward Structure

Binary reward: 1.0 for correct, 0.0 for incorrect. Answers are graded by an LLM (gpt-5-mini) that compares the submitted answer to the expected answer, accounting for minor formatting differences such as spacing, capitalization, and equivalent numeric representations. On grader failure, the environment falls back to exact string match.

No temperature parameter is passed to the grader.
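The grading flow above can be sketched as follows. This is a minimal illustration, not the environment's actual implementation: `llm_judge` is a hypothetical callable standing in for the gpt-5-mini comparison.

```python
def grade(submitted: str, expected: str, llm_judge=None) -> float:
    """Binary reward: 1.0 for correct, 0.0 for incorrect.

    llm_judge is a hypothetical stand-in for the gpt-5-mini comparison;
    if it raises, grading falls back to exact string match.
    """
    if llm_judge is not None:
        try:
            return 1.0 if llm_judge(submitted, expected) else 0.0
        except Exception:
            pass  # grader failure: fall back to exact string match
    return 1.0 if submitted == expected else 0.0
```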

Data

The dataset (procbench_data.parquet, approximately 1.7 MB) is sourced from HuggingFace ifujisawa/procbench and stored on the OpenReward platform. Each record contains the fields: prompt, task_name, example_name, problem_name, init, final, and intermediate.
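A record with the fields listed above might look like the following sketch. The values are invented for illustration and are not taken from the dataset.

```python
# Fields every dataset row is expected to carry, per the list above.
REQUIRED_FIELDS = {"prompt", "task_name", "example_name",
                   "problem_name", "init", "final", "intermediate"}

# Illustrative row; field values are invented examples, not real data.
record = {
    "prompt": "Follow the procedure below and report the final string.",
    "task_name": "task01",
    "example_name": "example_0001",
    "problem_name": "task01_example_0001",
    "init": "abcdef",
    "final": "fedcba",
    "intermediate": ["abcdef", "fedcba"],
}

def has_required_fields(row: dict) -> bool:
    """True when a row carries every field listed in the README."""
    return REQUIRED_FIELDS.issubset(row)
```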

Tools

Tool            Parameters     Description
submit_answer   answer: str    Submit final answer after completing all procedural steps. Ends the episode.
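In a function-calling API, the tool might be declared with a JSON schema along these lines. This is a sketch of one plausible declaration, not the environment's actual schema.

```python
# Hypothetical declaration of the submit_answer tool in the common
# function-calling schema shape; the real environment may differ.
submit_answer_tool = {
    "type": "function",
    "function": {
        "name": "submit_answer",
        "description": ("Submit final answer after completing all "
                        "procedural steps. Ends the episode."),
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The final answer string.",
                },
            },
            "required": ["answer"],
        },
    },
}
```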

Time Horizon

Single-turn. The agent receives the procedural prompt and ends the episode with a single submit_answer call containing its final answer, for a total of one tool call.

Environment Difficulty

The original paper evaluates frontier models on ProcBench (Prefix Accuracy / Sequential Match):

Model               PA      SM
o1-preview          69.8%   49.6%
o1-mini             64.1%   43.2%
GPT-4o              44.3%   27.8%
Claude-3.5-Sonnet   37.8%   23.0%
Mistral-Large       36.2%   22.9%
GPT-4o-mini         23.0%    9.3%
Gemini-1.5-Pro      22.4%   11.0%

Performance degrades significantly with task complexity. On longer tasks (17-25 steps), even o1-preview drops to 59.9% PA.
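The two metrics can be sketched as follows. This is one plausible formulation of prefix accuracy over a sequence of intermediate states; the paper defines the exact normalization.

```python
def prefix_accuracy(pred: list, gold: list) -> float:
    """Length of the matching prefix over the longer sequence length
    (a sketch; the paper's exact normalization may differ)."""
    match = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        match += 1
    denom = max(len(pred), len(gold))
    return match / denom if denom else 1.0

def sequential_match(pred: list, gold: list) -> float:
    """1.0 only when the full step sequence matches exactly."""
    return 1.0 if pred == gold else 0.0
```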

Other Environment Requirements

Requires an openai_api_key secret for gpt-5-mini answer evaluation.

Safety

No safety concerns. The agent processes algorithmic instructions and text. There is no access to external systems, sandboxes, or sensitive data.

Citations

@misc{fujisawa2024procbench,
  title={ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure},
  author={Ippei Fujisawa and Sensho Nobe and Hiroki Seto and Rina Onda and Yoshiaki Uchida and Hiroki Ikoma and Pei-Chun Chien and Ryota Kanai},
  year={2024},
  eprint={2410.03117},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}