PrincipiaBench
Description
PrincipiaBench is an evaluation environment for STEM mathematical derivation. Agents solve 2,241 curated problems sourced from SuperGPQA, RealMath, Physics, and ARB, and submit symbolic or mathematical answers, which a majority-vote LLM equivalence judge grades against the ground truth.
Capabilities
- Mathematical derivation and symbolic reasoning
- Physics problem solving
- STEM knowledge across multiple domains
Compute Requirements
Requires OpenAI API access for LLM-based equivalence judging (3 calls per submission for majority voting).
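As a rough budget, a single pass over all 2,241 problems needs the following number of judge calls. This assumes exactly one submission per problem and no retried API calls:

```latex
N_{\text{calls}} = 3 \times 2{,}241 = 6{,}723
```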
License
See the facebook/principia-bench dataset card for license details.
Tasks
- test: 2,241 curated evaluation problems
- Sources: SuperGPQA (1,452), RealMath (632), Physics (110), ARB (47)
- Each task has the fields id, problem_statement, and source_data.
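A task record can be sketched as a Python dataclass. The field names come from the list above. The field types, in particular treating source_data as a dict, are assumptions because the card does not document them. The per-source counts are included to confirm that they add up to the stated total.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Task:
    """One PrincipiaBench problem. Field types are assumed, not documented."""
    id: str
    problem_statement: str
    source_data: Dict[str, Any] = field(default_factory=dict)  # assumed to be a mapping


# Per-source problem counts, as listed in the Tasks section.
SOURCE_COUNTS = {
    "SuperGPQA": 1452,
    "RealMath": 632,
    "Physics": 110,
    "ARB": 47,
}

TOTAL = sum(SOURCE_COUNTS.values())  # should equal the 2,241 problems in the test split


def source_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each source's fraction of the total problem count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

SuperGPQA accounts for roughly two thirds of the problems, so aggregate scores weigh most heavily on it.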
Reward Structure
Binary reward (0.0 or 1.0). Answers are graded by majority vote of 3 independent LLM equivalence judge calls comparing the candidate answer to the ground truth.
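The majority-vote rule can be sketched as follows. This is a minimal illustration, not the environment's grader. The judge is any hypothetical callable that returns True when it deems the candidate equivalent to the ground truth. In the real environment each vote is an OpenAI API call whose model and prompt are not documented here.

```python
from typing import Callable

# Hypothetical judge signature: (candidate, ground_truth) -> True if deemed equivalent.
# In the real environment each call would be an OpenAI API request; that part is omitted here.
Judge = Callable[[str, str], bool]


def majority_vote_reward(
    candidate: str,
    ground_truth: str,
    judge: Judge,
    n_votes: int = 3,
) -> float:
    """Return 1.0 if a strict majority of independent judge calls say 'equivalent', else 0.0."""
    if n_votes < 1 or n_votes % 2 == 0:
        raise ValueError("n_votes must be a positive odd number so a vote cannot tie")
    votes = [judge(candidate, ground_truth) for _ in range(n_votes)]
    # True counts as 1 and False as 0, so sum(votes) is the number of "equivalent" votes.
    return 1.0 if sum(votes) > n_votes // 2 else 0.0
```

With three votes, two agreeing judgments decide the reward, so a single noisy judge call cannot flip the outcome on its own.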
Data
Source: facebook/principia-bench on HuggingFace. 2,241 problems stored as a single parquet file. Mounted at /orwd_data in production.
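A minimal sketch of locating and loading the data under the production mount follows. It assumes that the mount contains exactly one file with a .parquet extension somewhere below /orwd_data, since the file name is not documented here. The helper names find_parquet and load_problems are hypothetical. Loading uses pandas.read_parquet, which needs pandas plus a parquet engine such as pyarrow.

```python
from pathlib import Path

DATA_ROOT = Path("/orwd_data")  # production mount point from the Data section


def find_parquet(root: Path = DATA_ROOT) -> Path:
    """Return the single .parquet file under root, searched recursively (hypothetical helper)."""
    files = sorted(root.rglob("*.parquet"))
    if len(files) != 1:
        raise FileNotFoundError(
            f"expected exactly one .parquet file under {root}, found {len(files)}"
        )
    return files[0]


def load_problems(root: Path = DATA_ROOT):
    """Load the problems into a pandas DataFrame (requires pandas and a parquet engine)."""
    import pandas as pd  # imported lazily so find_parquet works without pandas installed

    return pd.read_parquet(find_parquet(root))
```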
Tools
submit(answer: str): Submit a mathematical answer for grading. Ends the episode.
Time Horizon
Single-turn. One tool call per episode.
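The single-turn contract can be modeled as below. SingleTurnEpisode is a hypothetical class, not the environment's actual API. The grader is injected as a callable, for example a majority-vote judge returning 0.0 or 1.0. The exact-match grader in the usage note is only a stand-in for the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SingleTurnEpisode:
    """Hypothetical model of an episode: one submit() call grades the answer and ends it."""
    ground_truth: str
    grade: Callable[[str, str], float]  # (answer, ground_truth) -> 0.0 or 1.0
    done: bool = False
    reward: Optional[float] = None

    def submit(self, answer: str) -> float:
        """Grade the answer, mark the episode as finished, and return the binary reward."""
        if self.done:
            raise RuntimeError("episode already ended; submit() may be called only once")
        self.reward = self.grade(answer, self.ground_truth)
        self.done = True
        return self.reward
```

For example, `SingleTurnEpisode("42", lambda a, g: 1.0 if a.strip() == g else 0.0).submit(" 42 ")` returns 1.0. In this sketch, if grading raises (for instance on an API error) the episode stays open; whether the real environment behaves that way is not documented.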
Environment Difficulty
Olympiad to graduate-level STEM problems across physics, mathematics, and related domains.
Other Environment Requirements
- OpenAI API key.
Safety
No safety concerns: the environment grades mathematical derivations only.
Citations
@misc{aggarwal2026reasoningmathematicalobjects,
title={Reasoning over mathematical objects: on-policy reward modeling and test time aggregation},
author={Pranjal Aggarwal and Marjan Ghazvininejad and Seungone Kim and Ilia Kulikov and Jack Lanchantin and Xian Li and Tianjian Li and Bo Liu and Graham Neubig and Anaelia Ovalle and Swarnadeep Saha and Sainbayar Sukhbaatar and Sean Welleck and Jason Weston and Chenxi Whitehouse and Adina Williams and Jing Xu and Ping Yu and Weizhe Yuan and Jingyu Zhang and Wenting Zhao},
year={2026},
eprint={2603.18886},
archivePrefix={arXiv},
primaryClass={cs.AI},
}