PrincipiaBench

Description

PrincipiaBench is an evaluation environment for STEM mathematical derivation. Agents solve 2,241 curated problems sourced from SuperGPQA, RealMath, Physics, and ARB, and submit symbolic/mathematical answers graded by a majority-vote LLM equivalence judge.

Capabilities

Mathematical derivation and symbolic reasoning
Physics problem solving
STEM knowledge across multiple domains

Compute Requirements

Requires OpenAI API access for LLM-based equivalence judging (3 calls per submission for majority voting).

License

See the facebook/principia-bench dataset card for license details.

Tasks

test: 2,241 curated evaluation problems
Sources: SuperGPQA (1,452), RealMath (632), Physics (110), ARB (47)
Each task has: id, problem_statement, source_data
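As a quick consistency check, the per-source counts listed above sum to the stated total of 2,241, and at 3 judge calls per submission one full pass over the test split implies 6,723 judge calls. The sketch below works through that arithmetic. The `Task` dataclass is a hypothetical container that mirrors the three listed fields; the field types are assumptions, not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Any

# Per-source problem counts as listed in this README.
SOURCE_COUNTS = {"SuperGPQA": 1452, "RealMath": 632, "Physics": 110, "ARB": 47}
JUDGE_CALLS_PER_SUBMISSION = 3  # from the Compute Requirements section


@dataclass
class Task:
    """Hypothetical container for one task; field types are assumptions."""
    id: str
    problem_statement: str
    source_data: Any


total_problems = sum(SOURCE_COUNTS.values())
total_judge_calls = total_problems * JUDGE_CALLS_PER_SUBMISSION

print(total_problems)     # 2241
print(total_judge_calls)  # 6723
```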

Reward Structure

Binary reward (0.0 or 1.0). Answers are graded by majority vote of 3 independent LLM equivalence judge calls comparing the candidate answer to the ground truth.
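The voting rule described above can be sketched as follows. This is a minimal illustration, not the environment's implementation: the `Judge` type and the stand-in judge functions are hypothetical, and plain string comparisons replace the real LLM calls so the logic can be run offline.

```python
from typing import Callable, Sequence

# A judge takes (candidate, ground_truth) and returns True if it considers
# them equivalent. In the real environment this is an LLM call; here it is
# any callable, which is an assumption made for illustration.
Judge = Callable[[str, str], bool]


def majority_vote_reward(candidate: str, ground_truth: str,
                         judges: Sequence[Judge]) -> float:
    """Return 1.0 if a strict majority of judges vote 'equivalent', else 0.0."""
    votes = [judge(candidate, ground_truth) for judge in judges]
    return 1.0 if sum(votes) > len(votes) / 2 else 0.0


# Stand-in judges for illustration only (not the environment's judge).
def strict(candidate: str, truth: str) -> bool:
    return candidate.strip() == truth.strip()


def lenient(candidate: str, truth: str) -> bool:
    return candidate.replace(" ", "") == truth.replace(" ", "")


def always_no(candidate: str, truth: str) -> bool:
    return False


print(majority_vote_reward("x + 1", "x+1", [strict, lenient, lenient]))    # 1.0
print(majority_vote_reward("x + 1", "x+1", [strict, lenient, always_no]))  # 0.0
```

With three votes, two agreeing judges are enough to decide the outcome, so an odd number of calls guarantees there is never a tie.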

Data

Source: facebook/principia-bench on HuggingFace. 2,241 problems stored as a single parquet file. Mounted at /orwd_data in production.
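A minimal sketch of reading the mounted data, assuming the parquet file sits directly under the mount directory. The file name is not given here, so the helper locates it by extension instead of hard-coding a name. The function names are hypothetical, and `load_problems` assumes pandas with a parquet engine such as pyarrow is installed.

```python
from pathlib import Path


def find_single_parquet(data_dir: str) -> Path:
    """Return the only .parquet file directly under data_dir, or raise."""
    matches = sorted(Path(data_dir).glob("*.parquet"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one parquet file in {data_dir}, found {len(matches)}"
        )
    return matches[0]


def load_problems(data_dir: str = "/orwd_data"):
    """Load all problems into a pandas DataFrame (assumes pandas + pyarrow)."""
    import pandas as pd  # imported lazily so the helper above has no dependencies
    return pd.read_parquet(find_single_parquet(data_dir))
```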

Tools

submit(answer: str) — Submit a mathematical answer for grading. Ends the episode.

Time Horizon

Single-turn. One tool call per episode.
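The single-turn contract can be modeled as an episode object that accepts exactly one `submit` call. The class below is a hypothetical sketch of that behavior, not the environment's actual interface; the `grade` callable stands in for the LLM judge.

```python
class EpisodeFinished(RuntimeError):
    """Raised when submit() is called after the episode has ended."""


class SingleTurnEpisode:
    """Hypothetical model of one episode: exactly one submit() call is allowed."""

    def __init__(self, problem_statement: str, grade):
        self.problem_statement = problem_statement
        self._grade = grade  # callable: answer -> reward (0.0 or 1.0)
        self.done = False
        self.reward = None

    def submit(self, answer: str) -> float:
        """Grade the answer, record the reward, and end the episode."""
        if self.done:
            raise EpisodeFinished("submit() already called; episode is over")
        self.reward = self._grade(answer)
        self.done = True
        return self.reward


# Toy grader for illustration: exact match against a known answer.
episode = SingleTurnEpisode(
    "Compute 2 + 2.",
    grade=lambda answer: 1.0 if answer.strip() == "4" else 0.0,
)
print(episode.submit("4"))  # 1.0
print(episode.done)         # True
```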

Environment Difficulty

Olympiad to graduate-level STEM problems across physics, mathematics, and related domains.

Other Environment Requirements

OpenAI API key.

Safety

No safety concerns — environment grades mathematical derivations only.

Citations

@misc{aggarwal2026reasoningmathematicalobjects,
      title={Reasoning over mathematical objects: on-policy reward modeling and test time aggregation},
      author={Pranjal Aggarwal and Marjan Ghazvininejad and Seungone Kim and Ilia Kulikov and Jack Lanchantin and Xian Li and Tianjian Li and Bo Liu and Graham Neubig and Anaelia Ovalle and Swarnadeep Saha and Sainbayar Sukhbaatar and Sean Welleck and Jason Weston and Chenxi Whitehouse and Adina Williams and Jing Xu and Ping Yu and Weizhe Yuan and Jingyu Zhang and Wenting Zhao},
      year={2026},
      eprint={2603.18886},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}

Repository

Source repository

EnvCommons/PrincipiaBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096

1-hour session: $0.1152
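The example figures follow directly from the per-second rate multiplied by the session length in seconds. A small sketch, using `Decimal` to avoid floating-point rounding in currency arithmetic (the function name is illustrative):

```python
from decimal import Decimal

# Per-second rate from the cost table (environment only; no sandbox configured).
RATE_PER_SECOND = Decimal("0.0000320")


def session_cost(seconds: int) -> Decimal:
    """Cost in USD of an active session lasting `seconds` seconds."""
    return RATE_PER_SECOND * seconds


print(session_cost(5 * 60))   # 0.0096000
print(session_cost(60 * 60))  # 0.1152000
```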
