PHYBench

Description

PHYBench is an environment for evaluating physical perception and robust reasoning in language models. It contains 500 original physics problems spanning mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics. Problems range from high school to Physics Olympiad difficulty and require deriving symbolic LaTeX expressions. Evaluation uses Expression Edit Distance (EED), a continuous scoring metric that provides partial credit based on expression tree similarity.

Capabilities

  • Physics problem solving requiring symbolic derivation
  • Expression Edit Distance scoring for partial credit
  • Coverage of six physics domains at competition difficulty
  • Symbolic LaTeX answer validation

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There is one split in this environment:

  • test: 500 tasks

Tasks span six physics categories:

Category        Description
MECHANICS       Classical motion, forces, oscillations
ELECTRICITY     Electromagnetic fields, circuits, potentials
OPTICS          Wave phenomena, interference, diffraction
THERMODYNAMICS  Heat, entropy, phase transitions
MODERN          Relativistic effects, quantum concepts
ADVANCED        Complex multidisciplinary problems

Reward Structure

Single-turn evaluation with Expression Edit Distance (EED) scoring. The agent submits a symbolic LaTeX answer via the submit_answer tool. The EED score (0-100) measures expression tree similarity:

  • 100: Symbolically equivalent to ground truth
  • 60-100: Minor errors (e.g., coefficient mistakes)
  • <30: Major structural errors
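To make the idea of expression tree similarity concrete, here is a minimal Python sketch. It is not the EED algorithm: the nested-tuple tree encoding and the `positional_distance` function are assumptions introduced for illustration, and the distance is a simplified top-down comparison rather than a true tree edit distance. It only shows why a coefficient slip sits much closer to the ground truth than a structurally wrong answer.

```python
from typing import Tuple, Union

# An expression tree node is either a leaf label (str) or a tuple
# (operator, child_1, child_2, ...). This encoding is an assumption
# made for illustration only.
Tree = Union[str, Tuple]


def label(t: Tree) -> str:
    """Return the node label: the string itself, or the operator of a tuple."""
    return t if isinstance(t, str) else t[0]


def children(t: Tree) -> tuple:
    """Return the child subtrees (empty for a leaf)."""
    return () if isinstance(t, str) else t[1:]


def size(t: Tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in children(t))


def positional_distance(a: Tree, b: Tree) -> int:
    """Simplified top-down distance (hypothetical, not the real EED metric).

    A differing label costs 1, children are compared position by position,
    and any unmatched subtree costs its full node count.
    """
    cost = 0 if label(a) == label(b) else 1
    ca, cb = children(a), children(b)
    for x, y in zip(ca, cb):
        cost += positional_distance(x, y)
    longer = ca if len(ca) > len(cb) else cb
    for extra in longer[min(len(ca), len(cb)):]:
        cost += size(extra)
    return cost


# Ground truth: (1/2) * m * v^2
truth = ("*", ("/", "1", "2"), "m", ("^", "v", "2"))
# Coefficient mistake: (1/3) * m * v^2
coefficient_slip = ("*", ("/", "1", "3"), "m", ("^", "v", "2"))
# Structural mistake: m * v
wrong_structure = ("*", "m", "v")

for name, answer in [("coefficient slip", coefficient_slip),
                     ("wrong structure", wrong_structure)]:
    d = positional_distance(truth, answer)
    print(f"{name}: distance {d} of {size(truth)} nodes")
```

In this toy setup the coefficient slip differs from the ground truth in one node out of eight, while the structurally wrong answer differs in seven, which matches the intuition behind awarding partial credit.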

The reward is the EED score normalized to 0.0-1.0.
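The normalization can be sketched in a few lines of Python. The function name `eed_to_reward` is hypothetical, and clamping out-of-range scores is an assumption; the source only states that the 0-100 score is mapped to 0.0-1.0.

```python
def eed_to_reward(eed_score: float) -> float:
    """Map an EED score on the 0-100 scale to a reward in [0.0, 1.0].

    Clamping is an assumption added for safety; it is not documented
    behavior of the environment.
    """
    clamped = max(0.0, min(100.0, eed_score))
    return clamped / 100.0


# Example: the human-baseline EED score of 70.4 corresponds to a reward of about 0.704.
print(eed_to_reward(70.4))
```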

Data

The data file, phybench_data.parquet (500 problems), is sourced from the Hugging Face dataset Eureka-Lab/PHYBench and stored on the OpenReward platform.

Tools

Tool           Description
submit_answer  Submit a symbolic LaTeX expression. Evaluated via EED scoring with partial credit. Ends the episode.

Time Horizon

Single-turn. The agent reads the physics problem and submits one symbolic answer.
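The single-turn flow can be pictured with a small Python mock. This is a sketch under stated assumptions, not the OpenReward client: the `SingleTurnEpisode` class, its fields, and the placeholder scorer are all hypothetical, and only the tool name `submit_answer` and the one-submission rule come from the description above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SingleTurnEpisode:
    """Hypothetical stand-in for one PHYBench task; not the OpenReward API."""
    problem: str
    scorer: Callable[[str], float]  # returns an EED-style score in 0-100
    done: bool = False

    def submit_answer(self, latex: str) -> float:
        """Score the answer once, end the episode, and return the reward."""
        if self.done:
            raise RuntimeError("episode already ended: only one submission is allowed")
        self.done = True
        return self.scorer(latex) / 100.0  # normalize 0-100 to 0.0-1.0


def exact_match_scorer(answer: str) -> float:
    """Placeholder scorer: full marks for an exact string match, zero otherwise.

    The real environment uses EED scoring, which also awards partial credit.
    """
    return 100.0 if answer == r"\frac{1}{2} m v^2" else 0.0


episode = SingleTurnEpisode(
    problem="What is the kinetic energy of a mass m moving at speed v?",
    scorer=exact_match_scorer,
)
reward = episode.submit_answer(r"\frac{1}{2} m v^2")
print(reward, episode.done)
```

A second call to `submit_answer` on the same episode raises an error in this mock, mirroring the rule that the tool ends the episode.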

Environment Difficulty

PHYBench evaluates physical perception and robust reasoning. Models significantly underperform compared to human experts:

Model                          Accuracy  EED Score
Human Baseline (PKU students)  61.9%     70.4
Gemini 2.5 Pro                 36.9%     ~50
DeepSeek-R1                    ~30%      ~45
o3-mini (high)                 ~30%      ~45
Other SOTA LLMs                20-35%    30-45

The 25-percentage-point accuracy gap between the human baseline (61.9%) and the best-performing model listed, Gemini 2.5 Pro (36.9%), demonstrates the benchmark's difficulty.

Other Environment Requirements

There are no further environment requirements; PHYBench works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in PHYBench solve physics derivation problems in a standard environment. The environment does not present direct safety risks.

Citation

@article{qiu2025phybench,
  title={PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models},
  author={Qiu, Shi and Guo, Shaoyang and Song, Zhuo-Yang and Sun, Yunbo and Cai, Zeyu and Wei, Jiashen and Luo, Tianyu and Yin, Yixuan and Zhang, Haoxu and Hu, Yi and Wang, Chenyang and Tang, Chencheng and Chang, Haoling and Liu, Qi and Zhou, Ziheng and Zhang, Tianyu and Zhang, Jingtian and Liu, Zhangyi and Li, Minghao and Zhang, Yuku and Jing, Boxuan and Yin, Xianqi and Ren, Yutong and Fu, Zizhuo and Ji, Jiaming and Wang, Weike and Tian, Xudong and Lv, Anqi and Man, Laifu and Li, Jianxiang and Tao, Feiyu and Sun, Qihua and Liang, Zhou and Mu, Yushu and Li, Zhongxuan and Zhang, Jing-Jun and Zhang, Shutao and Li, Xiaotian and Xia, Xingqi and Lin, Jiawei and Shen, Zheyu and Chen, Jiahang and Xiong, Qiuhao and Wang, Binran and Wang, Fengyuan and Ni, Ziyang and Zhang, Bohan and Cui, Fan and Shao, Changkun and Cao, Qing-Hong and Luo, Ming-xing and Yang, Yaodong and Zhang, Muhan and Zhu, Hua Xing},
  journal={arXiv preprint arXiv:2504.16074},
  year={2025}
}