PHYBench
Description
PHYBench is an environment for evaluating physical perception and robust reasoning in language models. It contains 500 original physics problems spanning mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics. Problems range from high school to Physics Olympiad difficulty and require deriving symbolic LaTeX expressions. Evaluation uses Expression Edit Distance (EED), a continuous scoring metric that provides partial credit based on expression tree similarity.
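To make the idea of partial credit from expression tree similarity concrete, here is a minimal sketch. It is not the benchmark's EED implementation: the `Node` class, the `toy_distance` function, and the positional child alignment are simplifying assumptions chosen for illustration only. It shows that a coefficient slip changes one node, while a structurally wrong answer changes most of the tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One operator or symbol in an expression tree (hypothetical helper)."""
    label: str
    children: tuple = ()


def size(tree: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree.children)


def toy_distance(a: Node, b: Node) -> int:
    """Simplified top-down distance: relabel cost plus unmatched subtrees.

    Children are aligned by position, so this is only an upper bound on a
    true tree edit distance, and it is NOT the actual EED algorithm.
    """
    cost = 0 if a.label == b.label else 1
    for child_a, child_b in zip(a.children, b.children):
        cost += toy_distance(child_a, child_b)
    # Children without a partner count as whole-subtree insertions/deletions.
    longer = a.children if len(a.children) > len(b.children) else b.children
    matched = min(len(a.children), len(b.children))
    cost += sum(size(child) for child in longer[matched:])
    return cost


def relative_distance(ground_truth: Node, answer: Node) -> float:
    """Distance scaled by the size of the ground-truth tree."""
    return toy_distance(ground_truth, answer) / size(ground_truth)


# Ground truth: (1/2) * m * v^2
v_squared = Node("^", (Node("v"), Node("2")))
half_mv2 = Node("*", (Node("1/2"), Node("m"), v_squared))
coeff_error = Node("*", (Node("1"), Node("m"), v_squared))  # m * v^2
wrong_shape = Node("+", (Node("m"), Node("v")))             # m + v

print(relative_distance(half_mv2, coeff_error))  # one relabelled node out of six
print(relative_distance(half_mv2, wrong_shape))  # every node changed or removed
```

A continuous metric can map a small relative distance to a high score and a large one to a low score, which is the behaviour a binary correct/incorrect check cannot provide.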
Capabilities
- Physics problem solving requiring symbolic derivation
- Expression Edit Distance scoring for partial credit
- Coverage of six physics domains, from high school to Physics Olympiad difficulty
- Symbolic LaTeX answer validation
Compute Requirements
Agents are given a standard environment with no sandbox or file system access.
License
MIT.
Tasks
There is one split in this environment:
- test: 500 tasks
Tasks span six physics categories:
| Category | Description |
|---|---|
| MECHANICS | Classical motion, forces, oscillations |
| ELECTRICITY | Electromagnetic fields, circuits, potentials |
| OPTICS | Wave phenomena, interference, diffraction |
| THERMODYNAMICS | Heat, entropy, phase transitions |
| MODERN | Relativistic effects, quantum concepts |
| ADVANCED | Complex multidisciplinary problems |
Reward Structure
Single-turn evaluation with Expression Edit Distance (EED) scoring. The agent submits a symbolic LaTeX answer via the submit_answer tool. The EED score (0-100) measures expression tree similarity:
- 100: Symbolically equivalent to ground truth
- 60-100: Minor errors (e.g., coefficient mistakes)
- <30: Major structural errors
The reward is the EED score normalized to 0.0-1.0.
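The normalization step can be sketched as follows. The function name `eed_to_reward` is hypothetical, and rejecting out-of-range scores is an assumption about sensible behaviour, not something the environment documents.

```python
def eed_to_reward(eed_score: float) -> float:
    """Map an EED score on the 0-100 scale to a reward in [0.0, 1.0]."""
    if not 0.0 <= eed_score <= 100.0:
        # Assumption: scores outside the documented range indicate a bug.
        raise ValueError(f"EED score out of range: {eed_score}")
    return eed_score / 100.0


print(eed_to_reward(100))  # symbolically equivalent answer -> 1.0
print(eed_to_reward(45))   # partial credit -> 0.45
```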
Data
The data file phybench_data.parquet (500 problems) is sourced from the Hugging Face dataset Eureka-Lab/PHYBench and stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| submit_answer | Submit a symbolic LaTeX expression. Evaluated via EED scoring with partial credit. Ends the episode. |
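As a rough illustration of a submission, the sketch below builds a tool-call payload for submit_answer. The payload layout and the `answer` argument name are assumptions, since the tool's schema is not given here, and `build_submit_answer_call` is a hypothetical helper. The practical point is the raw string: it keeps LaTeX backslashes intact when the answer is written in Python source.

```python
import json


def build_submit_answer_call(latex_answer: str) -> dict:
    """Build a tool-call payload (the 'answer' argument name is assumed)."""
    return {
        "name": "submit_answer",
        "arguments": {"answer": latex_answer.strip()},
    }


# Without the r prefix, "\f" in "\frac" would be read as a form-feed character.
call = build_submit_answer_call(r"\frac{1}{2} m v^{2}")
payload = json.dumps(call)
print(payload)
```

Because the tool ends the episode, there is only one submission per problem, so the expression should be fully simplified before the call is made.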
Time Horizon
Single-turn. The agent reads the physics problem and submits one symbolic answer.
Environment Difficulty
PHYBench evaluates physical perception and robust reasoning. Models significantly underperform compared to human experts:
| Model | Accuracy | EED Score |
|---|---|---|
| Human Baseline (PKU students) | 61.9% | 70.4 |
| Gemini 2.5 Pro | 36.9% | ~50 |
| DeepSeek-R1 | ~30% | ~45 |
| o3-mini (high) | ~30% | ~45 |
| Other SOTA LLMs | 20-35% | 30-45 |
The 25 percentage point accuracy gap between the human baseline (61.9%) and the best listed model (Gemini 2.5 Pro, 36.9%) demonstrates the benchmark's difficulty.
Other Environment Requirements
There are no further environment requirements; PHYBench works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in PHYBench solve physics derivation problems in a standard environment. The environment does not present direct safety risks.
Citation
@article{qiu2025phybench,
title={PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models},
author={Qiu, Shi and Guo, Shaoyang and Song, Zhuo-Yang and Sun, Yunbo and Cai, Zeyu and Wei, Jiashen and Luo, Tianyu and Yin, Yixuan and Zhang, Haoxu and Hu, Yi and Wang, Chenyang and Tang, Chencheng and Chang, Haoling and Liu, Qi and Zhou, Ziheng and Zhang, Tianyu and Zhang, Jingtian and Liu, Zhangyi and Li, Minghao and Zhang, Yuku and Jing, Boxuan and Yin, Xianqi and Ren, Yutong and Fu, Zizhuo and Ji, Jiaming and Wang, Weike and Tian, Xudong and Lv, Anqi and Man, Laifu and Li, Jianxiang and Tao, Feiyu and Sun, Qihua and Liang, Zhou and Mu, Yushu and Li, Zhongxuan and Zhang, Jing-Jun and Zhang, Shutao and Li, Xiaotian and Xia, Xingqi and Lin, Jiawei and Shen, Zheyu and Chen, Jiahang and Xiong, Qiuhao and Wang, Binran and Wang, Fengyuan and Ni, Ziyang and Zhang, Bohan and Cui, Fan and Shao, Changkun and Cao, Qing-Hong and Luo, Ming-xing and Yang, Yaodong and Zhang, Muhan and Zhu, Hua Xing},
journal={arXiv preprint arXiv:2504.16074},
year={2025}
}