Implementation of

arXiv/phybench

PHYBench

Description

PHYBench is an environment for evaluating physical perception and robust reasoning in language models. It contains 500 original physics problems spanning mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics. Problems range from high school to Physics Olympiad difficulty and require deriving symbolic LaTeX expressions. Evaluation uses Expression Edit Distance (EED), a continuous scoring metric that provides partial credit based on expression tree similarity.
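To make "expression tree similarity" concrete, the sketch below computes a simple top-down (Selkow-style) edit distance between two hand-built expression trees. It is an illustrative stand-in, not the PHYBench EED implementation: the `node` helper, the tuple encoding, and the unit edit costs are assumptions made for this example, and the real metric works on LaTeX answers with its own scoring rule, neither of which is reproduced here.

```python
from functools import lru_cache


def node(label, *children):
    """Hypothetical helper: a tree is (label, (child, child, ...))."""
    return (label, tuple(children))


def size(tree):
    """Number of nodes in the tree (the cost of inserting or deleting it)."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def tree_distance(a, b):
    """Top-down edit distance between two ordered trees.

    Roots are always matched (cost 1 if their labels differ); the child
    lists are then aligned like strings, where substituting one child for
    another costs their recursive distance and inserting or deleting a
    child costs the size of its whole subtree.
    """
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                     # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),                     # insert subtree
                dp[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),  # match
            )
    return relabel + dp[len(xs)][len(ys)]


# Ground truth: (1/2) * m * v^2
truth = node("Mul", node("1/2"), node("m"), node("Pow", node("v"), node("2")))
# Coefficient slip: 2 * m * v^2
coeff_slip = node("Mul", node("2"), node("m"), node("Pow", node("v"), node("2")))
# Structural error: m * v
wrong_form = node("Mul", node("m"), node("v"))

print(tree_distance(truth, truth))       # 0
print(tree_distance(truth, coeff_slip))  # 1
print(tree_distance(truth, wrong_form))  # 4
```

In this toy setting a wrong coefficient costs a single relabel, while a wrong functional form costs several edits, which is the intuition behind awarding partial credit from tree similarity.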

Capabilities

Physics problem solving requiring symbolic derivation
Expression Edit Distance scoring for partial credit
Coverage of six physics domains, from high school to Physics Olympiad difficulty
Symbolic LaTeX answer validation

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT.

Tasks

There is one split in this environment:

test: 500 tasks

Tasks span six physics categories:

Category	Description
MECHANICS	Classical motion, forces, oscillations
ELECTRICITY	Electromagnetic fields, circuits, potentials
OPTICS	Wave phenomena, interference, diffraction
THERMODYNAMICS	Heat, entropy, phase transitions
MODERN	Relativistic effects, quantum concepts
ADVANCED	Complex multidisciplinary problems

Reward Structure

Single-turn evaluation with Expression Edit Distance (EED) scoring. The agent submits a symbolic LaTeX answer via the submit_answer tool. The EED score (0-100) measures expression tree similarity:

100: Symbolically equivalent to ground truth
60-100: Minor errors (e.g., coefficient mistakes)
<30: Major structural errors

The reward is the EED score normalized to 0.0-1.0.
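The normalization can be sketched as follows. This is a minimal illustration that assumes plain division by 100; the function names are hypothetical, and the band labels simply mirror the three ranges listed above (scores from 30 up to 60 are not characterized in this README, so the sketch reports them as such).

```python
def eed_to_reward(eed_score: float) -> float:
    """Map an EED score in [0, 100] to a reward in [0.0, 1.0]."""
    if not 0 <= eed_score <= 100:
        raise ValueError("EED score must lie in [0, 100]")
    return eed_score / 100


def describe_band(eed_score: float) -> str:
    """Return the README's description for the band a score falls in."""
    if not 0 <= eed_score <= 100:
        raise ValueError("EED score must lie in [0, 100]")
    if eed_score == 100:
        return "symbolically equivalent to ground truth"
    if eed_score >= 60:
        return "minor errors"
    if eed_score < 30:
        return "major structural errors"
    return "not characterized in the README"


print(eed_to_reward(50), describe_band(50))
print(eed_to_reward(100), describe_band(100))
```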

Data

The data file phybench_data.parquet (500 problems) is sourced from the Hugging Face dataset Eureka-Lab/PHYBench and stored on the OpenReward platform.

Tools

Tool	Description
`submit_answer`	Submit a symbolic LaTeX expression. Evaluated via EED scoring with partial credit. Ends the episode.

Time Horizon

Single-turn. The agent reads the physics problem and submits one symbolic answer.

Environment Difficulty

PHYBench evaluates physical perception and robust reasoning. Models significantly underperform compared to human experts:

Model	Accuracy	EED Score
Human Baseline (PKU students)	61.9%	70.4
Gemini 2.5 Pro	36.9%	~50
DeepSeek-R1	~30%	~45
o3-mini (high)	~30%	~45
Other SOTA LLMs	20-35%	30-45

The 25-percentage-point accuracy gap between the human baseline (61.9%) and the best listed model, Gemini 2.5 Pro (36.9%), demonstrates the benchmark's difficulty.

Other Environment Requirements

There are no further environment requirements; PHYBench works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in PHYBench solve physics derivation problems in a standard environment. The environment does not present direct safety risks.

Citation

@article{qiu2025phybench,
  title={PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models},
  author={Qiu, Shi and Guo, Shaoyang and Song, Zhuo-Yang and Sun, Yunbo and Cai, Zeyu and Wei, Jiashen and Luo, Tianyu and Yin, Yixuan and Zhang, Haoxu and Hu, Yi and Wang, Chenyang and Tang, Chencheng and Chang, Haoling and Liu, Qi and Zhou, Ziheng and Zhang, Tianyu and Zhang, Jingtian and Liu, Zhangyi and Li, Minghao and Zhang, Yuku and Jing, Boxuan and Yin, Xianqi and Ren, Yutong and Fu, Zizhuo and Ji, Jiaming and Wang, Weike and Tian, Xudong and Lv, Anqi and Man, Laifu and Li, Jianxiang and Tao, Feiyu and Sun, Qihua and Liang, Zhou and Mu, Yushu and Li, Zhongxuan and Zhang, Jing-Jun and Zhang, Shutao and Li, Xiaotian and Xia, Xingqi and Lin, Jiawei and Shen, Zheyu and Chen, Jiahang and Xiong, Qiuhao and Wang, Binran and Wang, Fengyuan and Ni, Ziyang and Zhang, Bohan and Cui, Fan and Shao, Changkun and Cao, Qing-Hong and Luo, Ming-xing and Yang, Yaodong and Zhang, Muhan and Zhu, Hua Xing},
  journal={arXiv preprint arXiv:2504.16074},
  year={2025}
}

Repository

Source repository

EnvCommons/PHYBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	Not configured

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	Not configured
Total	$0.0000320

Examples

5-minute session: $0.0096

1-hour session: $0.1152
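The example figures follow from multiplying the per-second rate by the session length. A minimal sketch, assuming billing is exactly linear in seconds with no rounding or minimum charge (the function and constant names are hypothetical):

```python
from decimal import Decimal

# Per-second environment rate in USD, taken from the cost table.
ENVIRONMENT_RATE_PER_SECOND = Decimal("0.0000320")


def session_cost(seconds: int, rate: Decimal = ENVIRONMENT_RATE_PER_SECOND) -> Decimal:
    """Estimated cost in USD of a session lasting `seconds` seconds."""
    if seconds < 0:
        raise ValueError("session length cannot be negative")
    return rate * seconds


print(session_cost(5 * 60))   # 0.0096000
print(session_cost(60 * 60))  # 0.1152000
```

`Decimal` is used instead of `float` so that the small per-second rate multiplies out without binary rounding noise.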

PHYBench

RISE-AGI/PHYBench
