phybench

Description

PHYBench is a benchmark of 500 original physics problems, ranging in difficulty from high school to Physics Olympiad level, designed to evaluate multi-step and multi-condition reasoning in LLMs. It avoids data contamination through original content and a systematic curation pipeline. It also introduces the Expression Edit Distance (EED) Score, a metric for assessing mathematical expressions that improves evaluation precision and sample efficiency and differentiates model reasoning performance more strongly than prior baselines.
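
To make the idea of an edit distance between expressions concrete, here is a minimal sketch, not the benchmark's implementation. It assumes expressions are already parsed into trees written as nested tuples, uses unit costs for inserting, deleting, and relabeling a node, and converts the distance into a similarity by dividing by the reference tree size. The names tree_distance and similarity, the tuple encoding, and the linear scoring rule are all hypothetical choices made for this illustration; the actual EED Score may parse, simplify, and score expressions differently.

```python
from functools import lru_cache

# Illustrative encoding (an assumption of this sketch):
# a tree is (label, children), where children is a tuple of trees;
# a forest is a tuple of trees.


def size(forest):
    """Count the nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Ordered edit distance between two forests with unit costs."""
    if not f1:
        return size(f2)  # insert every remaining node of f2
    if not f2:
        return size(f1)  # delete every remaining node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the root of the last tree in f1; its children move up.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the last tree in f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Map the two roots onto each other, relabeling if they differ.
    match = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2)
    )
    return min(delete, insert, match)


def tree_distance(t1, t2):
    """Edit distance between two single expression trees."""
    return forest_distance((t1,), (t2,))


def similarity(answer, reference):
    """Hypothetical score in [0, 1]: 1 minus distance over reference size."""
    r = tree_distance(answer, reference) / size((reference,))
    return max(0.0, 1.0 - r)


def leaf(name):
    return (name, ())


# Reference answer m*g*h versus a model answer 2*m*g*h (one extra factor).
reference = ("*", (leaf("m"), leaf("g"), leaf("h")))
answer = ("*", (leaf("2"), leaf("m"), leaf("g"), leaf("h")))

print(tree_distance(answer, reference))
print(similarity(answer, reference))
```

Unlike exact-match grading, a distance-based score of this kind gives partial credit to an answer that differs from the reference by a single factor, which is the sort of finer-grained signal the description attributes to the EED Score.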

Implementations (1)

Environment          Stars    Last Updated
RISE-AGI/PHYBench    1        2 months ago