VisPhyBench

Description

VisPhyBench is a benchmark for evaluating physical reasoning in multimodal LLMs. It uses an execution-based protocol: a model must generate executable simulator code from visual observations, which makes its inferred world representation directly inspectable, editable, and falsifiable. The benchmark comprises 209 evaluation scenes derived from 108 physical templates, together with a systematic protocol that measures how well a model reconstructs appearance and reproduces physically plausible motion. The pipeline yields 97.7% valid reconstructed videos.
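
To make the execution-based idea concrete, here is a minimal sketch of how a harness might run model-generated simulator code and score it. It is an illustration under stated assumptions, not the benchmark's actual implementation: the `simulate` entry point, the `Frame` format, and the `valid_rate` and `motion_error` helpers are all hypothetical names, and the mean absolute difference used here only stands in for whatever appearance and motion metrics the benchmark really defines.

```python
from typing import List, Optional, Sequence

Frame = List[float]  # hypothetical frame format: flat list of object coordinates


def run_candidate(source: str, n_frames: int = 5) -> Optional[List[Frame]]:
    """Execute model-generated code and return its frames, or None if it is invalid."""
    namespace: dict = {}
    try:
        # A real harness would sandbox this call; plain exec is for illustration only.
        exec(source, namespace)
        frames = namespace["simulate"](n_frames)  # `simulate` is an assumed entry point
    except Exception:
        return None
    if not isinstance(frames, list) or len(frames) != n_frames:
        return None
    return frames


def valid_rate(sources: Sequence[str]) -> float:
    """Fraction of candidate programs that execute and yield a full frame sequence."""
    return sum(run_candidate(s) is not None for s in sources) / len(sources)


def motion_error(pred: List[Frame], ref: List[Frame]) -> float:
    """Mean absolute coordinate difference between two equally shaped trajectories."""
    diffs = [abs(p - r) for pf, rf in zip(pred, ref) for p, r in zip(pf, rf)]
    return sum(diffs) / len(diffs)


# Two toy candidates: a falling object, and a program that fails at run time.
GOOD = "def simulate(n):\n    return [[0.0, 10.0 - 4.9 * (0.1 * t) ** 2] for t in range(n)]\n"
BAD = "def simulate(n):\n    return undefined_name\n"

frames = run_candidate(GOOD)
print(valid_rate([GOOD, BAD]))       # 0.5
print(motion_error(frames, frames))  # 0.0
```

Because the candidate is ordinary code, a failed or implausible reconstruction can be inspected and edited directly, which is the property the description above emphasizes.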

