vln-mme

Description

VLN-MME is a standardized benchmark for probing multimodal large language models (MLLMs) as zero-shot embodied agents in Vision-and-Language Navigation (VLN). It bridges traditional navigation datasets into a unified, extensible, and modular evaluation framework, which enables structured comparisons and component-level ablations across MLLM architectures and agent designs. Its results show that augmentations such as Chain-of-Thought and self-reflection can degrade performance, exposing limited 3D spatial reasoning and sequential decision-making in current MLLMs.

