vln-mme
Description
VLN-MME is a standardized benchmark for probing multimodal large language models (MLLMs) as zero-shot embodied agents in Vision-and-Language Navigation (VLN). It brings traditional navigation datasets into a unified, extensible, and modular evaluation framework, which enables structured comparisons and component-level ablations across MLLM architectures and agent designs. Its results show that augmentations such as Chain-of-Thought and self-reflection can degrade performance, exposing the limited 3D spatial reasoning and sequential decision-making abilities of current MLLMs.