TIR-bench

Name: arXiv/TIR-bench
Author: arXiv

Description

TIR-Bench is a benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each of which requires novel tool use for image processing and manipulation within the chain of thought. An evaluation of 22 multimodal large language models shows that the benchmark is challenging for all of them and that strong performance requires genuine thinking-with-images capabilities. The work also includes a pilot study comparing direct and agentic fine-tuning.

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/TIRBench	0	3 months ago