TIR-Bench

Description

TIR-Bench is a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation within chain-of-thought reasoning. An evaluation of 22 multimodal large language models shows that the benchmark is universally challenging and that strong performance requires genuine thinking-with-images capabilities. The work also includes a pilot study comparing direct and agentic fine-tuning.

Implementations (1)
Environment                  Stars  Last Updated
GeneralReasoning/TIRBench    0      1 month ago