TIR-Bench
Description
TIR-Bench is a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation within the chain of thought. An evaluation of 22 multimodal large language models shows that the benchmark is universally challenging and that strong performance requires genuine thinking-with-images capabilities. The work also includes a pilot study comparing direct and agentic fine-tuning.
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 1 month ago |