deepsynth
Description
DEEPSYNTH is a benchmark for evaluating LLM-based agents on realistic, time-consuming problems that require information gathering, synthesis, and structured reasoning to produce verifiable insights. It contains 120 tasks collected across 7 domains and 67 countries, constructed via a multi-stage pipeline in which annotators gather official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers.
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 2 months ago |