deepsynth

Name: arXiv/deepsynth
Author: arXiv

Description

DEEPSYNTH is a benchmark for evaluating LLM-based agents on realistic, time-consuming problems that require information gathering, synthesis, and structured reasoning to produce verifiable insights. It contains 120 tasks collected across 7 domains and 67 countries, constructed via a multi-stage pipeline in which annotators gather official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers.
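To make "verifiable answers" concrete, here is a minimal sketch of how an agent's output could be checked against a gold answer. It assumes each answer is a flat mapping of field names to string values and that scoring is a strict match after light normalization. The `Task` record, its field names, and the scoring rule are all hypothetical illustrations, not the benchmark's actual schema or official metric.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not the dataset schema."""
    task_id: str
    question: str
    gold_answer: dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def is_correct(predicted: dict[str, str], gold: dict[str, str]) -> bool:
    """Strict match (an assumed rule): same keys, and every value equal after normalization."""
    if set(predicted) != set(gold):
        return False
    return all(normalize(predicted[key]) == normalize(gold[key]) for key in gold)


def accuracy(tasks: list[Task], predictions: dict[str, dict[str, str]]) -> float:
    """Fraction of tasks answered correctly; a missing prediction counts as wrong."""
    if not tasks:
        return 0.0
    correct = 0
    for task in tasks:
        predicted = predictions.get(task.task_id)
        if predicted is not None and is_correct(predicted, task.gold_answer):
            correct += 1
    return correct / len(tasks)
```

With two toy tasks (made-up values, not drawn from the benchmark), where one prediction matches its gold answer apart from case and whitespace and the other differs in a value, `accuracy` returns 0.5. A strict match like this is easy to audit, though a real evaluation may award partial credit per field.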

Links: arXiv, GitHub, HuggingFace

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/DeepSynth	0	3 months ago