deepsynth

Description

DEEPSYNTH is a benchmark for evaluating LLM-based agents on realistic, time-consuming problems that require information gathering, synthesis, and structured reasoning to produce verifiable insights. It contains 120 tasks collected across 7 domains and 67 countries, constructed via a multi-stage pipeline in which annotators gather official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers.

Implementations (1)
Environment                 Stars  Last Updated
GeneralReasoning/DeepSynth  0      2 months ago
arXiv/deepsynth | OpenReward