kumo

Description

KUMO is a generative evaluation framework for assessing reasoning in LLMs. It combines LLMs with symbolic engines to dynamically produce diverse, partially observable, multi-turn reasoning tasks with adjustable difficulty. An automated pipeline continuously generates novel tasks across open-ended domains, so models must demonstrate genuine generalization rather than memorization, and the benchmark remains contamination-resistant for long-term evaluation.
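To make "partially observable, multi-turn, adjustable difficulty" concrete, here is a minimal toy sketch in Python. It is not KUMO's generator or API: the names `ToyTask`, `generate_task`, and `solve`, the probe-style actions, and the use of `num_truths` as the difficulty knob are all assumptions made purely for illustration. A seeded generator produces a fresh task each time, the solver never reads the hidden answer directly, and it must narrow down candidates over several turns.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyTask:
    """Hypothetical toy task: identify a hidden truth by taking actions."""
    truths: list[str]
    # outcomes[action][truth] is the observation that `action` yields
    # when `truth` is the hidden one. The solver may read this table.
    outcomes: dict[str, dict[str, str]]
    hidden: str  # the solver must not read this directly
    turns: int = 0

    def act(self, action: str) -> str:
        """Take one turn: reveal only the observation for the hidden truth."""
        self.turns += 1
        return self.outcomes[action][self.hidden]


def generate_task(num_truths: int, seed: int) -> ToyTask:
    """Generate a fresh task; `num_truths` is the (assumed) difficulty knob."""
    rng = random.Random(seed)
    truths = [f"truth_{i}" for i in range(num_truths)]
    # Give each truth a distinct code so the task is always solvable.
    codes = list(range(num_truths))
    rng.shuffle(codes)
    num_bits = max(1, (num_truths - 1).bit_length())
    outcomes = {
        f"probe_{k}": {
            t: "high" if (code >> k) & 1 else "low"
            for t, code in zip(truths, codes)
        }
        for k in range(num_bits)
    }
    return ToyTask(truths, outcomes, hidden=rng.choice(truths))


def solve(task: ToyTask) -> str:
    """Eliminate candidates turn by turn until one truth remains."""
    candidates = set(task.truths)
    for action, table in task.outcomes.items():
        observation = task.act(action)
        candidates = {t for t in candidates if table[t] == observation}
        if len(candidates) == 1:
            break
    return next(iter(candidates))


if __name__ == "__main__":
    task = generate_task(num_truths=8, seed=0)
    answer = solve(task)
    print(answer == task.hidden, "turns used:", task.turns)
```

Because each task comes from a seed rather than a fixed file, a new instance can be drawn for every evaluation run, which is the intuition behind contamination resistance in this toy setting.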

Implementations (1)
Environment              Stars   Last Updated
GeneralReasoning/kumo    0       1 month ago
arXiv/kumo | OpenReward