kumo

Name: arXiv/kumo
Author: arXiv

Complex Reasoning Evaluation in Large Language Models

Description

KUMO is a generative evaluation framework for assessing reasoning in LLMs. It combines LLMs with symbolic engines to dynamically produce diverse, partially observable, multi-turn reasoning tasks with adjustable difficulty. An automated pipeline continuously generates novel tasks across open-ended domains, so models must demonstrate genuine generalization rather than memorization, and the benchmark stays contamination-resistant for long-term evaluation.
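
To make "partially observable, multi-turn, adjustable difficulty" concrete, here is a minimal toy sketch in Python. It is not KUMO's actual code or API: the names make_task, observe, eliminate, and solve are hypothetical, a simple bit-test table stands in for a symbolic engine, and a fixed loop stands in for the model under evaluation. The solver never sees the hidden answer directly; it only sees the outcome of each action it takes and uses it to rule out candidates.

```python
import random


def make_task(num_truths=8, seed=0):
    """Build a toy task (hypothetical format, not KUMO's).

    Difficulty is adjusted through num_truths: more candidates require
    more turns. Action j reveals bit j of the hidden truth's index.
    """
    rng = random.Random(seed)
    truths = [f"truth_{i}" for i in range(num_truths)]
    hidden = rng.choice(truths)
    # Number of binary tests needed to tell all candidates apart.
    num_actions = (num_truths - 1).bit_length()
    # Outcome table: what each action would show for each possible truth.
    actions = {
        f"action_{j}": {t: (i >> j) & 1 for i, t in enumerate(truths)}
        for j in range(num_actions)
    }
    return truths, actions, hidden


def observe(actions, hidden, action):
    """Environment side: return the outcome of an action for the hidden truth."""
    return actions[action][hidden]


def eliminate(candidates, actions, action, outcome):
    """Keep only the candidates consistent with the observed outcome."""
    return [t for t in candidates if actions[action][t] == outcome]


def solve(truths, actions, hidden):
    """Multi-turn loop: act, observe, rule out, until one candidate remains."""
    candidates = list(truths)
    turns = 0
    for name in actions:
        if len(candidates) == 1:
            break
        outcome = observe(actions, hidden, name)
        candidates = eliminate(candidates, actions, name, outcome)
        turns += 1
    return candidates, turns


if __name__ == "__main__":
    truths, actions, hidden = make_task(num_truths=8, seed=1)
    candidates, turns = solve(truths, actions, hidden)
    print(candidates == [hidden], turns)
```

Because tasks are generated from parameters and a seed rather than drawn from a fixed list, fresh instances can be produced on demand, which is the property the description relies on for contamination resistance.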

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/kumo	0	3 months ago