KUMO

⭐ OpenReward Environment

Description

KUMO is an environment for evaluating complex reasoning in LLMs through an interactive truth-identification game. Given a set of candidate truths and a set of available test actions, the agent must identify the hidden truth by querying tests and reasoning about the results. Tasks are procedurally generated across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization.

This OpenReward implementation is ported from the original Harbor Framework implementation by lijrjyan.

Capabilities

  • Interactive hypothesis testing and elimination
  • Strategic query selection under a limited action budget
  • Logical reasoning and deduction from test results
  • Generalization across diverse domains (algorithms, data structures, etc.)

Compute Requirements

Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.

License

Apache 2.0.

Tasks

There is one split in this environment:

  • Test: 5,300 reasoning tasks

Each task presents candidate truths and available test actions. The agent has up to 50 queries to identify the hidden truth by reasoning about test outcomes.
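The core loop described above can be sketched as a simple elimination procedure. This is a hypothetical illustration, not the actual KUMO task API: here candidate truths are plain values, tests are boolean predicates, and `run_test` evaluates a test against the hidden truth.

```python
# Hypothetical sketch of KUMO-style truth identification by elimination.
# `candidates`, `tests`, and `run_test` are illustrative stand-ins for the
# real task interface.
def identify_truth(candidates, tests, run_test, budget=50):
    remaining = set(candidates)
    for _ in range(budget):
        if len(remaining) <= 1:
            break
        # Greedy query selection: pick the test whose two outcomes split the
        # remaining candidates most evenly (maximizes worst-case pruning).
        def balance(test):
            positives = sum(1 for c in remaining if test(c))
            return min(positives, len(remaining) - positives)
        test = max(tests, key=balance)
        outcome = run_test(test)
        # Eliminate every candidate inconsistent with the observed outcome.
        remaining = {c for c in remaining if test(c) == outcome}
    return remaining.pop() if len(remaining) == 1 else None


# Toy domain: the hidden truth is 7 among candidates 1..8.
hidden = 7
tests = [lambda c: c % 2 == 0, lambda c: c > 4, lambda c: c > 6]
print(identify_truth(range(1, 9), tests, lambda t: t(hidden)))  # → 7
```

With an even split each query, this strategy halves the candidate set per test, so the 50-query budget is generous for most candidate-set sizes.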

Reward Structure

This is a multi-turn, sandbox-based environment. The agent queries tests, reasons about results, writes its identified truth to /app/answer.txt, and calls submit_answer for verification.

  • 1.0: The agent's answer exactly matches the hidden truth.
  • 0.0: The answer is incorrect or missing.
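Inside the sandbox, the submission step amounts to writing the identified truth to the answer file and then calling the `submit_answer` tool. A minimal sketch (the `/app/answer.txt` path comes from this README; the `ANSWER_FILE` fallback exists only so the snippet runs outside the sandbox):

```shell
# Illustrative submission workflow; /app/answer.txt is the real sandbox path.
ANSWER_FILE="${ANSWER_FILE:-/tmp/answer.txt}"
echo "candidate_truth_A" > "$ANSWER_FILE"   # write the identified truth
cat "$ANSWER_FILE"                          # check contents before calling submit_answer
```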

Data

Each task directory contains an instruction.md describing the candidate truths, available actions, and rules. Task data is stored on the OpenReward platform.

Tools

| Tool | Description |
| --- | --- |
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification. |

Time Horizon

KUMO is a multi-turn environment. Agents read the task, issue test queries, analyze results, and submit their identified truth.

Environment Difficulty

The original paper evaluates 23 state-of-the-art LLMs on 5,000 tasks across 100 domains:

| Model | Easy | Hard | Easy–Hard Gap |
| --- | --- | --- | --- |
| qwq-32b | 90.3% | 61.1% | +29.2% |
| deepseek-v3 | 89.3% | 55.0% | +34.3% |
| o1-mini | 87.8% | 62.9% | +24.9% |
| deepseek-r1 | 87.1% | 61.1% | +26.0% |
| claude-3.5-sonnet | 82.6% | 44.0% | +38.6% |

The paper reports that reasoning-scaled LLMs reach university-student-level performance on complex reasoning tasks, and that KUMO scores correlate strongly with results on established real-world reasoning benchmarks.

Other Environment Requirements

There are no further environment requirements; KUMO works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in KUMO solve abstract reasoning puzzles in a sandboxed environment. The environment does not present direct safety risks.

Citations

@article{lin2025kumo,
  author    = {Lin, Haowei and Wang, Xiangyu and Yan, Ruilin and Huang, Baizhou and Ye, Haotian and Zhu, Jianhua and Wang, Zihao and Zou, James and Ma, Jianzhu and Liang, Yitao},
  title     = {Generative Evaluation of Complex Reasoning in Large Language Models},
  journal   = {arXiv preprint arXiv:2504.02810},
  year      = {2025},
  url       = {https://arxiv.org/abs/2504.02810}
}