KUMO
Description
KUMO is an environment for evaluating complex reasoning in LLMs through an interactive truth-identification game. Given a set of candidate truths and a set of available test actions, the agent must identify the hidden truth by querying tests and reasoning about the results. Tasks are procedurally generated across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization.
This OpenReward implementation is a port of the Harbor Framework implementation originally written by lijrjyan.
Capabilities
- Interactive hypothesis testing and elimination
- Strategic query selection under a limited action budget
- Logical reasoning and deduction from test results
- Generalization across diverse domains (algorithms, data structures, etc.)
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.
License
Tasks
There is one split in this environment:
- Test: 5,300 reasoning tasks
Each task presents candidate truths and available test actions. The agent has up to 50 queries to identify the hidden truth by reasoning about test outcomes.
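The query-and-eliminate loop described above can be sketched as follows. This is an illustrative sketch only: the candidate truths, the encoding of test outcomes, and the `eliminate` helper are assumptions for demonstration, not the actual KUMO task format.

```python
def eliminate(candidates, observations, budget=50):
    """Keep only candidate truths consistent with every observed test
    outcome, stopping once one candidate remains or the query budget
    (50 in KUMO) is exhausted."""
    remaining = set(candidates)
    for i, (outcome, consistent_with) in enumerate(observations):
        if i >= budget:
            break
        # Intersect with the set of truths consistent with this outcome.
        remaining &= consistent_with[outcome]
        if len(remaining) == 1:
            break
    return remaining

# Toy example: three hypothetical candidate truths, two binary tests.
candidates = {"A", "B", "C"}
observations = [
    ("pos", {"pos": {"A", "B"}, "neg": {"C"}}),
    ("neg", {"pos": {"A"}, "neg": {"B", "C"}}),
]
print(eliminate(candidates, observations))  # {'B'}
```

In practice the agent also has to *choose* which test to query next; a common heuristic is to pick the test that most evenly splits the remaining candidates, maximizing expected elimination per query.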
Reward Structure
This is a multi-turn, sandbox-based environment. The agent queries tests, reasons about results, writes its identified truth to `/app/answer.txt`, and calls `submit_answer` for verification.
- 1.0: The agent's answer exactly matches the hidden truth.
- 0.0: The answer is incorrect or missing.
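The binary reward above amounts to an exact-match check. A minimal sketch, assuming whitespace-stripped comparison — the `score` function name and normalization are illustrative assumptions; the actual verifier runs on the OpenReward platform:

```python
def score(answer_file_contents, hidden_truth):
    """Return 1.0 if the submitted answer exactly matches the hidden
    truth (ignoring surrounding whitespace), else 0.0."""
    if answer_file_contents is None:
        return 0.0  # missing answer file
    return 1.0 if answer_file_contents.strip() == hidden_truth.strip() else 0.0

print(score("binary search\n", "binary search"))  # 1.0
print(score("linear scan", "binary search"))      # 0.0
```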
Data
Each task directory contains an instruction.md describing the candidate truths, available actions, and rules. Task data is stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification. |
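A typical final step with these tools might look like the snippet below, run via the `bash` tool before calling `submit_answer`. The answer string is hypothetical, and `/tmp/answer.txt` stands in for `/app/answer.txt` so the snippet runs outside the sandbox:

```shell
# Write the deduced truth to the answer file expected by the verifier.
answer="quicksort"                        # illustrative truth only
printf '%s\n' "$answer" > /tmp/answer.txt # /app/answer.txt in the real sandbox
cat /tmp/answer.txt
```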
Time Horizon
KUMO is a multi-turn environment. Agents read the task, issue test queries, analyze results, and submit their identified truth.
Environment Difficulty
The original paper evaluates 23 state-of-the-art LLMs on 5,000 tasks across 100 domains:
| Model | Easy | Hard | Easy–Hard Gap |
|---|---|---|---|
| qwq-32b | 90.3% | 61.1% | +29.2% |
| deepseek-v3 | 89.3% | 55.0% | +34.3% |
| o1-mini | 87.8% | 62.9% | +24.9% |
| deepseek-r1 | 87.1% | 61.1% | +26.0% |
| claude-3.5-sonnet | 82.6% | 44.0% | +38.6% |
According to the paper, reasoning-scaled LLMs reach university-student-level performance on complex reasoning challenges, and KUMO scores correlate strongly with results on established real-world reasoning benchmarks.
Other Environment Requirements
There are no further environment requirements; KUMO works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in KUMO solve abstract reasoning puzzles in a sandboxed environment. The environment does not present direct safety risks.
Citations
@article{lin2025kumo,
  author  = {Lin, Haowei and Wang, Xiangyu and Yan, Ruilin and Huang, Baizhou and Ye, Haotian and Zhu, Jianhua and Wang, Zihao and Zou, James and Ma, Jianzhu and Liang, Yitao},
  title   = {Generative Evaluation of Complex Reasoning in Large Language Models},
  journal = {arXiv preprint arXiv:2504.02810},
  year    = {2025},
  url     = {https://arxiv.org/abs/2504.02810}
}