KUMO
Description
KUMO is an environment for evaluating complex reasoning in LLMs through an interactive truth-identification game. Given a set of candidate truths and a set of available test actions, the agent must identify the hidden truth by querying tests and reasoning about the results. Tasks are procedurally generated across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization.
This OpenReward implementation is a port of the Harbor Framework implementation originally written by lijrjyan.
Capabilities
- Interactive hypothesis testing and elimination
- Strategic query selection under a limited action budget
- Logical reasoning and deduction from test results
- Generalization across diverse domains (algorithms, data structures, etc.)
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.
License
Tasks
There is one split in this environment:
- Test: 5,300 reasoning tasks
Each task presents candidate truths and available test actions. The agent has up to 50 queries to identify the hidden truth by reasoning about test outcomes.
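The query-and-eliminate loop described above can be sketched as follows. This is an illustrative sketch only: the candidate truths, the encoding of test outcomes, and the `eliminate` helper are assumptions for demonstration, not the actual KUMO task format.

```python
def eliminate(candidates, observations, budget=50):
    """Keep only candidate truths consistent with every observed test
    outcome, stopping once one candidate remains or the query budget
    (50 in KUMO) is exhausted."""
    remaining = set(candidates)
    for i, (outcome, consistent_with) in enumerate(observations):
        if i >= budget:
            break
        # Intersect with the set of truths consistent with this outcome.
        remaining &= consistent_with[outcome]
        if len(remaining) == 1:
            break
    return remaining

# Toy example: three hypothetical candidate truths, two binary tests.
candidates = {"A", "B", "C"}
observations = [
    ("pos", {"pos": {"A", "B"}, "neg": {"C"}}),
    ("neg", {"pos": {"A"}, "neg": {"B", "C"}}),
]
print(eliminate(candidates, observations))  # {'B'}
```

In practice the agent also has to *choose* which test to query next; a common heuristic is to pick the test that most evenly splits the remaining candidates, maximizing expected elimination per query.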
Reward Structure
This is a multi-turn, sandbox-based environment. The agent queries tests, reasons about results, writes its identified truth to `/app/answer.txt`, and calls `submit_answer` for verification.
- 1.0: The agent's answer exactly matches the hidden truth.
- 0.0: The answer is incorrect or missing.
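The binary reward above amounts to an exact-match check. A minimal sketch, assuming whitespace-stripped comparison — the `score` function name and normalization are illustrative assumptions; the actual verifier runs on the OpenReward platform:

```python
def score(answer_file_contents, hidden_truth):
    """Return 1.0 if the submitted answer exactly matches the hidden
    truth (ignoring surrounding whitespace), else 0.0."""
    if answer_file_contents is None:
        return 0.0  # missing answer file
    return 1.0 if answer_file_contents.strip() == hidden_truth.strip() else 0.0

print(score("binary search\n", "binary search"))  # 1.0
print(score("linear scan", "binary search"))      # 0.0
```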
Data
Each task directory contains an instruction.md describing the candidate truths, available actions, and rules. Task data is stored on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated verification. |
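A typical final step with these tools might look like the snippet below, run via the `bash` tool before calling `submit_answer`. The answer string is hypothetical, and `/tmp/answer.txt` stands in for `/app/answer.txt` so the snippet runs outside the sandbox:

```shell
# Write the deduced truth to the answer file expected by the verifier.
answer="quicksort"                        # illustrative truth only
printf '%s\n' "$answer" > /tmp/answer.txt # /app/answer.txt in the real sandbox
cat /tmp/answer.txt
```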
Time Horizon
KUMO is a multi-turn environment. Agents read the task, issue test queries, analyze results, and submit their identified truth.
Environment Difficulty
The original paper evaluates 23 state-of-the-art LLMs on 5,000 tasks across 100 domains:
| Model | Easy | Hard | Easy–Hard Gap |
|---|---|---|---|
| qwq-32b | 90.3% | 61.1% | +29.2% |
| deepseek-v3 | 89.3% | 55.0% | +34.3% |
| o1-mini | 87.8% | 62.9% | +24.9% |
| deepseek-r1 | 87.1% | 61.1% | +26.0% |
| claude-3.5-sonnet | 82.6% | 44.0% | +38.6% |
According to the paper, reasoning-scaled LLMs reach university-student-level performance on complex reasoning challenges, and KUMO scores correlate strongly with results on established real-world reasoning benchmarks.
Other Environment Requirements
There are no further environment requirements; KUMO works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in KUMO solve abstract reasoning puzzles in a sandboxed environment. The environment does not present direct safety risks.
Citations
@article{lin2025kumo,
  author  = {Lin, Haowei and Wang, Xiangyu and Yan, Ruilin and Huang, Baizhou and Ye, Haotian and Zhu, Jianhua and Wang, Zihao and Zou, James and Ma, Jianzhu and Liang, Yitao},
  title   = {Generative Evaluation of Complex Reasoning in Large Language Models},
  journal = {arXiv preprint arXiv:2504.02810},
  year    = {2025},
  url     = {https://arxiv.org/abs/2504.02810}
}