Endless Terminals
Description
Endless Terminals is an environment for training and evaluating terminal agents on procedurally generated command-line tasks. The benchmark provides diverse terminal-use tasks spanning file operations, log management, data processing, scripting, and database operations. Tasks are designed for reinforcement learning with binary episode-level rewards.
This OpenReward implementation is based on the Endless Terminals repository. The original benchmark contains 3,255 tasks; this implementation includes 2,490 tasks.
Capabilities
- File operations and management
- Log analysis and processing
- Data transformation and scripting
- Database operations
- System administration tasks
- Multi-step terminal command execution
Compute Requirements
Agents are given a sandbox with 1 CPU and 2 GB of RAM. Each task runs in an isolated Docker container with task-specific files and tooling.
License
Tasks
There is one split in this environment:
- train: 2,490 terminal-based tasks (subset of 3,255 in original benchmark)
Each task provides a containerized environment with specific files and objectives. Agents must execute terminal commands to transform the initial state into the expected final state.
Reward Structure
This is a sparse, verifiable reward environment. The reward is computed once, when the agent submits its answer:
- 1.0: All verification tests pass (final state matches expected)
- 0.0: Any test fails
No LLM grader is used. Each task has pytest-based verification scripts (test_initial_state.py, test_final_state.py) that validate the container state.
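The all-or-nothing rule above can be sketched in a few lines. This is a minimal illustration, not the environment's actual grader: `binary_reward` and `run_verification` are hypothetical names, and invoking pytest through `python -m pytest` is an assumption about how the verification scripts are run. The only fact relied on is that pytest exits with code 0 when every collected test passes.

```python
import subprocess
import sys
from pathlib import Path


def binary_reward(exit_code: int) -> float:
    """Map a pytest exit code to the sparse reward: 1.0 only if every test passed."""
    return 1.0 if exit_code == 0 else 0.0


def run_verification(test_file: Path) -> float:
    """Run one pytest file and return the episode reward (hypothetical helper)."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", str(test_file)],
        capture_output=True,
        text=True,
    )
    return binary_reward(result.returncode)
```

Any non-zero exit code, including a failed test or a collection error, maps to 0.0, so there is no partial credit.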
Data
Task data is sourced from HuggingFace. Each task contains:
- instruction.md: Task description and requirements
- environment/Dockerfile: Task-specific container definition
- environment/image_sha.txt: Docker image digest
- tests/test_final_state.py: Pytest verification logic
- tests/test.sh: Test execution wrapper
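A quick way to sanity-check a downloaded task against the file list above is sketched below. It assumes each task is unpacked into its own directory with exactly these relative paths; `EXPECTED_FILES` and `missing_task_files` are hypothetical names introduced for illustration, not part of the dataset or the environment.

```python
from pathlib import Path

# Relative paths taken from the file list in this section.
EXPECTED_FILES = (
    "instruction.md",
    "environment/Dockerfile",
    "environment/image_sha.txt",
    "tests/test_final_state.py",
    "tests/test.sh",
)


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the expected files that are absent from a task directory."""
    return [rel for rel in EXPECTED_FILES if not (task_dir / rel).is_file()]
```

An empty return value means the task directory has every listed file; anything else names what is missing.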
Tools
Agents have access to 5 tools:
- bash: Execute bash commands in the container
- view: View file contents or directory listings
- str_replace: Replace unique strings in files
- create_file: Create new files with specified content
- submit_answer: Finalize task and run verification tests
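To make the "unique strings" wording for str_replace concrete, here is one plausible reading, sketched as a pure function. The source does not specify what the tool does when the target string is missing or appears more than once; raising an error in those cases is an assumption, and `replace_unique` is a hypothetical name.

```python
def replace_unique(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only when `old` occurs exactly once in `text`."""
    if not old:
        raise ValueError("the string to replace must be non-empty")
    count = text.count(old)
    if count != 1:
        # Assumed behavior: refuse ambiguous or missing targets rather than guess.
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new, 1)
```

Requiring a single match keeps an edit from silently changing a line the agent did not intend to touch.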
Time Horizon
Endless Terminals is a multi-turn environment where agents iteratively execute commands, explore the file system, and modify state before submission.
[Statistics on average tool calls here]
Environment Difficulty
Results from the original paper (dev set performance):
| Model | Before RL | After RL |
|---|---|---|
| Llama-3.2-3B | 4.0% | 18.2% |
| Qwen2.5-7B | 10.7% | 53.3% |
| Qwen3-8B-openthinker-sft | 42.6% | 59.0% |
According to the paper, these gains transfer to human-curated benchmarks such as TerminalBench 2.0.
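For readers comparing models, the absolute improvement implied by the table is simple subtraction. The snippet below only restates the table's numbers; `RESULTS` and `absolute_gain` are names introduced here for illustration.

```python
# (before RL, after RL) success rates in percent, copied from the table above.
RESULTS = {
    "Llama-3.2-3B": (4.0, 18.2),
    "Qwen2.5-7B": (10.7, 53.3),
    "Qwen3-8B-openthinker-sft": (42.6, 59.0),
}


def absolute_gain(before: float, after: float) -> float:
    """Percentage-point improvement, rounded to one decimal place."""
    return round(after - before, 1)


for model, (before, after) in RESULTS.items():
    print(f"{model}: +{absolute_gain(before, after)} points")
```

This gives gains of 14.2, 42.6, and 16.4 percentage points respectively.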
Safety
Endless Terminals tasks are run in isolated Docker containers. Agents interact only with pre-defined task environments and cannot affect external systems or the host machine.
Citations
This environment implements the Endless Terminals benchmark. If you use this environment, please cite the original paper:
@article{gandhi2026endless,
  title = {Endless Terminals: Scaling RL Environments for Terminal Agents},
  author = {Gandhi, Kanishk and Garg, Shivam and Goodman, Noah D. and Papailiopoulos, Dimitris},
  journal = {arXiv preprint arXiv:2601.16443},
  year = {2026}
}