Endless Terminals

Description

Endless Terminals is an environment for training and evaluating terminal agents on procedurally generated command-line tasks. The benchmark provides diverse terminal-use tasks spanning file operations, log management, data processing, scripting, and database operations. Tasks are designed for reinforcement learning with binary episode-level rewards.

This OpenReward implementation is based on the Endless Terminals repository. The original benchmark contains 3,255 tasks; this implementation includes 2,490 tasks.

Capabilities

File operations and management
Log analysis and processing
Data transformation and scripting
Database operations
System administration tasks
Multi-step terminal command execution

Compute Requirements

Agents are given a sandbox with 1 vCPU and 2 GB of RAM. Each task runs in an isolated Docker container with task-specific files and tooling.

License

Apache 2.0

Tasks

There is one split in this environment:

train: 2,490 terminal-based tasks (subset of 3,255 in original benchmark)

Each task provides a containerized environment with specific files and objectives. Agents must execute terminal commands to transform the initial state into the expected final state.

Reward Structure

This is a sparse, verifiable reward environment. The reward is computed once per episode, when the agent submits its answer:

1.0: All verification tests pass (final state matches expected)
0.0: Any test fails

No LLM grader is used. Each task has pytest-based verification scripts (test_initial_state.py, test_final_state.py) that validate the container state.
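
As a rough illustration of the binary reward, the sketch below maps a pytest exit code to an episode reward. The helper names `reward_from_exit_code` and `run_verification` are hypothetical, and invoking pytest through `subprocess` is an assumption; the actual environment runs verification inside the task container through its own wrapper.

```python
import subprocess
import sys


def reward_from_exit_code(exit_code: int) -> float:
    """Binary episode reward: 1.0 only when pytest reports success (exit code 0)."""
    return 1.0 if exit_code == 0 else 0.0


def run_verification(test_path: str) -> float:
    """Hypothetical helper: run one pytest file and convert the result to a reward."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", test_path],
        capture_output=True,
        text=True,
    )
    return reward_from_exit_code(result.returncode)
```

Any non-zero exit code, including the one pytest uses when no tests are collected, maps to 0.0 in this sketch, which mirrors the all-or-nothing rule above.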

Data

Task data is sourced from HuggingFace. Each task contains:

instruction.md: Task description and requirements
environment/Dockerfile: Task-specific container definition
environment/image_sha.txt: Docker image digest
tests/test_final_state.py: Pytest verification logic
tests/test.sh: Test execution wrapper
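
To make the layout above concrete, here is a minimal sketch that checks a local task directory for the listed files. The function name `missing_task_files` is hypothetical, and it assumes a task has already been downloaded to disk; it does not talk to HuggingFace.

```python
from pathlib import Path

# Relative paths taken from the file list above.
EXPECTED_FILES = [
    "instruction.md",
    "environment/Dockerfile",
    "environment/image_sha.txt",
    "tests/test_final_state.py",
    "tests/test.sh",
]


def missing_task_files(task_dir: str) -> list[str]:
    """Return the expected relative paths that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list means the directory has every file named in this section.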

Tools

Agents have access to 5 tools:

bash: Execute bash commands in the container
view: View file contents or directory listings
str_replace: Replace unique strings in files
create_file: Create new files with specified content
submit_answer: Finalize task and run verification tests
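
The "unique strings" wording for str_replace suggests the target must occur exactly once in the file. The sketch below illustrates that behaviour on an in-memory string; the function name `str_replace_unique` and the exact error handling are assumptions, not the environment's implementation.

```python
def str_replace_unique(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `text`."""
    if not old:
        raise ValueError("the string to replace must not be empty")
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}, found {count}")
    return text.replace(old, new, 1)
```

Rejecting ambiguous matches keeps an edit from silently landing on the wrong line, which matters when the only feedback is a final pass/fail reward.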

Time Horizon

Endless Terminals is a multi-turn environment where agents iteratively execute commands, explore the file system, and modify state before submission.

Environment Difficulty

Results from the original paper (dev set performance):

Model	Before RL	After RL
Llama-3.2-3B	4.0%	18.2%
Qwen2.5-7B	10.7%	53.3%
Qwen3-8B-openthinker-sft	42.6%	59.0%

The paper reports that these gains transfer to human-curated benchmarks such as TerminalBench 2.0.
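
For a quick read of the table, the sketch below computes the absolute gain in percentage points for each model. The numbers are copied from the table above; the names `RESULTS` and `absolute_gains` are illustrative only.

```python
# (before RL %, after RL %) per model, copied from the results table.
RESULTS = {
    "Llama-3.2-3B": (4.0, 18.2),
    "Qwen2.5-7B": (10.7, 53.3),
    "Qwen3-8B-openthinker-sft": (42.6, 59.0),
}


def absolute_gains(results: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the after-minus-before gain in percentage points, rounded to 0.1."""
    return {
        model: round(after - before, 1)
        for model, (before, after) in results.items()
    }
```

This gives gains of 14.2, 42.6, and 16.4 percentage points respectively.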

Safety

Endless Terminals tasks run in isolated Docker containers. Agents interact only with pre-defined task environments and cannot affect external systems or the host machine.

Citations

This environment implements the Endless Terminals benchmark. If you use this environment, please cite the original paper:

@article{gandhi2026endless,
  title     = {Endless Terminals: Scaling RL Environments for Terminal Agents},
  author    = {Gandhi, Kanishk and Garg, Shivam and Goodman, Noah D. and Papailiopoulos, Dimitris},
  journal   = {arXiv preprint arXiv:2601.16443},
  year      = {2026}
}

Compute Configuration

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000230
Total	$0.0000550
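
The per-second rates above can be turned into a cost estimate for a run of a given length. This is a minimal sketch that assumes billing is linear in wall-clock seconds, which the table does not state; the names are illustrative.

```python
from decimal import Decimal

# Per-second rates in USD, copied from the cost table.
ENVIRONMENT_PER_SECOND = Decimal("0.0000320")
SANDBOX_PER_SECOND = Decimal("0.0000230")


def estimated_cost(seconds: int) -> Decimal:
    """Estimated USD cost of running the environment plus sandbox for `seconds`."""
    return (ENVIRONMENT_PER_SECOND + SANDBOX_PER_SECOND) * seconds
```

Under that assumption, one hour (3,600 seconds) comes to about $0.198.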
