Terminal-Bench 2.0

OpenReward Environment

Description

Terminal-Bench 2.0 is an environment for evaluating agents on complex terminal-based tasks in containerized environments. Tasks span diverse domains including protein assembly, cryptanalysis, async debugging, security vulnerability resolution, ML model recovery, compiler building, and system administration. Each task runs in a dedicated Docker container with task-specific tooling.

Capabilities

  • Complex terminal and shell operations
  • File manipulation and code editing
  • System debugging and configuration
  • Machine learning and data processing
  • Security analysis and vulnerability fixing
  • Compiler and build system tasks
  • Database recovery and optimization

Compute Requirements

Agents are given a sandbox with 1 CPU and 2 GB of RAM per task. Each task runs in a pre-built Docker container with task-specific dependencies.
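
These limits can be approximated locally with Docker's resource flags. A minimal sketch, assuming a hypothetical image tag (the actual image names are task-specific and not listed here):

    # Approximate the per-task sandbox: 1 CPU, 2 GB RAM.
    # "terminal-bench/example-task:latest" is a placeholder image name.
    docker run --rm -it \
      --cpus=1 \
      --memory=2g \
      terminal-bench/example-task:latest \
      bash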

License

Apache 2.0

Tasks

There is one split in this environment:

  • test: 89 tasks

Tasks include:

  • Scientific Computing: adaptive-rejection-sampler, mcmc-sampling-stan, raman-fitting, portfolio-optimization
  • ML/AI: caffe-cifar-10, hf-model-inference, pytorch-model-recovery, gpt2-codegolf, llm-inference-batching-scheduler
  • Systems: qemu-startup, qemu-alpine-ssh, nginx-request-logging, kv-store-grpc
  • Compilers/Build: build-cython-ext, compile-compcert, build-pov-ray, build-pmars
  • Security: crack-7z-hash, feal-differential-cryptanalysis, feal-linear-cryptanalysis, fix-code-vulnerability, vulnerable-secret
  • Git/VCS: fix-git, git-leak-recovery, git-multibranch, sanitize-git-repo
  • Bioinformatics: dna-assembly, dna-insert, protein-assembly
  • And many more...

Reward Structure

This is a sparse, verifiable reward environment. The agent calls submit_answer to run the task's test suite:

  • 1.0: All tests pass
  • 0.0: Any test fails; partial completions earn no credit

Rewards are read from /logs/verifier/reward.txt or /logs/verifier/reward.json after test execution.
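
A minimal sketch of how a harness might read the reward after the verifier runs. The top-level "reward" key in the JSON variant is an assumption, not a documented schema:

    # Read the reward from whichever file the verifier produced.
    if [ -f /logs/verifier/reward.txt ]; then
      reward=$(cat /logs/verifier/reward.txt)
    elif [ -f /logs/verifier/reward.json ]; then
      # Assumes a top-level "reward" key; the actual schema may differ.
      reward=$(jq -r '.reward' /logs/verifier/reward.json)
    fi
    echo "reward: ${reward:-unknown}"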

Data

Each task directory contains:

  • instruction.md: Task description and requirements
  • environment/Dockerfile: Task-specific container image
  • tests/test.sh: Verification script
  • solution/solve.sh: Reference solution (not visible to agent)

Task data is stored in the repository.
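
To show how these pieces fit together, here is a hypothetical, minimal tests/test.sh that runs a task's checks and emits the sparse reward described above. Real verification scripts are task-specific; ./run_checks.sh is a placeholder:

    #!/usr/bin/env bash
    # Hypothetical minimal verifier: run the task's checks, then write the reward.
    set -u

    mkdir -p /logs/verifier

    if ./run_checks.sh; then    # placeholder for the task's actual test commands
      echo "1.0" > /logs/verifier/reward.txt
    else
      echo "0.0" > /logs/verifier/reward.txt
    fi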

Tools

Tool           Description
-------------  ------------------------------------------
bash           Execute bash commands in the container
view           View file contents or directory listings
str_replace    Replace a unique string in a file
create_file    Create a new file with specified content
submit_answer  Run test suite and submit solution
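
For intuition, the file tools map onto familiar shell operations. The commands below are rough bash analogues with illustrative file names, not the tool-call interface itself (agents invoke the tools directly):

    # view: inspect a file or a directory listing
    cat instruction.md
    ls -la src/

    # str_replace: swap a string that occurs exactly once in a file
    grep -c 'old_value' src/app.py              # should print 1 before replacing
    sed -i 's/old_value/new_value/' src/app.py  # sed shown as a rough analogue

    # create_file: create a new file with the given content
    cat > notes.txt <<'EOF'
    scratch notes for the task
    EOF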

Time Horizon

Multi-turn. Agents explore the containerized environment, implement solutions using bash commands and file operations, then submit for verification.

Environment Difficulty

Terminal-Bench 2.0 contains tasks designed to test frontier agent capabilities. Tasks require deep domain expertise, multi-step reasoning, and precise command-line operations. Each task has received extensive human and LM-assisted validation to ensure solvability and realistic difficulty.

Other Environment Requirements

There are no further environment requirements; Terminal-Bench 2.0 works out of the box with the OpenReward endpoint.

Safety

Agents in Terminal-Bench 2.0 operate in isolated Docker containers. Tasks involve system administration, security analysis, and code execution within sandboxed environments. The environment does not present direct safety risks.

Citation

@misc{merrill2026terminalbenchbenchmarkingagentshard,
  title={Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces},
  author={Mike A. Merrill and Alexander G. Shaw and Nicholas Carlini and Boxuan Li and Harsh Raj and Ivan Bercovich and Lin Shi and Jeong Yeon Shin and Thomas Walshe and E. Kelly Buchanan and Junhong Shen and Guanghao Ye and Haowei Lin and Jason Poulos and Maoyu Wang and Marianna Nezhurina and Jenia Jitsev and Di Lu and Orfeas Menis Mastromichalakis and Zhiwei Xu and Zizhao Chen and Yue Liu and Robert Zhang and Leon Liangyu Chen and Anurag Kashyap and Jan-Lucas Uslu and Jeffrey Li and Jianbo Wu and Minghao Yan and Song Bian and Vedang Sharma and Ke Sun and Steven Dillmann and Akshay Anand and Andrew Lanpouthakoun and Bardia Koopah and Changran Hu and Etash Guha and Gabriel H. S. Dreiman and Jiacheng Zhu and Karl Krauth and Li Zhong and Niklas Muennighoff and Robert Amanfu and Shangyin Tan and Shreyas Pimpalgaonkar and Tushar Aggarwal and Xiangning Lin and Xin Lan and Xuandong Zhao and Yiqing Liang and Yuanli Wang and Zilong Wang and Changzhi Zhou and David Heineman and Hange Liu and Harsh Trivedi and John Yang and Junhong Lin and Manish Shetty and Michael Yang and Nabil Omi and Negin Raoof and Shanda Li and Terry Yue Zhuo and Wuwei Lin and Yiwei Dai and Yuxin Wang and Wenhao Chai and Shang Zhou and Dariush Wahdany and Ziyu She and Jiaming Hu and Zhikang Dong and Yuxuan Zhu and Sasha Cui and Ahson Saiyed and Arinbjörn Kolbeinsson and Jesse Hu and Christopher Michael Rytting and Ryan Marten and Yixin Wang and Alex Dimakis and Andy Konwinski and Ludwig Schmidt},
  year={2026},
  eprint={2601.11868},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2601.11868}
}