Terminal-Bench 2.0
Description
Terminal-Bench 2.0 is an environment for evaluating agents on complex terminal-based tasks, each running in a dedicated Docker container with task-specific tooling. Tasks span diverse domains, including protein assembly, cryptanalysis, async debugging, security vulnerability resolution, ML model recovery, compiler building, and system administration.
Capabilities
- Complex terminal and shell operations
- File manipulation and code editing
- System debugging and configuration
- Machine learning and data processing
- Security analysis and vulnerability fixing
- Compiler and build system tasks
- Database recovery and optimization
Compute Requirements
Agents are given a sandbox with 1 CPU and 2GB RAM per task. Each task runs in a pre-built Docker container with task-specific dependencies.
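As a config-style sketch, the stated per-task budget corresponds to standard Docker resource flags; the image name below is a placeholder, not an actual task image:

```shell
# Illustrative only -- "example-task:latest" is a placeholder image name.
# The flags express the 1 CPU / 2GB RAM budget described above.
docker run --rm \
  --cpus=1 \
  --memory=2g \
  example-task:latest
```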
License
Tasks
There is one split in this environment:
- test: 89 tasks
Tasks include:
- Scientific Computing: adaptive-rejection-sampler, mcmc-sampling-stan, raman-fitting, portfolio-optimization
- ML/AI: caffe-cifar-10, hf-model-inference, pytorch-model-recovery, gpt2-codegolf, llm-inference-batching-scheduler
- Systems: qemu-startup, qemu-alpine-ssh, nginx-request-logging, kv-store-grpc
- Compilers/Build: build-cython-ext, compile-compcert, build-pov-ray, build-pmars
- Security: crack-7z-hash, feal-differential-cryptanalysis, feal-linear-cryptanalysis, fix-code-vulnerability, vulnerable-secret
- Git/VCS: fix-git, git-leak-recovery, git-multibranch, sanitize-git-repo
- Bioinformatics: dna-assembly, dna-insert, protein-assembly
- And many more...
Reward Structure
This is a sparse, verifiable reward environment. The agent calls submit_answer to run the task's test suite:
- 1.0: All tests pass
- 0.0: Any test fails; partial completion earns no credit
Rewards are read from /logs/verifier/reward.txt or /logs/verifier/reward.json after test execution.
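A minimal sketch of how a harness might read that reward after test execution. The top-level "reward" JSON field and the temp-directory setup are assumptions made so the demo is self-contained; a real harness would point LOG_DIR at /logs/verifier.

```shell
# Sketch only: read the verifier's reward after test execution.
# LOG_DIR would normally be /logs/verifier; a temp dir keeps the demo runnable.
LOG_DIR="$(mktemp -d)"

read_reward() {
  if [ -f "$LOG_DIR/reward.txt" ]; then
    cat "$LOG_DIR/reward.txt"
  elif [ -f "$LOG_DIR/reward.json" ]; then
    # Assumes a top-level "reward" field in reward.json.
    sed -n 's/.*"reward"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p' "$LOG_DIR/reward.json"
  else
    echo "no reward recorded" >&2
    return 1
  fi
}

# Simulate a passing run: the verifier writes 1.0 when all tests pass.
echo "1.0" > "$LOG_DIR/reward.txt"
read_reward
```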
Data
Each task directory contains:
- instruction.md: Task description and requirements
- environment/Dockerfile: Task-specific container image
- tests/test.sh: Verification script
- solution/solve.sh: Reference solution (not visible to the agent)
Task data is stored in the repository.
Tools
| Tool | Description |
|---|---|
| bash | Execute bash commands in the container |
| view | View file contents or directory listings |
| str_replace | Replace a unique string in a file |
| create_file | Create a new file with specified content |
| submit_answer | Run the test suite and submit the solution |
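To make the str_replace contract concrete, here is a hypothetical shell emulation of its uniqueness check. The actual tool is invoked through the agent API, not as a shell function; this sketch only illustrates the "exactly one match" semantics.

```shell
# Hypothetical emulation of str_replace's contract -- not the actual tool.
str_replace() {
  file="$1"; old="$2"; new="$3"
  # Refuse to edit unless the target string occurs exactly once in the file.
  count=$(grep -o -F -- "$old" "$file" | wc -l)
  if [ "$count" -ne 1 ]; then
    echo "str_replace: expected exactly 1 match, found $count" >&2
    return 1
  fi
  # Literal (non-regex) replacement of the unique occurrence.
  awk -v old="$old" -v new="$new" \
    '{ i = index($0, old); if (i) $0 = substr($0, 1, i-1) new substr($0, i+length(old)) } 1' \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demo on a scratch file.
f="$(mktemp)"
printf 'hello world\n' > "$f"
str_replace "$f" "world" "terminal"
cat "$f"
```

Failing the uniqueness check (zero or multiple matches) returns a nonzero status instead of editing, which is what makes this style of edit safe for agents operating on unfamiliar files.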
Time Horizon
Multi-turn. Agents explore the containerized environment, implement solutions using bash commands and file operations, then submit for verification.
Environment Difficulty
Terminal-Bench 2.0 contains tasks designed to test frontier agent capabilities. Tasks require deep domain expertise, multi-step reasoning, and precise command-line operations. Each task has received extensive human and LM-assisted validation to ensure solvability and realistic difficulty.
Other Environment Requirements
There are no further environment requirements; Terminal-Bench 2.0 works out of the box with the OpenReward endpoint.
Safety
Agents in Terminal-Bench 2.0 operate in isolated Docker containers. Tasks involve system administration, security analysis, and code execution within sandboxed environments. The environment does not present direct safety risks.
Citation
@misc{merrill2026terminalbenchbenchmarkingagentshard,
title={Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces},
author={Mike A. Merrill and Alexander G. Shaw and Nicholas Carlini and Boxuan Li and Harsh Raj and Ivan Bercovich and Lin Shi and Jeong Yeon Shin and Thomas Walshe and E. Kelly Buchanan and Junhong Shen and Guanghao Ye and Haowei Lin and Jason Poulos and Maoyu Wang and Marianna Nezhurina and Jenia Jitsev and Di Lu and Orfeas Menis Mastromichalakis and Zhiwei Xu and Zizhao Chen and Yue Liu and Robert Zhang and Leon Liangyu Chen and Anurag Kashyap and Jan-Lucas Uslu and Jeffrey Li and Jianbo Wu and Minghao Yan and Song Bian and Vedang Sharma and Ke Sun and Steven Dillmann and Akshay Anand and Andrew Lanpouthakoun and Bardia Koopah and Changran Hu and Etash Guha and Gabriel H. S. Dreiman and Jiacheng Zhu and Karl Krauth and Li Zhong and Niklas Muennighoff and Robert Amanfu and Shangyin Tan and Shreyas Pimpalgaonkar and Tushar Aggarwal and Xiangning Lin and Xin Lan and Xuandong Zhao and Yiqing Liang and Yuanli Wang and Zilong Wang and Changzhi Zhou and David Heineman and Hange Liu and Harsh Trivedi and John Yang and Junhong Lin and Manish Shetty and Michael Yang and Nabil Omi and Negin Raoof and Shanda Li and Terry Yue Zhuo and Wuwei Lin and Yiwei Dai and Yuxin Wang and Wenhao Chai and Shang Zhou and Dariush Wahdany and Ziyu She and Jiaming Hu and Zhikang Dong and Yuxuan Zhu and Sasha Cui and Ahson Saiyed and Arinbjörn Kolbeinsson and Jesse Hu and Christopher Michael Rytting and Ryan Marten and Yixin Wang and Alex Dimakis and Andy Konwinski and Ludwig Schmidt},
year={2026},
eprint={2601.11868},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2601.11868}
}