Terminal-Bench 2.0

OpenReward Environment

Description

Terminal-Bench 2.0 is an environment for evaluating agents on complex terminal-based tasks in containerized environments. Tasks span diverse domains including protein assembly, cryptanalysis, async debugging, security vulnerability resolution, ML model recovery, compiler building, and system administration. Each task runs in a dedicated Docker container with task-specific tooling.

Capabilities

  • Complex terminal and shell operations
  • File manipulation and code editing
  • System debugging and configuration
  • Machine learning and data processing
  • Security analysis and vulnerability fixing
  • Compiler and build system tasks
  • Database recovery and optimization

Compute Requirements

Agents are given a sandbox with 1 CPU and 2 GB of RAM per task. Each task runs in a pre-built Docker container with task-specific dependencies.
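
These limits can be approximated locally with Docker's resource flags. A minimal sketch, assuming a hypothetical image tag (the actual image names are task-specific and not listed here):

    # Approximate the per-task sandbox: 1 CPU, 2 GB RAM.
    # "terminal-bench/example-task:latest" is a placeholder image name.
    docker run --rm -it \
      --cpus=1 \
      --memory=2g \
      terminal-bench/example-task:latest \
      bash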

License

Apache 2.0

Tasks

There is one split in this environment:

  • test: 89 tasks

Tasks include:

  • Scientific Computing: adaptive-rejection-sampler, mcmc-sampling-stan, raman-fitting, portfolio-optimization
  • ML/AI: caffe-cifar-10, hf-model-inference, pytorch-model-recovery, gpt2-codegolf, llm-inference-batching-scheduler
  • Systems: qemu-startup, qemu-alpine-ssh, nginx-request-logging, kv-store-grpc
  • Compilers/Build: build-cython-ext, compile-compcert, build-pov-ray, build-pmars
  • Security: crack-7z-hash, feal-differential-cryptanalysis, feal-linear-cryptanalysis, fix-code-vulnerability, vulnerable-secret
  • Git/VCS: fix-git, git-leak-recovery, git-multibranch, sanitize-git-repo
  • Bioinformatics: dna-assembly, dna-insert, protein-assembly
  • And many more...

Reward Structure

This is a sparse, verifiable reward environment. The agent calls submit_answer to run the task's test suite:

  • 1.0: All tests pass
  • 0.0: Any test fails; partial completions earn no credit

Rewards are read from /logs/verifier/reward.txt or /logs/verifier/reward.json after test execution.
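
A minimal sketch of how a harness might read the reward after the verifier runs. The top-level "reward" key in the JSON variant is an assumption, not a documented schema:

    # Read the reward from whichever file the verifier produced.
    if [ -f /logs/verifier/reward.txt ]; then
      reward=$(cat /logs/verifier/reward.txt)
    elif [ -f /logs/verifier/reward.json ]; then
      # Assumes a top-level "reward" key; the actual schema may differ.
      reward=$(jq -r '.reward' /logs/verifier/reward.json)
    fi
    echo "reward: ${reward:-unknown}"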

Data

Each task directory contains:

  • instruction.md: Task description and requirements
  • environment/Dockerfile: Task-specific container image
  • tests/test.sh: Verification script
  • solution/solve.sh: Reference solution (not visible to agent)

Task data is stored in the repository.
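
To show how these pieces fit together, here is a hypothetical, minimal tests/test.sh that runs a task's checks and emits the sparse reward described above. Real verification scripts are task-specific; ./run_checks.sh is a placeholder:

    #!/usr/bin/env bash
    # Hypothetical minimal verifier: run the task's checks, then write the reward.
    set -u

    mkdir -p /logs/verifier

    if ./run_checks.sh; then    # placeholder for the task's actual test commands
      echo "1.0" > /logs/verifier/reward.txt
    else
      echo "0.0" > /logs/verifier/reward.txt
    fi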

Tools

Tool           Description
-------------  ------------------------------------------
bash           Execute bash commands in the container
view           View file contents or directory listings
str_replace    Replace a unique string in a file
create_file    Create a new file with specified content
submit_answer  Run test suite and submit solution
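
For intuition, the file tools map onto familiar shell operations. The commands below are rough bash analogues with illustrative file names, not the tool-call interface itself (agents invoke the tools directly):

    # view: inspect a file or a directory listing
    cat instruction.md
    ls -la src/

    # str_replace: swap a string that occurs exactly once in a file
    grep -c 'old_value' src/app.py              # should print 1 before replacing
    sed -i 's/old_value/new_value/' src/app.py  # sed shown as a rough analogue

    # create_file: create a new file with the given content
    cat > notes.txt <<'EOF'
    scratch notes for the task
    EOF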

Time Horizon

Multi-turn. Agents explore the containerized environment, implement solutions using bash commands and file operations, then submit for verification.

Environment Difficulty

Terminal-Bench 2.0 contains tasks designed to test frontier agent capabilities. Tasks require deep domain expertise, multi-step reasoning, and precise command-line operations. Each task has received extensive human and LM-assisted validation to ensure solvability and realistic difficulty.

Other Environment Requirements

There are no further environment requirements; Terminal-Bench 2.0 works out of the box with the OpenReward endpoint.

Safety

Agents in Terminal-Bench 2.0 operate in isolated Docker containers. Tasks involve system administration, security analysis, and code execution within sandboxed environments. The environment does not present direct safety risks.

Citation

@misc{merrill2026terminalbenchbenchmarkingagentshard,
  title={Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces},
  author={Mike A. Merrill and Alexander G. Shaw and Nicholas Carlini and Boxuan Li and Harsh Raj and Ivan Bercovich and Lin Shi and Jeong Yeon Shin and Thomas Walshe and E. Kelly Buchanan and Junhong Shen and Guanghao Ye and Haowei Lin and Jason Poulos and Maoyu Wang and Marianna Nezhurina and Jenia Jitsev and Di Lu and Orfeas Menis Mastromichalakis and Zhiwei Xu and Zizhao Chen and Yue Liu and Robert Zhang and Leon Liangyu Chen and Anurag Kashyap and Jan-Lucas Uslu and Jeffrey Li and Jianbo Wu and Minghao Yan and Song Bian and Vedang Sharma and Ke Sun and Steven Dillmann and Akshay Anand and Andrew Lanpouthakoun and Bardia Koopah and Changran Hu and Etash Guha and Gabriel H. S. Dreiman and Jiacheng Zhu and Karl Krauth and Li Zhong and Niklas Muennighoff and Robert Amanfu and Shangyin Tan and Shreyas Pimpalgaonkar and Tushar Aggarwal and Xiangning Lin and Xin Lan and Xuandong Zhao and Yiqing Liang and Yuanli Wang and Zilong Wang and Changzhi Zhou and David Heineman and Hange Liu and Harsh Trivedi and John Yang and Junhong Lin and Manish Shetty and Michael Yang and Nabil Omi and Negin Raoof and Shanda Li and Terry Yue Zhuo and Wuwei Lin and Yiwei Dai and Yuxin Wang and Wenhao Chai and Shang Zhou and Dariush Wahdany and Ziyu She and Jiaming Hu and Zhikang Dong and Yuxuan Zhu and Sasha Cui and Ahson Saiyed and Arinbjörn Kolbeinsson and Jesse Hu and Christopher Michael Rytting and Ryan Marten and Yixin Wang and Alex Dimakis and Andy Konwinski and Ludwig Schmidt},
  year={2026},
  eprint={2601.11868},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2601.11868}
}