terminalbench2

Name: arXiv/terminalbench2
Author: arXiv

Description

Terminal-Bench 2.0 is a benchmark for evaluating how well AI agents autonomously complete hard, long-horizon tasks in computer terminal environments, with tasks drawn from real workflows. It consists of 89 carefully curated tasks, each with a unique environment, a human-written solution, and comprehensive tests for verification. Frontier models and agents score less than 65% on it.
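
To make the 65% figure concrete, here is a minimal sketch of how a score over the 89 tasks could be computed. It assumes a single run in which each task counts as solved only if its verification tests pass, and the score is the fraction of solved tasks. The `TaskResult` type, the task IDs, and the 57-of-89 outcome are hypothetical illustrations, not data or code from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task in a single run (hypothetical structure)."""
    task_id: str
    tests_passed: bool


def resolution_rate(results: list[TaskResult]) -> float:
    """Return the fraction of tasks whose verification tests passed."""
    if not results:
        raise ValueError("no task results to score")
    solved = sum(1 for r in results if r.tests_passed)
    return solved / len(results)


# Hypothetical run: the first 57 of 89 tasks pass, the rest fail.
results = [TaskResult(f"task-{i:02d}", i < 57) for i in range(89)]
rate = resolution_rate(results)
print(f"{rate:.1%} of {len(results)} tasks solved")
```

Under this assumed scoring, staying below 65% on 89 tasks means solving at most 57 of them in a run, since 58 of 89 is already above 65%.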

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/TerminalBench2	1	3 months ago