terminalbench2

Description

Terminal-Bench 2.0 is a benchmark for evaluating how well AI agents autonomously complete hard, long-horizon tasks in computer terminal environments drawn from real workflows. It consists of 89 carefully curated tasks, each with a unique environment, a human-written solution, and comprehensive tests for verification. Frontier models and agents score less than 65% on the benchmark.

Implementations (1)
Environment                        Stars   Last Updated
GeneralReasoning/TerminalBench2    1       1 month ago