terminalbench2
Description
Terminal-Bench 2.0 is a benchmark for evaluating AI agents' ability to autonomously complete hard, long-horizon tasks in computer terminal environments drawn from real workflows. It consists of 89 carefully curated tasks, each with a unique environment, a human-written solution, and comprehensive tests for verification. Frontier models and agents score below 65% on the benchmark.
Implementations (1)
| Environment | Stars | Last Updated | |
|---|---|---|---|
| | 1 | 1 month ago | |