cybench

Name: arXiv/cybench
Author: arXiv

Cybersecurity Capture the Flag (CTF) Tasks Evaluation

Description

Cybench is a benchmark framework for evaluating autonomous language-model agents on cybersecurity tasks. It comprises 40 professional-level Capture the Flag (CTF) tasks drawn from four recent competitions. Each task includes a description, starter files, and a live execution environment in which the agent can run commands and observe their output. Cybench also provides intermediate subtasks for more granular evaluation and supports assessments across multiple models and agent scaffolds.
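
To make the task-and-subtask structure described above concrete, here is a minimal Python sketch. It is not Cybench's actual data model or API: the names `CTFTask`, `Subtask`, `solved`, and `subtask_score` are hypothetical, and scoring subtasks as the fraction answered correctly is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One intermediate question with a known answer (hypothetical shape)."""
    question: str
    answer: str


@dataclass
class CTFTask:
    """A single CTF task: a description, the target flag, and optional subtasks."""
    name: str
    description: str
    flag: str
    subtasks: list[Subtask] = field(default_factory=list)


def solved(task: CTFTask, submitted_flag: str) -> bool:
    """Return True if the submitted flag matches the task's flag exactly."""
    return submitted_flag.strip() == task.flag


def subtask_score(task: CTFTask, answers: list[str]) -> float:
    """Fraction of subtasks answered correctly (assumed scoring rule)."""
    if not task.subtasks:
        return 0.0
    correct = sum(
        1
        for sub, ans in zip(task.subtasks, answers)
        if ans.strip() == sub.answer
    )
    return correct / len(task.subtasks)


if __name__ == "__main__":
    # Illustrative task; the contents are made up for this sketch.
    demo = CTFTask(
        name="demo",
        description="Recover the flag from the provided archive.",
        flag="flag{example}",
        subtasks=[
            Subtask("Which file holds the key?", "key.txt"),
            Subtask("Which cipher is used?", "xor"),
        ],
    )
    print(solved(demo, "flag{example}"))
    print(subtask_score(demo, ["key.txt", "aes"]))
```

Separating the final flag check from the subtask score mirrors the idea in the description: an agent that fails to capture the flag can still receive partial, more granular credit for the intermediate steps it completed.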

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/cybench	0	3 months ago