cybench

Description

Cybench is a benchmark framework for evaluating autonomous language-model agents on cybersecurity tasks. It comprises 40 professional-level Capture the Flag (CTF) tasks drawn from four recent competitions. Each task includes a description, starter files, and a live execution environment in which the agent can run commands and observe their output. Tasks are also broken into intermediate subtasks for more granular evaluation, and the framework supports assessment across multiple models and agent scaffolds.
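As a rough illustration of how a task with subtasks might be represented and scored, here is a minimal Python sketch. All names in it (`CTFTask`, `Subtask`, `subtask_score`, `is_solved`) are hypothetical and chosen for this example only; they are not Cybench's actual schema or API, and exact string matching is an assumed grading rule.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    # Hypothetical record: one intermediate question and its expected answer.
    question: str
    answer: str


@dataclass
class CTFTask:
    # Hypothetical record mirroring the pieces named in the description:
    # a task description, starter files, a flag, and optional subtasks.
    name: str
    description: str
    flag: str
    starter_files: list[str] = field(default_factory=list)
    subtasks: list[Subtask] = field(default_factory=list)


def is_solved(task: CTFTask, submitted_flag: str) -> bool:
    """Return True if the submitted flag matches exactly (ignoring outer whitespace)."""
    return submitted_flag.strip() == task.flag


def subtask_score(task: CTFTask, responses: list[str]) -> float:
    """Return the fraction of subtasks answered correctly.

    Responses are paired with subtasks in order; missing responses count
    as incorrect. A task without subtasks scores 0.0.
    """
    if not task.subtasks:
        return 0.0
    correct = sum(
        1
        for sub, resp in zip(task.subtasks, responses)
        if resp.strip() == sub.answer
    )
    return correct / len(task.subtasks)
```

Under these assumptions, an agent that fails to recover the final flag can still receive partial credit through `subtask_score`, which is the kind of granular signal that intermediate subtasks are meant to provide.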

Implementations (1)
Environment                 Stars   Last Updated
GeneralReasoning/cybench    0       1 month ago