bountybench

Name: arXiv/bountybench
Author: arXiv

Cybersecurity Vulnerability Detection, Exploitation, and Patching

Description

BountyBench is a benchmark for evaluating the offensive and defensive cyber-capabilities of AI agents in evolving real-world systems. It comprises 25 complex, real-world codebases with manually provisioned environments and 40 bug bounties covering 9 of the OWASP Top 10 risks. It defines three task types: Detect (detecting a new vulnerability, with a localized success indicator), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). Task difficulty is interpolated using an information-based strategy.

Implementations (1)

Environment	Stars	Last Updated
GeneralReasoning/BountyBench	0	3 months ago