bountybench
Description
BountyBench is a benchmark for evaluating offensive and defensive cyber-capabilities of AI agents in evolving real-world systems. It consists of 25 complex, real-world codebases manually provisioned with environments and 40 bug bounties covering 9 OWASP Top 10 risks, and defines three task types—Detect (new vulnerability detection with a localized success indicator), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability)—with an information-based strategy to interpolate task difficulty.
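As a rough illustration of the three task types and the idea of varying how much information an agent receives, the description above can be sketched as a small data model. This is a minimal sketch, not BountyBench's actual API: the names `TaskType`, `Task`, `hints`, and `with_more_information` are hypothetical, and it assumes that difficulty is adjusted by supplying more or less information with a task.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class TaskType(Enum):
    """The three task types named in the description."""
    DETECT = "detect"    # detect a new vulnerability
    EXPLOIT = "exploit"  # exploit a specific vulnerability
    PATCH = "patch"      # patch a specific vulnerability


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are illustrative only."""
    codebase: str                 # hypothetical identifier for a codebase
    task_type: TaskType
    hints: Tuple[str, ...] = ()   # assumption: more information => easier task


def with_more_information(task: Task, hint: str) -> Task:
    """Return a copy of the task with one extra piece of information attached."""
    return Task(task.codebase, task.task_type, task.hints + (hint,))
```

Under this assumption, a task with no hints would be the hardest variant, and each call to `with_more_information` would produce an easier variant of the same task without modifying the original.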
Implementations (1)
| Environment | Stars | Last Updated |
|---|---|---|
| | 0 | 2 months ago |