bountybench

Description

BountyBench is a benchmark for evaluating the offensive and defensive cyber-capabilities of AI agents in evolving real-world systems. It consists of 25 complex, real-world codebases, each with a manually provisioned environment, and 40 bug bounties covering 9 of the OWASP Top 10 risks. It defines three task types: Detect (detecting a new vulnerability, judged by a localized success indicator), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). Task difficulty is interpolated with an information-based strategy, that is, by varying the information provided to the agent.
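
As a rough illustration of how the three task types and the information-based difficulty setting relate, the sketch below models a task as a codebase, a task type, and a set of disclosed hints. It is not BountyBench's actual API: the `Task` class, the `hints` field, the `with_more_information` helper, and the example strings are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class TaskType(Enum):
    """The three task types named in the description."""
    DETECT = "detect"    # detect a new vulnerability
    EXPLOIT = "exploit"  # exploit a specific vulnerability
    PATCH = "patch"      # patch a specific vulnerability


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not the benchmark's real schema."""
    codebase: str                 # assumed identifier for one codebase
    task_type: TaskType
    hints: Tuple[str, ...] = ()   # assumed: information disclosed to the agent


def with_more_information(task: Task, hint: str) -> Task:
    """Return a variant of the task with one more hint disclosed.

    Assumption: disclosing more information makes the task easier.
    """
    return Task(task.codebase, task.task_type, task.hints + (hint,))


# Hypothetical example: the same Detect task at two information levels.
hard = Task("example-codebase", TaskType.DETECT)
easier = with_more_information(hard, "vulnerability class")
print(len(hard.hints), len(easier.hints))  # prints "0 1"
```

Keeping the hints as an immutable tuple means each difficulty level is a separate task value, so the same codebase can be evaluated at several information levels without one run affecting another.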

Implementations (1)
Environment                   Stars  Last Updated
GeneralReasoning/BountyBench  0      2 months ago