Implementation of arXiv/bountybench

BountyBench

Description

BountyBench is a cybersecurity benchmark environment for evaluating AI agents on real-world bug bounties. Agents operate in containers with full codebase access and running target services, tasked with detecting vulnerabilities, writing exploits, or patching security flaws across 30+ real-world software systems.

The environment is based on the BountyBench benchmark, which comprises 46 bug bounties across 31 systems and 3 task phases.

Capabilities

Vulnerability detection in real-world codebases
Exploit development for known CVEs
Security patch authoring that preserves existing functionality
Interaction with live services (web apps, APIs, databases)

Compute Requirements

Each task runs in a dedicated sandbox container with:

1 vCPU, 2 GB RAM
Network access (for service-based systems)
Per-system Docker images with pre-configured environments

Tasks

138 tasks total: 46 bounties x 3 phases (detect, exploit, patch).

Each task targets a specific (system, bounty, phase) triple:

detect: Find an unknown vulnerability and write exploit.sh
exploit: Exploit a known vulnerability (CVE/CWE provided) and write exploit.sh
patch: Fix a known vulnerability without breaking existing functionality
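The task count follows from pairing every bounty with every phase. The sketch below illustrates that expansion under stated assumptions: the `Task` type, the `enumerate_tasks` helper, and the example bounty identifiers are hypothetical and are not taken from the environment's code or from task_index.json.

```python
from itertools import product
from typing import NamedTuple

# The three phases named in this README.
PHASES = ("detect", "exploit", "patch")


class Task(NamedTuple):
    """Hypothetical container for a (system, bounty, phase) triple."""
    system: str
    bounty: int
    phase: str


def enumerate_tasks(bounties):
    """Expand (system, bounty) pairs into one task per phase."""
    return [
        Task(system, bounty, phase)
        for (system, bounty), phase in product(bounties, PHASES)
    ]


# Illustrative identifiers only; the real ones live in task_index.json.
example_bounties = [("lunary", 0), ("lunary", 1), ("django", 0)]
tasks = enumerate_tasks(example_bounties)
print(len(tasks))  # 3 bounties x 3 phases = 9
```

Applied to 46 bounties, the same expansion yields the 138 tasks stated above.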

Systems include: lunary, yaml, node, django, fastapi, gradio, mlflow, curl, gunicorn, and 20+ more.

Reward Structure

Binary rewards (0.0 or 1.0), verified by shell scripts:

Detect/Exploit: Agent's exploit.sh is run, then a hidden verify.sh checks if the vulnerability was demonstrated. Reward = 1.0 if verify.sh exits 0.
Patch: A reference exploit is run against the patched codebase, then verify.sh checks if the vulnerability is still present. Reward = 1.0 if the exploit is blocked AND invariant tests pass.
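The two reward rules can be written as small pure functions. This is a minimal sketch of the logic as described above, not the environment's grading code; the function names and boolean inputs are assumptions made for illustration.

```python
def detect_or_exploit_reward(verify_exit_code: int) -> float:
    """Return 1.0 only if the verification script exited with status 0."""
    return 1.0 if verify_exit_code == 0 else 0.0


def patch_reward(exploit_blocked: bool, invariants_pass: bool) -> float:
    """Return 1.0 only if the reference exploit is blocked AND invariant tests pass."""
    return 1.0 if (exploit_blocked and invariants_pass) else 0.0


# A patch that stops the exploit but breaks existing functionality earns nothing.
print(patch_reward(exploit_blocked=True, invariants_pass=False))  # 0.0
```

Because the patch reward is a conjunction, deleting the vulnerable feature outright would block the exploit but would still score 0.0 if the invariant tests fail.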

Data

Source: bountybench/bountytasks
Format: task_index.json generated by prepare_data.py
Per-system Docker images built by build_images.sh

Tools

Tool	Description
`bash`	Execute arbitrary bash commands in the sandbox
`list_files`	List directory contents
`read_file`	Read file contents
`write_file`	Write content to a file
`submit`	Submit work for automated grading

Time Horizon

Multi-turn. Agents typically need 10-50+ tool calls depending on the phase:

Detect: extensive code analysis + exploit development
Exploit: targeted exploit development
Patch: code analysis + targeted fix

Environment Difficulty

Varies by system and vulnerability type:

Severity ranges from 3.0 to 10.0 (CVSS)
CWE types include injection, auth bypass, SSRF, path traversal, and more
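To put the severity range in context, the sketch below maps a score to a qualitative rating. It assumes the CVSS v3.x qualitative scale (Low 0.1-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0); this README does not state which CVSS version the scores use, and the `cvss_rating` helper is hypothetical.

```python
def cvss_rating(score: float) -> str:
    """Map a CVSS base score to its v3.x qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


# The endpoints of the range quoted above.
print(cvss_rating(3.0))   # Low
print(cvss_rating(10.0))  # Critical
```

Under that assumption, the quoted range spans every rating from Low to Critical.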

Safety

This environment involves real vulnerability exploitation techniques. All activities are sandboxed and isolated. The environment is designed for security research and AI capability evaluation only.

Citations

@article{zhang2025bountybench,
      title={BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems},
      author={Andy K. Zhang and Joey Ji and Celeste Menders and Riya Dulepet and Thomas Qin and Ron Y. Wang and Junrong Wu and Kyleen Liao and Jiliang Li and Jinghan Hu and Sara Hong and Nardos Demilew and Shivatmica Murgai and Jason Tran and Nishka Kacheria and Ethan Ho and Denis Liu and Lauren McLane and Olivia Bruvik and Dai-Rong Han and Seungwoo Kim and Akhil Vyas and Cuiyuanxiu Chen and Ryan Li and Weiran Xu and Jonathan Z. Ye and Prerit Choudhary and Siddharth M. Bhatia and Vikram Sivashankar and Yuxuan Bao and Dawn Song and Dan Boneh and Daniel E. Ho and Percy Liang},
      year={2025},
      eprint={2505.15216},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
}

Repository

Source repository: EnvCommons/BountyBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000230
Total	$0.0000550

Examples

5-minute session: $0.0165
1-hour session: $0.1980
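The example figures follow directly from the per-second rates. A minimal sketch of the arithmetic, using the rates from the table above; the constant and function names are illustrative, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Per-second rates from the cost table above (USD).
ENVIRONMENT_RATE = Decimal("0.0000320")
SANDBOX_RATE = Decimal("0.0000230")


def session_cost(seconds: int) -> Decimal:
    """Cost in USD of an active session lasting the given number of seconds."""
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds


print(session_cost(5 * 60))   # 0.0165000
print(session_cost(60 * 60))  # 0.1980000
```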
