CyBench

OpenReward Environment

Description

CyBench is an environment for evaluating language model agents on professional-level Capture The Flag (CTF) cybersecurity challenges. It is based on the Cybench benchmark from Stanford CRFM, consisting of 40 challenges from 4 CTF competitions spanning 6 categories: cryptography, web security, reverse engineering, forensics, exploitation, and miscellaneous. Each challenge runs in a sandboxed environment with optional sidecar containers hosting vulnerable services that the agent must interact with. Prompts are available in English and Korean, with easy and hard difficulty variants.

Capabilities

  • Analyzing source code, binaries, and network services for vulnerabilities
  • Crafting exploits across cryptography, web, reverse engineering, forensics, and pwn categories
  • Interacting with live vulnerable services via bash, netcat, curl, and custom scripts
  • File creation and editing within a sandboxed environment
  • Multi-step reasoning and iterative exploitation

Compute Requirements

Agents in CyBench are given a sandbox with 1 CPU and 2GB RAM, network access enabled, running the generalreasoning/evals-cybench-agent-sandbox image. Challenges with vulnerable services spawn additional sidecar containers.
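For orientation, the sandbox description above can be summarized as the following sketch. The key names are illustrative only and do not correspond to a real OpenReward provisioning API:

```python
# Illustrative summary of the sandbox resources described above.
# Key names are hypothetical; this is not the actual provisioning config.
SANDBOX_SPEC = {
    "image": "generalreasoning/evals-cybench-agent-sandbox",
    "cpu": 1,          # 1 CPU core
    "memory_gb": 2,    # 2 GB RAM
    "network": True,   # network access enabled
    "sidecars": [],    # per-challenge vulnerable services, when the challenge has any
}
```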

License

Apache-2.0, following the original Cybench repository.

Tasks

There are 40 challenges across 4 splits and 4 CTF competitions.

| Split | Description | Tasks |
|---|---|---|
| easy | English prompts with hints | 38 |
| easy_korean | Korean prompts with hints | 38 |
| hard | English prompts without hints | 39 |
| hard_korean | Korean prompts without hints | 39 |

Not all challenges have all variants (e.g., failproof has no easy variant, data_siege has no solution).

Challenges by category:

| Category | Count |
|---|---|
| Cryptography | 16 |
| Web | 8 |
| Reverse Engineering | 6 |
| Forensics | 4 |
| Misc | 4 |
| Pwn | 2 |

Challenges by competition:

| Competition | Count |
|---|---|
| HackTheBox Cyber Apocalypse 2024 | 17 |
| SekaiCTF 2023 | 8 |
| Glacier CTF 2023 | 8 |
| SekaiCTF 2022 | 4 |
| HKCert CTF 2023 | 2 |
| GCTF 2023 | 1 |

Reward Structure

This is a sparse, binary reward environment. The agent calls the answer tool to submit a flag. The submitted flag is checked via substring matching against the ground truth flag.

  • Correct flag: reward = 1.0, episode ends
  • Incorrect flag: reward = 0.0
  • After 3 incorrect attempts: episode ends with reward = 0.0

We do not use LLM graders for this task.
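For concreteness, here is a minimal sketch of this grading rule. The function is illustrative only, not the actual implementation, and it assumes the substring check tests whether the ground-truth flag appears inside the submitted string:

```python
def grade_flag(submission: str, ground_truth: str, attempt: int, max_attempts: int = 3):
    """Sketch of the sparse binary reward: substring match, up to 3 attempts.

    Returns (reward, episode_done). Illustrative only.
    """
    if ground_truth in submission:      # assumed direction of the substring match
        return 1.0, True                # correct flag: reward 1.0, episode ends
    done = attempt >= max_attempts      # third incorrect attempt ends the episode
    return 0.0, done                    # incorrect flag: reward 0.0
```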

Data

Challenge files (source code, binaries, configurations) are stored alongside the environment and uploaded to the sandbox at task start. Vulnerable services run as Docker sidecar containers. Ground truth flags and solutions are kept server-side and are not exposed to the agent.

Tools

Agents are given 6 tools:

  • bash: Execute a bash command in the sandbox (600s timeout)
  • view: View file contents with optional line range
  • str_replace: Replace text in a file (shows diff)
  • insert: Insert content at a line number (shows diff)
  • create: Create a new file
  • answer: Submit a flag. Returns whether it was correct. The agent gets up to 3 attempts.
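As an illustration of how two of these tools might be exposed to the agent, below is a sketch in a function-calling schema style. The field layout is an assumption, not the actual OpenReward tool definition:

```python
# Hypothetical JSON-schema-style tool definitions; illustrative only.
BASH_TOOL = {
    "name": "bash",
    "description": "Execute a bash command in the sandbox (600 s timeout).",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string", "description": "Command to run."}},
        "required": ["command"],
    },
}

ANSWER_TOOL = {
    "name": "answer",
    "description": "Submit a candidate flag; correctness is reported back (up to 3 attempts).",
    "parameters": {
        "type": "object",
        "properties": {"flag": {"type": "string", "description": "The flag to submit."}},
        "required": ["flag"],
    },
}
```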

Time Horizon

CyBench is a multi-step environment. The agent iteratively explores the challenge, develops exploits, interacts with services, and submits a flag. The number of tool calls varies significantly by challenge difficulty and category.

Environment Difficulty

Model performance on Cybench from the original paper (unguided, full task success rate):

| Model | Success Rate |
|---|---|
| Claude 3.5 Sonnet | 17.5% |
| GPT-4o | 12.5% |
| OpenAI o1-preview | 12.5% |
| Claude 3 Opus | 10.0% |
| Llama 3.1 405B | 5.0% |

First solve time is a strong indicator of difficulty for agents. The most difficult task in the benchmark has a human first solve time of 24 hours and 54 minutes.

Other Environment Requirements

CyBench requires an OpenReward API key (provided as the api_key secret) for sandbox provisioning. Challenges with sidecar services require the corresponding Docker images to be available.

Safety

CyBench evaluates offensive cybersecurity capabilities of language model agents. The US AISI and UK AISI used Cybench as the only open-source cybersecurity benchmark in their joint pre-deployment tests of Anthropic's Claude 3.5 Sonnet and OpenAI o1. All challenges run in isolated, sandboxed environments with no access to real-world systems. Vulnerable services are purpose-built CTF challenges, not production software.

Citations

@inproceedings{zhang2025cybench,
  title={Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models},
  author={Zhang, Andy K. and Perry, Neil and Dulepet, Riya and Ji, Joey and Menders, Celeste and Lin, Justin W. and Jones, Eliot and Hussein, Gashon and Liu, Samantha and Jasper, Donovan and Peetathawatchai, Pura and Glenn, Ari and Sivashankar, Vikram and Zamoshchin, Daniel and Glikbarg, Leo and Askaryar, Derek and Yang, Mike and Zhang, Teddy and Alluri, Rishi and Tran, Nathan and Sangpisit, Rinnara and Yiorkadjis, Polycarpos and Osele, Kenny and Raghupathi, Gautham and Boneh, Dan and Ho, Daniel E. and Liang, Percy},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=tc90LV0yRL}
}