SpecBench

Measuring reward hacking in long-horizon coding agents.

SpecBench is a benchmark of 30 systems-level programming tasks (JSON parser to OS kernel) that measures whether coding agents genuinely satisfy specifications or just optimize the visible test suite. Each task has two test suites: validation tests (visible to the agent during optimization) and held-out tests (hidden from the agent, used only for evaluation). The reward hacking gap is the difference between pass rates on these two suites.

Paper: SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Setup

1. Install SpecBench

git clone https://github.com/WecoAI/SpecBench.git
cd SpecBench
pip install -e .

2. Install an inner agent

SpecBench wraps existing coding agents (the "inner agent") inside an outer search loop. You need at least one installed:

Claude Code (default):

npm install -g @anthropic-ai/claude-code
claude login
claude --version

Claude Code uses your Anthropic credentials directly. No extra env vars needed.

Codex:

npm install -g @openai/codex
export OPENAI_API_KEY="sk-..."
codex --version

OpenCode:

# Install OpenCode (see https://github.com/opencode-ai/opencode)
# Configure with your preferred provider:
export OPENAI_API_KEY="sk-..."
opencode --version

Running Experiments

SpecBench uses a two-level architecture: an inner agent (the coding agent that writes and edits code) wrapped by an outer search loop (the search strategy that decides which candidates to refine).

# Quick sanity check (5 steps, one task)
python -m experiments.cli --config dev --agent claude_code --task json_parser

# Short run (10 steps, all tasks)
python -m experiments.cli --config short --agent claude_code

# Medium run (25 steps)
python -m experiments.cli --config medium --agent codex --model gpt-5.2-codex

# Full run (50 steps)
python -m experiments.cli --config full --agent claude_code

# Single task with a specific seed
python -m experiments.cli --config short --agent claude_code --task sql_database --seed 42

# Multiple seeds in parallel
python -m experiments.cli --config short --agent claude_code --num-seeds 3 --parallel 3

Search Strategies

The outer loop supports three search strategies (see paper Section 3):

# AIDE tree search (default) — draft/debug/improve branching
python -m experiments.cli --config short --agent claude_code --search-mode aide

# Linear — sequential refinement, no branching
python -m experiments.cli --config short --agent claude_code --search-mode linear

# Autoresearch — single chain, keeps best candidate so far
python -m experiments.cli --config short --agent claude_code --search-mode autoresearch

Using different inner agents

# Claude Code (default)
python -m experiments.cli --agent claude_code

# Codex (requires OPENAI_API_KEY)
python -m experiments.cli --agent codex --model gpt-5.2-codex

# OpenCode with various models
python -m experiments.cli --agent opencode --model openrouter/anthropic/claude-opus-4
python -m experiments.cli --agent opencode --model deepseek/deepseek-v3.2
python -m experiments.cli --agent opencode --model kimi-k2.5

All CLI Options

--config {dev,short,medium,full}    Preset (steps/drafts/budget)
--task TASK                         Single task name (default: all 30)
--agent AGENT                       claude_code | opencode | codex
--model MODEL                       LLM model name
--seed SEED                         Random seed
--num-seeds N                       Run N seeds (0..N-1)
--parallel N                        Parallel workers
--steps N                           Override step count
--cost-budget N                     Max cost in USD
--time-budget N                     Max time in seconds
--search-mode MODE                  aide | linear | autoresearch
--out-dir DIR                       Output directory (default: results/spec_bench)
--no-private-eval                   Skip held-out test evaluation
--difficulty-level {1,2,3,4}        Validation test visibility level
--curriculum                        Progressive difficulty over steps

Tasks

30 systems-level tasks spanning C, Python, and Go. Each task includes a natural-language specification, starter code (stubs), a reference implementation that passes all tests, and both test suites.

Task	Lang	LOC	Validation Tests	Held-out Tests	Domain
json_parser	Py	1.5K	45	178	Parser
package_resolver	Py	3K	32	50	Resolver
http_server	Py	5K	31	144	Server
regex_engine	Py	5K	40	125	Engine
sed_interpreter	Py	5K	118	77	Interpreter
tinygrad	Py	5K	70	76	ML Library
lox_vm	C	5K	52	92	VM
filesystem	C	8K	40	54	System
markdown_renderer	Py	8K	49	125	Renderer
deflate_compression	Py	10K	35	139	Codec
git_impl	Py	10K	25	69	VCS
spreadsheet_engine	Py	10K	34	90	Engine
ray_tracer	C	12K	29	23	Graphics
wasm_interpreter	C	12K	159	61	VM
shell_interpreter	C	14K	41	110	Interpreter
crypto_primitives	Py	15K	24	57	Crypto
css_layout_engine	Py	15K	127	107	Engine
http2_protocol	Py	15K	46	42	Protocol
riscv_emulator	C	15K	50	98	Emulator
tcp_stack	C	15K	42	31	Network
gnu_make	Py	20K	159	102	Build Tool
nes_emulator	C	20K	52	103	Emulator
coreutils	C	25K	48	119	System Utils
database_engine	C	25K	40	25	Database
gollum_compiler	Go	25K	33	52	Compiler
gameboy_emulator	C	30K	50	117	Emulator
sql_database	C	30K	15	11	Database
c_compiler	C	50K	46	299	Compiler
elf_linker	C	50K	35	63	Linker
javascript_engine	C	60K	130	72	Engine
os_kernel	C	110K	36	38	OS Kernel

Total: 1,779 validation tests, 2,783 held-out tests across 30 tasks.

How It Works

SpecBench uses a two-level architecture. The outer search loop (AIDE, Linear, or Autoresearch) manages a tree of candidate implementations. At each step, it selects a candidate, invokes the inner coding agent in an isolated workspace, runs the validation tests, and uses the score to guide the next step. Held-out tests are run for evaluation only and are never shown to the agent.

Outer search loop (AIDE / Linear / Autoresearch)
  |
  |-- Select candidate node from search tree
  |-- Invoke inner agent (Claude Code / Codex / OpenCode) in isolated workspace
  |-- Run validation tests (T_val) → score feeds back to search
  |-- Run held-out tests (T_test) → recorded for analysis (hidden from agent)
  |-- Update search tree

The reward hacking gap = validation pass rate − held-out pass rate. A genuine implementation that follows the specification will have a gap near zero.

Output

Each run produces JSON files in the output directory:

results/spec_bench/exp_short_claude_code_json_parser_s0_20260518/
├── spec_json_parser_seed0_specbench.json   # Per-step validation + held-out scores
├── spec_json_parser_seed0_aide_run.json    # Search tree state
└── workspaces/                             # Agent workspaces per search node

Project Structure

specbench/
├── aide/                           # AIDE tree search framework
│   ├── agent.py                    # Tree search, nodes, search policy
│   ├── backend.py                  # LLM API client (OpenAI-compatible)
│   └── logging.py
├── benchmarks/
│   ├── base.py                     # TaskSpec, EvalResult interfaces
│   └── spec_bench/
│       ├── adapter.py              # Task registry, test evaluation
│       ├── run_loop.py             # Outer search + inner agent integration
│       ├── workspace.py            # Isolated workspace management
│       ├── agents/                 # Inner coding agents
│       │   ├── claude_code.py      # Claude Code CLI wrapper
│       │   ├── codex.py            # Codex CLI wrapper
│       │   ├── opencode.py         # OpenCode CLI wrapper
│       │   └── ...
│       ├── evaluation/
│       │   └── runner.py           # pytest subprocess runner
│       └── tasks/                  # 30 task definitions
│           ├── json_parser/
│           ├── c_compiler/
│           ├── os_kernel/
│           └── ...
├── experiments/
│   ├── cli.py                      # CLI entry point
│   └── spec_bench/
│       ├── config.py               # dev/short/medium/full presets
│       └── run_specbench.py        # Experiment runner
└── pyproject.toml

Citation

@article{zhao2026specbench,
  title={SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents},
  author={Zhao, Bingchen and Srikanth, Dhruv and Wu, Yuxiang and Jiang, Zhengyao},
  journal={arXiv preprint arXiv:2605.21384},
  year={2026}
}

License

Apache 2.0. See LICENSE.

Component	Configuration
Environment Server	2 vCPUs / 8 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Component	Cost / second
Environment	$0.0000640
Sandbox	$0.0000230
Total	$0.0000870

specbench

zhaobc/specbench