FeatureBench
Description
FeatureBench is an execution-based benchmark for evaluating AI coding agents on feature-level development tasks. Unlike SWE-bench, which focuses on bug fixing, FeatureBench requires agents to implement complete, production-ready features in real-world Python repositories. Agents receive a detailed feature specification and must modify the source code so that held-out tests pass.
Capabilities
- Reading and understanding large Python codebases
- Implementing new features from natural language specifications
- Multi-file code editing across complex repositories
- Reasoning about test expectations and software architecture
Compute Requirements
Agents are given a sandboxed Docker environment with a pre-built instance image per task. Default sandbox size is 2 CPU and 4 GB RAM.
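As a rough illustration of the default sandbox size, the sketch below builds a `docker run` command line that caps a container at 2 CPUs and 4 GB of RAM. The helper name `docker_run_args`, the image name in the usage example, and the `sleep infinity` keep-alive command are assumptions for illustration; the environment's actual launch mechanism is not described here and may differ.

```python
def docker_run_args(image: str, cpus: int = 2, memory_gb: int = 4) -> list[str]:
    """Build a `docker run` argument list with CPU and memory limits.

    Hypothetical helper: only the 2 CPU / 4 GB defaults come from this README.
    """
    return [
        "docker", "run",
        "--rm",                    # remove the container when it exits
        "--detach",                # run in the background
        f"--cpus={cpus}",          # CPU limit
        f"--memory={memory_gb}g",  # RAM limit
        image,
        "sleep", "infinity",       # assumed keep-alive so commands can be exec'd later
    ]


if __name__ == "__main__":
    # "example/instance-image" is a placeholder, not a real FeatureBench image name.
    print(" ".join(docker_run_args("example/instance-image")))
```

The list can be passed to `subprocess.run` on a host with Docker installed; larger tasks could pass different `cpus` or `memory_gb` values.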
License
MIT. The underlying FeatureBench dataset is subject to its own license terms.
Tasks
Two splits following the original paper:
- lite: 30 tasks (curated subset for quick evaluation)
- full: 200 tasks (complete benchmark)
Each task provides a repository, base commit, feature specification, and held-out test files. Tasks span 24 open-source Python repositories including transformers, pandas, mlflow, astropy, scikit-learn, pytorch-lightning, and more.
Reward Structure
Multi-turn environment with binary reward:
- 1.0 — All FAIL_TO_PASS tests pass and all PASS_TO_PASS tests remain passing (resolved)
- 0.0 — Any required test fails or regresses
On submission, the environment restores held-out test files via the test patch, runs pytest, and checks both FAIL_TO_PASS (feature correctness) and PASS_TO_PASS (no regressions). The F2P pass rate is reported in metadata for finer-grained analysis.
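The scoring rule above can be sketched as a small function. This is a minimal illustration, not the environment's actual implementation: the function name `compute_reward`, the shape of `results` (a mapping from test identifier to pass/fail), and the handling of missing or empty test lists are all assumptions.

```python
from typing import Mapping, Sequence


def compute_reward(
    results: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> tuple[float, float]:
    """Return (reward, f2p_pass_rate) for one submission.

    Assumption: a test missing from `results` counts as failed.
    """
    f2p_passed = [results.get(test, False) for test in fail_to_pass]
    p2p_passed = [results.get(test, False) for test in pass_to_pass]

    # Resolved only if every F2P test passes and no P2P test regresses.
    # Assumption: a task with no F2P tests is never counted as resolved.
    resolved = bool(f2p_passed) and all(f2p_passed) and all(p2p_passed)

    # Finer-grained signal: fraction of F2P tests that pass.
    f2p_rate = sum(f2p_passed) / len(f2p_passed) if f2p_passed else 0.0
    return (1.0 if resolved else 0.0), f2p_rate


if __name__ == "__main__":
    outcome = {"tests/test_feature.py": True, "tests/test_existing.py": False}
    print(compute_reward(outcome, ["tests/test_feature.py"], ["tests/test_existing.py"]))
```

In the usage example the feature test passes but an existing test regresses, so the reward is 0.0 even though the F2P pass rate is 1.0. This is why the pass rate is useful as separate metadata.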
Data
Source: LiberCoders/FeatureBench on HuggingFace. ~6 MB total across both splits. Each task includes the instance ID, repository name, base commit, gold patch, test patch, problem statement, F2P/P2P test file lists, Docker image name, and per-repo test settings.
Tools
| Tool | Description |
|---|---|
| bash | Run bash commands in the sandbox container |
| str_replace | Replace a unique string in a file with another string |
| view | View file contents or directory listings |
| create_file | Create a new file with specified content |
| submit | Submit the solution: restores test files, runs pytest, returns reward |
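To make the uniqueness requirement of `str_replace` concrete, here is a minimal sketch of how such a tool might behave. The signature, return messages, and error handling are assumptions for illustration; the environment's real tool may differ in all of these.

```python
from pathlib import Path


def str_replace(path: str, old: str, new: str) -> str:
    """Replace `old` with `new` in the file at `path`, only if `old` occurs exactly once.

    Hypothetical sketch: returns a status message instead of raising.
    """
    if not old:
        return "Error: the string to replace must not be empty"

    file = Path(path)
    text = file.read_text(encoding="utf-8")
    count = text.count(old)

    if count == 0:
        return f"Error: string not found in {path}"
    if count > 1:
        # Refuse ambiguous edits; the caller should add surrounding context.
        return f"Error: string occurs {count} times in {path}; it must be unique"

    file.write_text(text.replace(old, new, 1), encoding="utf-8")
    return f"Replaced 1 occurrence in {path}"
```

Rejecting non-unique matches forces the agent to include enough surrounding context to pin down a single edit location, which avoids silently changing the wrong occurrence.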
Time Horizon
Multi-turn. Agents explore the repository, read code, implement the feature across one or more files, and submit. Typical tasks may involve 10–50+ tool calls depending on complexity.
Environment Difficulty
FeatureBench is significantly harder than SWE-bench. As of the paper's publication, the best agent (Claude 4.5 Opus + OpenHands) achieves only 10.5% resolved on the full split, compared to 74.4% on SWE-bench Verified.
Safety
Agents operate within sandboxed Docker containers. The environment does not involve private data or production systems. Held-out test files are restored from the test patch at submission time, so any changes the agent makes to them are overwritten before scoring.
Citations
@inproceedings{zhou2026featurebench,
title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
author={Zhou, Qixing and Zhang, Jiacheng and Wang, Haiyang and Hao, Rui and Wang, Jiahe and Han, Minghao and Yang, Yuxue and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and others},
booktitle={International Conference on Learning Representations},
year={2026},
url={https://arxiv.org/abs/2602.10975},
}