FeatureBench

OpenReward Environment

Description

FeatureBench is an execution-based benchmark for evaluating AI coding agents on feature-level development tasks. Unlike SWE-bench, which focuses on bug fixing, FeatureBench requires agents to implement complete, production-ready features in real-world Python repositories. Agents receive a detailed feature specification and must modify the source code so that held-out tests pass.

Capabilities

  • Reading and understanding large Python codebases
  • Implementing new features from natural language specifications
  • Multi-file code editing across complex repositories
  • Reasoning about test expectations and software architecture

Compute Requirements

Agents are given a sandboxed Docker environment with a pre-built instance image per task. The default sandbox size is 2 CPUs and 4 GB of RAM.
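
As a rough illustration of these limits, the sketch below builds a `docker run` command that caps a container at 2 CPUs and 4 GB of memory using Docker's `--cpus` and `--memory` flags. The helper `sandbox_command` and the image name are hypothetical; how the environment actually launches its sandboxes is not described here.

```python
from typing import List


def sandbox_command(image: str, cpus: int = 2, memory_gb: int = 4) -> List[str]:
    """Build a `docker run` argv with CPU and memory caps (illustrative only)."""
    return [
        "docker", "run", "--rm",
        f"--cpus={cpus}",          # cap the CPU share at `cpus` cores
        f"--memory={memory_gb}g",  # hard memory limit in gigabytes
        image,
    ]


# Hypothetical image name; each real task ships its own pre-built image.
print(" ".join(sandbox_command("example/featurebench-task:latest")))
```

Returning an argument list rather than a shell string makes it safe to pass directly to `subprocess.run` without quoting concerns.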

License

MIT. The underlying FeatureBench dataset is subject to its own license terms.

Tasks

The environment provides two splits, following the original paper:

  • lite: 30 tasks (curated subset for quick evaluation)
  • full: 200 tasks (complete benchmark)

Each task provides a repository, base commit, feature specification, and held-out test files. Tasks span 24 open-source Python repositories including transformers, pandas, mlflow, astropy, scikit-learn, pytorch-lightning, and more.

Reward Structure

Multi-turn environment with binary reward:

  • 1.0 — All FAIL_TO_PASS tests pass and all PASS_TO_PASS tests remain passing (resolved)
  • 0.0 — Any required test fails or regresses

On submission, the environment restores held-out test files via the test patch, runs pytest, and checks both FAIL_TO_PASS (feature correctness) and PASS_TO_PASS (no regressions). The FAIL_TO_PASS (F2P) pass rate is reported in the metadata for finer-grained analysis.
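
The scoring rule can be sketched as follows. This is a minimal illustration, not the environment's actual code: the function name `score_submission` and the dict-of-booleans input format are assumptions, as is the choice to treat a task with no FAIL_TO_PASS tests as unresolved.

```python
from typing import Dict, Tuple


def score_submission(
    f2p: Dict[str, bool], p2p: Dict[str, bool]
) -> Tuple[float, float]:
    """Return (reward, f2p_pass_rate) from per-test pass/fail outcomes."""
    # Fraction of FAIL_TO_PASS tests that now pass (reported as metadata).
    f2p_rate = sum(f2p.values()) / len(f2p) if f2p else 0.0
    # Resolved only if every F2P test passes and no P2P test regressed.
    # Assumption: an empty F2P set counts as unresolved.
    resolved = bool(f2p) and all(f2p.values()) and all(p2p.values())
    return (1.0 if resolved else 0.0), f2p_rate


# Partial feature: one of two F2P tests passes, no regressions.
print(score_submission({"t1": True, "t2": False}, {"t3": True}))  # (0.0, 0.5)
# Feature complete, but an existing test regressed.
print(score_submission({"t1": True, "t2": True}, {"t3": False}))  # (0.0, 1.0)
# Fully resolved.
print(score_submission({"t1": True, "t2": True}, {"t3": True}))   # (1.0, 1.0)
```

The second case shows why the F2P pass rate is useful alongside the binary reward: it separates "feature works but broke something" from "feature does not work".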

Data

Source: LiberCoders/FeatureBench on Hugging Face, about 6 MB in total across both splits. Each task includes the instance ID, repository name, base commit, gold patch, test patch, problem statement, FAIL_TO_PASS and PASS_TO_PASS test file lists, Docker image name, and per-repo test settings.
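
One way to picture a task record is as a small dataclass. The field names below are hypothetical, chosen to mirror the list above; the actual column names in the dataset may differ.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class TaskRecord:
    """Illustrative task schema; field names are assumptions, not the dataset's."""
    instance_id: str
    repo: str
    base_commit: str
    patch: str                # gold (reference) patch
    test_patch: str           # restores the held-out test files
    problem_statement: str    # feature specification shown to the agent
    fail_to_pass: List[str]   # test files that must pass after the change
    pass_to_pass: List[str]   # test files that must keep passing
    image_name: str           # pre-built Docker image for this task
    test_settings: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_row(cls, row: Dict[str, Any]) -> "TaskRecord":
        """Build a record from a dict, ignoring keys this sketch does not model."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in row.items() if k in known})


# Placeholder values for illustration only.
task = TaskRecord.from_row({
    "instance_id": "example-1",
    "repo": "example/repo",
    "base_commit": "0000000",
    "patch": "",
    "test_patch": "",
    "problem_statement": "Add feature X.",
    "fail_to_pass": ["tests/test_x.py"],
    "pass_to_pass": ["tests/test_core.py"],
    "image_name": "example/image:latest",
    "unmodeled_column": 123,
})
print(task.repo, task.fail_to_pass)
```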

Tools

Tool          Description
bash          Run bash commands in the sandbox container
str_replace   Replace a unique string in a file with another string
view          View file contents or directory listings
create_file   Create a new file with specified content
submit        Submit the solution: restores test files, runs pytest, returns reward
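
The "unique string" requirement of `str_replace` can be illustrated with a short sketch. This is an assumption about the tool's semantics based on its one-line description, not its actual implementation: the edit is rejected unless the target string occurs exactly once.

```python
import tempfile
from pathlib import Path


def str_replace(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in the file, requiring exactly one occurrence."""
    if not old:
        raise ValueError("old string must be non-empty")
    text = Path(path).read_text(encoding="utf-8")
    count = text.count(old)
    if count != 1:
        # Refuse ambiguous or missing matches instead of guessing.
        raise ValueError(f"expected exactly one match, found {count}")
    Path(path).write_text(text.replace(old, new, 1), encoding="utf-8")


# Usage on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "module.py"
    target.write_text("def f():\n    return 1\n", encoding="utf-8")
    str_replace(str(target), "return 1", "return 2")
    print(target.read_text(encoding="utf-8"))
```

Requiring uniqueness forces the caller to include enough surrounding context to pin down a single edit location, which avoids silent changes to the wrong occurrence.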

Time Horizon

Multi-turn. Agents explore the repository, read code, implement the feature across one or more files, and submit. Typical tasks may involve 10–50+ tool calls depending on complexity.
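
The shape of such an episode can be sketched as a loop that dispatches tool calls until `submit` is used or a step budget runs out. Everything here is hypothetical scaffolding: `run_episode`, the tool argument names, and the stub tools stand in for a real agent and sandbox.

```python
from typing import Callable, Dict, List, Optional, Tuple

ToolCall = Tuple[str, Dict[str, str]]  # (tool name, keyword arguments)


def run_episode(
    policy: Callable[[List[str]], ToolCall],
    tools: Dict[str, Callable[..., str]],
    max_steps: int = 50,
) -> Tuple[Optional[str], int]:
    """Run tool calls chosen by `policy`; stop at `submit` or after `max_steps`."""
    observations: List[str] = []
    for step in range(1, max_steps + 1):
        name, args = policy(observations)
        result = tools[name](**args)
        observations.append(result)
        if name == "submit":
            return result, step  # the episode ends when the agent submits
    return None, max_steps       # budget exhausted without a submission


# Stub tools and a scripted three-step policy, for illustration only.
stub_tools: Dict[str, Callable[..., str]] = {
    "view": lambda path: f"listing of {path}",
    "bash": lambda command: f"ran {command}",
    "submit": lambda: "submitted",
}
script: List[ToolCall] = [
    ("view", {"path": "."}),
    ("bash", {"command": "pytest -q"}),
    ("submit", {}),
]
print(run_episode(lambda obs: script[len(obs)], stub_tools))  # ('submitted', 3)
```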

Environment Difficulty

FeatureBench is significantly harder than SWE-bench. As of the paper's publication, the best agent (Claude 4.5 Opus + OpenHands) resolved only 10.5% of tasks on the full split, compared to 74.4% on SWE-bench Verified.

Safety

Agents operate within sandboxed Docker containers. The environment does not involve private data or production systems. Test files are restored automatically at submission time and cannot be tampered with by the agent.

Citations

@inproceedings{zhou2026featurebench,
  title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
  author={Zhou, Qixing and Zhang, Jiacheng and Wang, Haiyang and Hao, Rui and Wang, Jiahe and Han, Minghao and Yang, Yuxue and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and others},
  booktitle={International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2602.10975},
}