FeatureBench

Description

FeatureBench is an execution-based benchmark for evaluating AI coding agents on feature-level development tasks. Unlike SWE-bench, which focuses on bug fixing, FeatureBench requires agents to implement complete, production-ready features in real-world Python repositories. Agents receive a detailed feature specification and must modify the source code so that held-out tests pass.

Capabilities

Reading and understanding large Python codebases
Implementing new features from natural language specifications
Multi-file code editing across complex repositories
Reasoning about test expectations and software architecture

Compute Requirements

Each agent runs in a sandboxed Docker environment with one pre-built instance image per task. The default sandbox size is 2 vCPUs and 4 GB RAM.
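
To reproduce comparable resource limits locally, a container can be started with Docker's `--cpus` and `--memory` flags. The sketch below only builds the argument list. The function name and the image name in the usage note are hypothetical, and this is not necessarily how the environment itself launches its sandbox.

```python
def docker_run_args(image: str, command: list[str],
                    cpus: int = 2, memory_gb: int = 4) -> list[str]:
    """Build a `docker run` argument list with CPU and memory limits.

    Defaults mirror the sandbox size stated above (2 CPUs, 4 GB RAM).
    `image` is a placeholder for a task's pre-built instance image.
    """
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),          # limit the container's CPU share
        "--memory", f"{memory_gb}g",  # limit the container's RAM
        image,
        *command,
    ]
```

For example, `docker_run_args("example/task-image:latest", ["bash"])` yields a list that can be passed to `subprocess.run`.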

License

MIT. The underlying FeatureBench dataset is subject to its own license terms.

Tasks

Two splits following the original paper:

lite: 30 tasks (curated subset for quick evaluation)
full: 200 tasks (complete benchmark)

Each task provides a repository, base commit, feature specification, and held-out test files. Tasks span 24 open-source Python repositories including transformers, pandas, mlflow, astropy, scikit-learn, pytorch-lightning, and more.

Reward Structure

Multi-turn environment with binary reward:

1.0 — All FAIL_TO_PASS tests pass and all PASS_TO_PASS tests remain passing (resolved)
0.0 — Any required test fails or regresses

On submission, the environment restores the held-out test files via the test patch, runs pytest, and checks both FAIL_TO_PASS (feature correctness) and PASS_TO_PASS (no regressions). The FAIL_TO_PASS (F2P) pass rate is reported in the result metadata for finer-grained analysis.
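
The scoring rule above can be sketched as a small function. This is an illustration of the stated rule, not the environment's actual code: the names `SubmissionResult` and `score_submission` are hypothetical, and treating an empty FAIL_TO_PASS list as unresolved is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SubmissionResult:
    reward: float         # 1.0 if resolved, otherwise 0.0
    f2p_pass_rate: float  # fraction of FAIL_TO_PASS tests that passed


def score_submission(f2p_passed: dict[str, bool],
                     p2p_passed: dict[str, bool]) -> SubmissionResult:
    """Binary reward: all F2P tests pass and no P2P test regresses."""
    # Assumption: with no F2P tests there is nothing to resolve,
    # so the pass rate is 0.0 and the task is not resolved.
    if f2p_passed:
        rate = sum(f2p_passed.values()) / len(f2p_passed)
    else:
        rate = 0.0
    resolved = (bool(f2p_passed)
                and all(f2p_passed.values())
                and all(p2p_passed.values()))
    return SubmissionResult(reward=1.0 if resolved else 0.0,
                            f2p_pass_rate=rate)
```

A submission that passes half of its F2P tests gets reward 0.0 with a pass rate of 0.5, which is the kind of partial progress the metadata is meant to expose.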

Data

Source: LiberCoders/FeatureBench on HuggingFace. ~6 MB total across both splits. Each task includes the instance ID, repository name, base commit, gold patch, test patch, problem statement, F2P/P2P test file lists, Docker image name, and per-repo test settings.
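
As a reading aid, the fields listed above can be modelled as a typed record. Every field and key name in this sketch is hypothetical; the dataset's real column names may differ and should be checked against the dataset card.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureBenchTask:
    # Hypothetical field names mirroring the list in the text above.
    instance_id: str
    repo: str
    base_commit: str
    gold_patch: str
    test_patch: str
    problem_statement: str
    fail_to_pass: list[str]   # F2P test files
    pass_to_pass: list[str]   # P2P test files
    docker_image: str
    test_settings: dict = field(default_factory=dict)  # per-repo settings


def task_from_record(record: dict) -> FeatureBenchTask:
    """Build a task from a dict whose keys match the hypothetical field names."""
    return FeatureBenchTask(
        instance_id=record["instance_id"],
        repo=record["repo"],
        base_commit=record["base_commit"],
        gold_patch=record["gold_patch"],
        test_patch=record["test_patch"],
        problem_statement=record["problem_statement"],
        fail_to_pass=list(record["fail_to_pass"]),
        pass_to_pass=list(record["pass_to_pass"]),
        docker_image=record["docker_image"],
        test_settings=dict(record.get("test_settings", {})),
    )
```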

Tools

Tool	Description
`bash`	Run bash commands in the sandbox container
`str_replace`	Replace a unique string in a file with another string
`view`	View file contents or directory listings
`create_file`	Create a new file with specified content
`submit`	Submit the solution — restores test files, runs pytest, returns reward
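
The `str_replace` description implies a uniqueness check: the edit should apply only when the target string occurs exactly once. A minimal sketch of that behavior follows. It is an assumption about the semantics, not the environment's implementation, and the error messages are illustrative.

```python
from pathlib import Path


def str_replace(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in the file, requiring exactly one match."""
    if not old:
        raise ValueError("the string to replace must not be empty")
    file = Path(path)
    text = file.read_text(encoding="utf-8")
    count = text.count(old)
    if count != 1:
        # Zero matches means nothing to edit; several matches are ambiguous.
        raise ValueError(f"expected exactly one occurrence, found {count}")
    file.write_text(text.replace(old, new, 1), encoding="utf-8")
```

Rejecting ambiguous matches forces the caller to include enough surrounding context to pin down a single edit site.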

Time Horizon

Multi-turn. Agents explore the repository, read code, implement the feature across one or more files, and submit. Typical tasks may involve 10–50+ tool calls depending on complexity.

Environment Difficulty

FeatureBench is substantially harder than SWE-bench Verified when measured by resolved rate. As of the paper's publication, the best agent (Claude 4.5 Opus + OpenHands) resolves only 10.5% of the full split, compared with 74.4% on SWE-bench Verified.

Safety

Agents operate within sandboxed Docker containers. The environment does not involve private data or production systems. Test files are restored automatically at submission time and cannot be tampered with by the agent.

Citations

@inproceedings{zhou2026featurebench,
  title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
  author={Zhou, Qixing and Zhang, Jiacheng and Wang, Haiyang and Hao, Rui and Wang, Jiahe and Han, Minghao and Yang, Yuxue and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and others},
  booktitle={International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2602.10975},
}

Repository

Source repository

EnvCommons/FeatureBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	2 vCPUs / 4 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000460
Total	$0.0000780

Examples

5-minute session: $0.0234

1-hour session: $0.2808
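
The example figures follow directly from the per-second rates in the cost table. The short sketch below reproduces the arithmetic; `Decimal` is used to avoid floating-point rounding noise, and the constant and function names are illustrative.

```python
from decimal import Decimal

# Per-second rates copied from the cost table above (USD).
ENVIRONMENT_RATE = Decimal("0.0000320")
SANDBOX_RATE = Decimal("0.0000460")


def session_cost(seconds: int) -> Decimal:
    """Total cost in USD for an active session of the given length."""
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds
```

With these rates, `session_cost(300)` equals $0.0234 and `session_cost(3600)` equals $0.2808, matching the two examples.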
