NL2RepoBench
Description
NL2RepoBench is an environment for evaluating whether coding agents can build complete, functional Python repositories from scratch given only a natural language specification. Agents start with an empty workspace and a detailed project spec, then must implement the entire project — source code, package configuration, and supporting files. The original repository's pytest suite is run against the generated code to produce a continuous reward signal.
Capabilities
- Complete repository generation from natural language specifications
- Multi-file Python project architecture and implementation
- Dependency management and package configuration
- Long-horizon planning and iterative development (~90 tool calls on average)
Compute Requirements
Each task runs in its own sandbox container (1 CPU / 2 GB RAM) with the task-specific Docker image from NL2RepoBench. Network access is enabled for dependency installation.
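As a rough illustration of these limits, the sketch below builds a `docker run` argument list using Docker's standard `--cpus` and `--memory` flags. The function name `sandbox_command`, the image naming scheme (registry prefix plus task name, default tag), and the `sleep infinity` keep-alive command are assumptions for illustration, not the environment's actual launcher.

```python
def sandbox_command(task_name: str, cpus: float = 1.0, memory: str = "2g") -> list[str]:
    """Build a `docker run` argument list with the stated resource limits."""
    # Assumption: one image per task under this registry prefix, named after the task.
    image = f"ghcr.io/multimodal-art-projection/nl2repobench/{task_name}"
    return [
        "docker", "run", "--rm",
        f"--cpus={cpus}",      # 1 CPU
        f"--memory={memory}",  # 2 GB RAM
        image,
        "sleep", "infinity",   # hypothetical keep-alive so commands can be exec'd later
    ]


if __name__ == "__main__":
    print(" ".join(sandbox_command("example-task")))
```

Docker's default bridge network leaves outbound access enabled, which matches the requirement that dependencies can be installed inside the sandbox.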
License
Tasks
There is one split:
- test: 103 tasks spanning diverse Python library domains (1 task, arxiv-mcp-server, was excluded because no pre-built Docker image is available on GHCR)
Task categories (the counts sum to 104, so they cover the full original benchmark, including the excluded task):
- System Tools (24)
- Data Analysis & Processing (18)
- Testing (13)
- Utility Libraries (11)
- Web Development (10)
- Networking Tools (9)
- Database Interaction (7)
- Machine Learning (7)
- Batch File Processing (5)
Each task provides a detailed natural language specification (avg ~18,800 tokens) describing the project's goals, architecture, API signatures, and implementation requirements.
Reward Structure
Continuous, verifiable reward based on pytest pass rate:
- Reward = min(passed_tests / total_tests, 1.0)
- Scale: 0.0 (no tests pass) to 1.0 (all tests pass)
- No LLM grader — purely execution-based against the original repository's pytest suite
During evaluation, the agent's package configuration files and test files are replaced with the originals from the reference repository. Only the agent's source code implementation is evaluated.
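The reward formula can be written as a small function. This is a minimal sketch: the name `compute_reward` and the handling of a non-positive `total_tests` are assumptions, since the source does not say how that case is treated.

```python
def compute_reward(passed_tests: int, total_tests: int) -> float:
    """Return the pytest pass rate, capped at 1.0."""
    if total_tests <= 0:
        # Assumption: treat a missing or zero expected count as zero reward.
        return 0.0
    # The cap keeps the reward in [0.0, 1.0] even if more tests pass
    # than the expected total recorded for the task.
    return min(passed_tests / total_tests, 1.0)


if __name__ == "__main__":
    print(compute_reward(45, 100))
    print(compute_reward(120, 100))
```

The `min` only matters when the number of passing tests exceeds the recorded total, for example if pytest collects more cases than `test_case_count.txt` lists.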
Data
Source: NL2RepoBench GitHub
Per-task data:
- start.md: Natural language project specification
- test_commands.json: Shell commands for running tests
- test_files.json: Test file paths (replaced with originals during evaluation)
- test_case_count.txt: Total expected test cases
Per-task Docker images from ghcr.io/multimodal-art-projection/nl2repobench/ contain the original repository with all dependencies pre-installed.
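One way to read the per-task files is sketched below. The on-disk layout is an assumption: the sketch supposes all four files sit in one directory, that both JSON files hold lists of strings, and that `test_case_count.txt` contains a single integer. `TaskData` and `load_task` are hypothetical names.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskData:
    spec: str                 # contents of start.md
    test_commands: list[str]  # assumed: JSON list of shell commands
    test_files: list[str]     # assumed: JSON list of test file paths
    test_case_count: int      # assumed: a single integer in the text file


def load_task(task_dir: str) -> TaskData:
    """Load the four per-task files from one directory (assumed layout)."""
    root = Path(task_dir)
    return TaskData(
        spec=(root / "start.md").read_text(encoding="utf-8"),
        test_commands=json.loads((root / "test_commands.json").read_text(encoding="utf-8")),
        test_files=json.loads((root / "test_files.json").read_text(encoding="utf-8")),
        test_case_count=int((root / "test_case_count.txt").read_text(encoding="utf-8").strip()),
    )
```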
Tools
- bash: Execute bash commands in the sandbox (write files, install packages, run local tests)
- submit: Run the official evaluation pipeline — replaces agent's config/tests with originals, runs pytest, returns score
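The source does not describe how submit extracts the pass count from pytest. A minimal sketch, assuming the count is read from pytest's final summary line (for example `7 passed, 2 failed in 1.23s`), could look like this. `count_passed` is a hypothetical helper, not part of the environment's API.

```python
import re

# Matches the "N passed" fragment of pytest's summary line.
_PASSED = re.compile(r"(\d+) passed")


def count_passed(pytest_output: str) -> int:
    """Return the passed-test count from pytest output, or 0 if none is reported."""
    matches = _PASSED.findall(pytest_output)
    # Use the last match, since the summary line comes at the end of the output.
    return int(matches[-1]) if matches else 0


if __name__ == "__main__":
    print(count_passed("===== 7 passed, 2 failed in 1.23s ====="))
```

Returning 0 when no summary is found means a run that crashes during collection scores as no tests passed, which fits the 0.0 end of the reward scale.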
Time Horizon
Multi-turn, long-horizon environment. The paper reports an average of ~90 interaction turns per task, with total conversation context reaching ~90,000 tokens, so each task requires sustained autonomous coding over a long context.
Environment Difficulty
| Difficulty | LOC Range | Count | Best Model Performance |
|---|---|---|---|
| Easy | ≤1,500 | 26 | ~52% pass rate |
| Medium | 1,500–4,000 | 46 | ~45% pass rate |
| Hard | ≥4,000 | 32 | ~25% pass rate |
Overall best (Claude Code with Claude Sonnet 4.5): ~40% average pass rate. Only 3 of the 104 repositories in the original benchmark are fully solved by any model.
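As a quick consistency check, the per-tier figures in the table can be combined into a count-weighted average. This sketch assumes the approximate tier pass rates all come from the same best configuration, which the source does not state explicitly.

```python
# (task count, approximate best pass rate) per difficulty tier, from the table above
tiers = {
    "Easy": (26, 0.52),
    "Medium": (46, 0.45),
    "Hard": (32, 0.25),
}

total = sum(count for count, _ in tiers.values())
weighted = sum(count * rate for count, rate in tiers.values()) / total

print(total)               # 104
print(round(weighted, 3))  # 0.406
```

The tier counts sum to 104, and the weighted average of about 40.6% is in line with the reported ~40% overall pass rate.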
Safety
Tasks involve implementing well-known open-source Python libraries in isolated sandbox environments. Agents cannot affect external systems. The generated code is evaluated only against pre-existing test suites in sandboxed containers.
Citations
@misc{ding2026nl2repobenchlonghorizonrepositorygeneration,
title={NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents},
author={Jingzhe Ding and Shengda Long and Changxin Pu and Huan Zhou and Hongwan Gao and Xiang Gao and Chao He and Yue Hou and Fei Hu and Zhaojian Li and Weiran Shi and Zaiyuan Wang and Daoguang Zan and Chenchen Zhang and Xiaoxu Zhang and Qizhi Chen and Xianfu Cheng and Bo Deng and Qingshui Gu and Kai Hua and Juntao Lin and Pai Liu and Mingchen Li and Xuanguang Pan and Zifan Peng and Yujia Qin and Yong Shan and Zhewen Tan and Weihao Xie and Zihan Wang and Yishuo Yuan and Jiayu Zhang and Enduo Zhao and Yunfei Zhao and He Zhu and Liya Zhu and Chenyang Zou and Ming Ding and Jianpeng Jiao and Jiaheng Liu and Minghao Liu and Qian Liu and Chongyang Tao and Jian Yang and Tong Yang and Zhaoxiang Zhang and Xinjie Chen and Wenhao Huang and Ge Zhang},
year={2026},
eprint={2512.12730},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.12730},
}