Implementation of arXiv/nl2repobench

NL2RepoBench

Description

NL2RepoBench is an environment for evaluating whether coding agents can build complete, functional Python repositories from scratch given only a natural language specification. Agents start with an empty workspace and a detailed project spec, then must implement the entire project — source code, package configuration, and supporting files. The original repository's pytest suite is run against the generated code to produce a continuous reward signal.

Capabilities

Complete repository generation from natural language specifications
Multi-file Python project architecture and implementation
Dependency management and package configuration
Long-horizon planning and iterative development (~90 tool calls on average)

Compute Requirements

Each task runs in its own sandbox container (1 CPU / 2 GB RAM) with the task-specific Docker image from NL2RepoBench. Network access is enabled for dependency installation.

License

Apache 2.0

Tasks

There is one split:

test: 103 tasks spanning diverse Python library domains. The original benchmark has 104 tasks; one task, arxiv-mcp-server, was excluded because no pre-built Docker image is available on GHCR.

Task categories (the counts sum to 104, so they cover the full benchmark, including the excluded task):

System Tools (24)
Data Analysis & Processing (18)
Testing (13)
Utility Libraries (11)
Web Development (10)
Networking Tools (9)
Database Interaction (7)
Machine Learning (7)
Batch File Processing (5)

Each task provides a detailed natural language specification (avg ~18,800 tokens) describing the project's goals, architecture, API signatures, and implementation requirements.

Reward Structure

Continuous, verifiable reward based on pytest pass rate:

Reward = min(passed_tests / total_tests, 1.0)
Scale: 0.0 (no tests pass) to 1.0 (all tests pass)
No LLM grader — purely execution-based against the original repository's pytest suite
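The reward formula above can be sketched in Python as follows. The function name and the guard for a non-positive test count are assumptions made for illustration; they are not taken from the environment's code.

```python
def pass_rate_reward(passed_tests: int, total_tests: int) -> float:
    """Return the pytest pass rate, capped at 1.0."""
    if total_tests <= 0:
        # Assumption: a task with no recorded tests earns no reward.
        return 0.0
    # The cap keeps the reward in [0, 1] even if more tests pass
    # than the recorded total.
    return min(passed_tests / total_tests, 1.0)


print(pass_rate_reward(45, 100))   # 0.45
print(pass_rate_reward(120, 100))  # 1.0
```

The cap only matters when the number of passing tests exceeds the recorded total; in every other case the reward is the plain pass rate.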

During evaluation, the agent's package configuration files and test files are replaced with the originals from the reference repository. Only the agent's source code implementation is evaluated.

Data

Source: NL2RepoBench GitHub

Per-task data:

start.md — Natural language project specification
test_commands.json — Shell commands for running tests
test_files.json — Test file paths (replaced with originals during evaluation)
test_case_count.txt — Total expected test cases

Per-task Docker images from ghcr.io/multimodal-art-projection/nl2repobench/ contain the original repository with all dependencies pre-installed.
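As a rough illustration of how the per-task files fit together, the sketch below loads them into one record. It assumes that the four files sit in a single task directory, that both JSON files hold lists of strings, and that test_case_count.txt holds a single integer; none of these layout details are confirmed by the README, and the names TaskData and load_task are hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskData:
    spec: str             # contents of start.md
    test_commands: list   # shell commands used to run the tests
    test_files: list      # test file paths restored at evaluation time
    test_case_count: int  # total expected test cases


def load_task(task_dir) -> TaskData:
    """Read the four per-task files from task_dir (assumed layout)."""
    root = Path(task_dir)

    def read(name: str) -> str:
        return (root / name).read_text(encoding="utf-8")

    return TaskData(
        spec=read("start.md"),
        test_commands=json.loads(read("test_commands.json")),
        test_files=json.loads(read("test_files.json")),
        test_case_count=int(read("test_case_count.txt").strip()),
    )
```

If the real files use a different schema, for example a JSON object instead of a list, only the parsing lines in load_task would need to change.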

Tools

bash: Execute bash commands in the sandbox (write files, install packages, run local tests)
submit: Run the official evaluation pipeline — replaces agent's config/tests with originals, runs pytest, returns score
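The replacement step that submit performs before running pytest might look roughly like the sketch below. This is not the environment's actual pipeline: restore_originals, its parameters, and the choice to skip paths missing from the reference tree are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def restore_originals(workspace, reference, relative_paths):
    """Overwrite workspace files with the reference copies.

    relative_paths lists config and test files relative to both roots.
    Returns the paths that were actually restored.
    """
    workspace, reference = Path(workspace), Path(reference)
    restored = []
    for rel in relative_paths:
        src = reference / rel
        if not src.is_file():
            # Assumption: entries absent from the reference are skipped.
            continue
        dst = workspace / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # replaces the agent's version, if any
        restored.append(rel)
    return restored
```

Files not named in relative_paths are left alone, which matches the idea that only the agent's source code is what gets evaluated.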

Time Horizon

Multi-turn, long-horizon environment. The paper reports an average of ~90 interaction turns per task, with total conversation context reaching ~90,000 tokens, so agents must sustain coherent autonomous coding over a long horizon.

Environment Difficulty

Difficulty	LOC Range	Count	Best Model Performance
Easy	≤1,500	26	~52% pass rate
Medium	1,500–4,000	46	~45% pass rate
Hard	≥4,000	32	~25% pass rate

Overall best (Claude Code with Claude Sonnet 4.5): ~40% average pass rate. Only 3 out of 104 repositories are fully solved by any model.
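The difficulty buckets can be expressed as a small classifier, and the per-bucket figures can be checked against the overall figure. In the sketch below, the handling of the shared boundaries (1,500 counts as Easy, 4,000 as Hard) is an assumption, since the table's ranges touch at both values. The weighted average is only a consistency check: the table's pass rates are approximate and may not all come from the same model.

```python
def difficulty(loc: int) -> str:
    """Map a repository's lines of code to a difficulty bucket."""
    if loc <= 1500:   # assumption: the boundary value 1,500 is Easy
        return "Easy"
    if loc >= 4000:   # assumption: the boundary value 4,000 is Hard
        return "Hard"
    return "Medium"


# (task count, approximate best pass rate) per bucket, from the table.
BUCKETS = {"Easy": (26, 0.52), "Medium": (46, 0.45), "Hard": (32, 0.25)}

total_tasks = sum(count for count, _ in BUCKETS.values())
weighted_rate = sum(count * rate for count, rate in BUCKETS.values()) / total_tasks

print(difficulty(900), difficulty(2500), difficulty(12000))
print(total_tasks)              # 104
print(f"{weighted_rate:.1%}")   # 40.6%
```

The count-weighted average of the three bucket figures lands close to the ~40% overall figure quoted above.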

Safety

Tasks involve implementing well-known open-source Python libraries in isolated sandbox environments. Agents cannot affect external systems. The generated code is evaluated only against pre-existing test suites in sandboxed containers.

Citations

@misc{ding2026nl2repobenchlonghorizonrepositorygeneration,
      title={NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents},
      author={Jingzhe Ding and Shengda Long and Changxin Pu and Huan Zhou and Hongwan Gao and Xiang Gao and Chao He and Yue Hou and Fei Hu and Zhaojian Li and Weiran Shi and Zaiyuan Wang and Daoguang Zan and Chenchen Zhang and Xiaoxu Zhang and Qizhi Chen and Xianfu Cheng and Bo Deng and Qingshui Gu and Kai Hua and Juntao Lin and Pai Liu and Mingchen Li and Xuanguang Pan and Zifan Peng and Yujia Qin and Yong Shan and Zhewen Tan and Weihao Xie and Zihan Wang and Yishuo Yuan and Jiayu Zhang and Enduo Zhao and Yunfei Zhao and He Zhu and Liya Zhu and Chenyang Zou and Ming Ding and Jianpeng Jiao and Jiaheng Liu and Minghao Liu and Qian Liu and Chongyang Tao and Jian Yang and Tong Yang and Zhaoxiang Zhang and Xinjie Chen and Wenhao Huang and Ge Zhang},
      year={2026},
      eprint={2512.12730},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.12730},
}

Repository

Source repository: EnvCommons/NL2RepoBench

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	1 vCPU / 4 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000320
Sandbox	$0.0000230
Total	$0.0000550

Examples

5-minute session: $0.0165

1-hour session: $0.1980
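The example figures follow directly from the per-second rates in the table. A minimal sketch of the arithmetic is shown below; the constant and function names are chosen here for illustration.

```python
ENVIRONMENT_RATE = 0.0000320  # USD per second, from the cost table
SANDBOX_RATE = 0.0000230      # USD per second, from the cost table


def session_cost(seconds: float) -> float:
    """Estimated cost in USD of an active session of the given length."""
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds


print(f"${session_cost(5 * 60):.4f}")   # $0.0165
print(f"${session_cost(60 * 60):.4f}")  # $0.1980
```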
