NL2RepoBench

OpenReward Environment

Description

NL2RepoBench is an environment for evaluating whether coding agents can build complete, functional Python repositories from scratch given only a natural language specification. Agents start with an empty workspace and a detailed project spec, then must implement the entire project — source code, package configuration, and supporting files. The original repository's pytest suite is run against the generated code to produce a continuous reward signal.

Capabilities

  • Complete repository generation from natural language specifications
  • Multi-file Python project architecture and implementation
  • Dependency management and package configuration
  • Long-horizon planning and iterative development (~90 tool calls on average)

Compute Requirements

Each task runs in its own sandbox container (1 CPU / 2 GB RAM) with the task-specific Docker image from NL2RepoBench. Network access is enabled for dependency installation.

License

Apache 2.0

Tasks

There is one split:

  • test: 103 of the benchmark's 104 tasks, spanning diverse Python library domains (one task, arxiv-mcp-server, is excluded because no pre-built Docker image is available on GHCR)

Task categories (the counts sum to 104, so they appear to describe the full benchmark, including the excluded task):

  • System Tools (24)
  • Data Analysis & Processing (18)
  • Testing (13)
  • Utility Libraries (11)
  • Web Development (10)
  • Networking Tools (9)
  • Database Interaction (7)
  • Machine Learning (7)
  • Batch File Processing (5)

Each task provides a detailed natural language specification (avg ~18,800 tokens) describing the project's goals, architecture, API signatures, and implementation requirements.

Reward Structure

Continuous, verifiable reward based on pytest pass rate:

  • Reward = min(passed_tests / total_tests, 1.0)
  • Scale: 0.0 (no tests pass) to 1.0 (all tests pass)
  • No LLM grader — purely execution-based against the original repository's pytest suite

During evaluation, the agent's package configuration files and test files are replaced with the originals from the reference repository. Only the agent's source code implementation is evaluated.
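
The reward formula above can be sketched in a few lines of Python. This is an illustration only, not the environment's actual scoring code: the helper names count_passed and compute_reward are hypothetical, the passed count is read from pytest's final summary line, and returning 0.0 when the expected total is zero is an assumption.

```python
import re


def count_passed(pytest_output: str) -> int:
    """Return the 'N passed' count from pytest's summary line, or 0 if absent."""
    match = re.search(r"(\d+) passed", pytest_output)
    return int(match.group(1)) if match else 0


def compute_reward(passed_tests: int, total_tests: int) -> float:
    """Reward = min(passed_tests / total_tests, 1.0)."""
    if total_tests <= 0:
        # Assumption: a task with no expected tests scores 0.0.
        return 0.0
    # The clamp keeps the reward at 1.0 even if more tests pass than expected.
    return min(passed_tests / total_tests, 1.0)


# Hypothetical pytest summary line and expected test count.
summary = "===== 3 failed, 17 passed in 4.21s ====="
reward = compute_reward(count_passed(summary), total_tests=20)
print(reward)
```

Here total_tests would come from the task's test_case_count.txt, so tests that fail to collect still count against the score rather than shrinking the denominator.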

Data

Source: NL2RepoBench GitHub

Per-task data:

  • start.md — Natural language project specification
  • test_commands.json — Shell commands for running tests
  • test_files.json — Test file paths (replaced with originals during evaluation)
  • test_case_count.txt — Total expected test cases

Per-task Docker images from ghcr.io/multimodal-art-projection/nl2repobench/ contain the original repository with all dependencies pre-installed.
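
A minimal sketch of loading the per-task files listed above. It assumes the four files sit together in one directory per task and that test_case_count.txt holds a single integer. The TaskData class and load_task function are hypothetical names, and the JSON files are kept as parsed objects because their schema is not described here.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class TaskData:
    spec: str              # contents of start.md
    test_commands: Any     # parsed test_commands.json (schema not assumed)
    test_files: Any        # parsed test_files.json (schema not assumed)
    test_case_count: int   # integer read from test_case_count.txt


def load_task(task_dir: str) -> TaskData:
    """Read the four per-task files from a single directory (assumed layout)."""
    root = Path(task_dir)

    def read(name: str) -> str:
        return (root / name).read_text(encoding="utf-8")

    return TaskData(
        spec=read("start.md"),
        test_commands=json.loads(read("test_commands.json")),
        test_files=json.loads(read("test_files.json")),
        test_case_count=int(read("test_case_count.txt").strip()),
    )
```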

Tools

  • bash: Execute bash commands in the sandbox (write files, install packages, run local tests)
  • submit: Run the official evaluation pipeline — replaces agent's config/tests with originals, runs pytest, returns score

Time Horizon

Multi-turn, long-horizon environment. The paper reports an average of ~90 interaction turns per task, with total conversation context reaching ~90,000 tokens, which makes this a demanding environment for sustained autonomous coding.

Environment Difficulty

Difficulty   LOC Range      Count   Best Model Performance
Easy         ≤1,500         26      ~52% pass rate
Medium       1,500–4,000    46      ~45% pass rate
Hard         ≥4,000         32      ~25% pass rate

Overall best (Claude Code with Claude Sonnet 4.5): ~40% average pass rate. Only 3 out of 104 repositories are fully solved by any model.
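
As a rough consistency check, the per-tier figures can be combined into a count-weighted average. This sketch assumes the three tier figures come from the same best-performing configuration, which the table does not state. The pass rates are the approximate values quoted above.

```python
# (repository count, approximate best pass rate) per difficulty tier
tiers = {
    "Easy": (26, 0.52),
    "Medium": (46, 0.45),
    "Hard": (32, 0.25),
}

total_repos = sum(count for count, _ in tiers.values())
weighted_rate = sum(count * rate for count, rate in tiers.values()) / total_repos

print(total_repos, round(weighted_rate, 3))
```

The tier counts add up to 104 repositories and the weighted rate comes out at roughly 0.41, in line with the ~40% overall figure.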

Safety

Tasks involve implementing well-known open-source Python libraries in isolated sandbox environments. Agents cannot affect external systems. The generated code is evaluated only against pre-existing test suites in sandboxed containers.

Citations

@misc{ding2026nl2repobenchlonghorizonrepositorygeneration,
      title={NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents},
      author={Jingzhe Ding and Shengda Long and Changxin Pu and Huan Zhou and Hongwan Gao and Xiang Gao and Chao He and Yue Hou and Fei Hu and Zhaojian Li and Weiran Shi and Zaiyuan Wang and Daoguang Zan and Chenchen Zhang and Xiaoxu Zhang and Qizhi Chen and Xianfu Cheng and Bo Deng and Qingshui Gu and Kai Hua and Juntao Lin and Pai Liu and Mingchen Li and Xuanguang Pan and Zifan Peng and Yujia Qin and Yong Shan and Zhewen Tan and Weihao Xie and Zihan Wang and Yishuo Yuan and Jiayu Zhang and Enduo Zhao and Yunfei Zhao and He Zhu and Liya Zhu and Chenyang Zou and Ming Ding and Jianpeng Jiao and Jiaheng Liu and Minghao Liu and Qian Liu and Chongyang Tao and Jian Yang and Tong Yang and Zhaoxiang Zhang and Xinjie Chen and Wenhao Huang and Ge Zhang},
      year={2026},
      eprint={2512.12730},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.12730},
}