nl2repobench

Description

NL2Repo Bench (Natural Language to Repository Benchmark) is a benchmark for evaluating the long-horizon repository-generation ability of coding agents, measuring their capacity to sustain coherent reasoning, planning, and execution across extended interaction horizons. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library.

Implementations (1)
Environment                      Stars    Last Updated
GeneralReasoning/NL2RepoBench    0        2 months ago