DevEval
Description
DevEval is an environment for evaluating repository-level code generation aligned with real-world software development. Tasks require agents to implement code within realistic project structures, including proper dependency management, adherence to architectural designs (UML diagrams), and passing test suites. Tasks span multiple programming languages including C++, Java, JavaScript, and Python.
This OpenReward implementation is ported from the Harbor Framework implementation, originally written by Michael Yang.
Capabilities
- Repository-level code generation across multiple languages
- Implementing software from architectural specifications (UML diagrams)
- Writing unit tests and acceptance tests
- Working with build systems (Make, npm, Maven, pip)
Compute Requirements
Agents are given a sandboxed environment with bash access, file editing tools, and language-specific toolchains. Default sandbox size is 1 CPU and 2 GB RAM.
License
Tasks
There is one split in this environment:
- Test: 63 development tasks
Tasks cover multiple languages and phases: C++ implementation and testing, Java projects, JavaScript applications, and Python packages. Each task may focus on implementation, acceptance testing, or unit testing.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent implements code iteratively using file editing and bash tools, then calls submit_answer to trigger the project's test suite.
- 1.0: All tests pass.
- 0.0: Any test fails or code does not compile.
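As a rough sketch of this scoring rule (assuming the task's verification reduces to running a single shell command; the function name, arguments, and timeout below are illustrative, not the platform's actual harness):

```python
import subprocess

def compute_reward(test_command: str, workdir: str, timeout: int = 600) -> float:
    """Illustrative binary reward: 1.0 only if the project's test suite exits cleanly.

    Hypothetical sketch -- the real OpenReward harness runs the task's own
    verification scripts; the command and timeout here are placeholders.
    """
    try:
        result = subprocess.run(
            test_command, shell=True, cwd=workdir,
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung builds or tests count as failure
    # Any non-zero exit code (compile error, failed test) yields zero reward.
    return 1.0 if result.returncode == 0 else 0.0
```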
Data
Each task directory contains an instruction.md with detailed specifications (including UML diagrams and pre-provided source code) and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
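For intuition, a task directory could be read roughly as follows. Only instruction.md and tests/ are documented above; the loader name and the returned dictionary shape are assumptions, not the OpenReward data schema:

```python
from pathlib import Path

def load_task(task_dir: str) -> dict:
    """Illustrative loader for one DevEval task directory (hypothetical helper)."""
    root = Path(task_dir)
    instruction = (root / "instruction.md").read_text()  # specification, UML, provided code
    test_scripts = sorted(p.name for p in (root / "tests").iterdir())  # verification scripts
    return {"instruction": instruction, "tests": test_scripts}
```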
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated test execution. |
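To make the interface concrete, a single agent turn might issue tool calls shaped like the following. The argument names and payload structure are illustrative assumptions; the exact schema is defined by the OpenReward platform:

```python
# Hypothetical tool-call payloads for one agent turn; argument names are
# illustrative and may differ from the platform's actual schema.
calls = [
    {"tool": "view", "args": {"path": "src/"}},                    # inspect the repo layout
    {"tool": "str_replace", "args": {
        "path": "src/parser.py",
        "old_str": "raise NotImplementedError",
        "new_str": "return build_node(kind, children)",
    }},                                                            # patch one unique snippet
    {"tool": "bash", "args": {"command": "python -m pytest -q"}},  # run the tests locally
    {"tool": "submit_answer", "args": {}},                         # trigger official verification
]
```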
Time Horizon
DevEval is a multi-turn environment. Agents read project specifications, explore the existing codebase, implement solutions, build and test, and submit for verification.
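A minimal sketch of that loop, assuming generic agent and sandbox interfaces (both hypothetical stand-ins, not the OpenReward harness):

```python
def run_episode(agent, sandbox, max_turns: int = 50) -> float:
    """Illustrative multi-turn episode: explore, edit, build and test, then submit.

    `agent` and `sandbox` are hypothetical objects standing in for the model
    and the sandboxed environment respectively.
    """
    observation = sandbox.read("instruction.md")
    for _ in range(max_turns):
        action = agent.next_action(observation)     # e.g. bash, view, str_replace, create_file
        if action.tool == "submit_answer":
            return sandbox.run_test_suite()         # binary reward, as described above
        observation = sandbox.execute(action)       # feed tool output back to the agent
    return 0.0  # ran out of turns without submitting
```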
Environment Difficulty
The original paper evaluates 8 LLMs on repository-level code generation; selected Pass@1 results are shown below:
| Model | Pass@1 |
|---|---|
| GPT-4-turbo | 53.0% |
| DeepSeek Coder 33B | 46.3% |
| GPT-3.5-turbo | 44.5% |
| CodeLLaMa 13B | 41.9% |
| StarCoder 2 15B | 37.8% |
GPT-4-turbo achieves only 53% on DevEval compared to 80% on HumanEval, demonstrating the gap between isolated function generation and real-world repository-level development.
Other Environment Requirements
There are no further environment requirements; DevEval works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in DevEval write and execute code in a sandboxed environment. The environment does not present direct safety risks.
Citations
@inproceedings{li2024deveval,
author = {Jia Li and Ge Li and Yunfei Zhao and Yongmin Li and Huanyu Liu and Hao Zhu and Lecheng Wang and Kaibo Liu and Zheng Fang and Lanshen Wang and Jiazheng Ding and Xuanming Zhang and Yuqi Zhu and Yihong Dong and Zhi Jin and Binhua Li and Fei Huang and Yongbin Li and Bin Gu and Mengfei Yang},
title = {DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
year = {2024},
url = {https://arxiv.org/abs/2405.19856}
}