DevEval

⭐ OpenReward Environment

Description

DevEval is an environment for evaluating repository-level code generation aligned with real-world software development. Tasks require agents to implement code within realistic project structures, including proper dependency management, adherence to architectural designs (UML diagrams), and passing test suites. Tasks span multiple programming languages including C++, Java, JavaScript, and Python.

This OpenReward implementation is ported from the original Harbor Framework implementation by Michael Yang.

Capabilities

  • Repository-level code generation across multiple languages
  • Implementing software from architectural specifications (UML diagrams)
  • Writing unit tests and acceptance tests
  • Working with build systems (Make, npm, Maven, pip)

Compute Requirements

Agents are given a sandboxed environment with bash access, file editing tools, and language-specific toolchains. Default sandbox size is 1 CPU and 2 GB RAM.

License

Apache 2.0.

Tasks

There is one split in this environment:

  • Test: 63 development tasks

Tasks cover multiple languages and phases: C++ implementation and testing, Java projects, JavaScript applications, and Python packages. Each task may focus on implementation, acceptance testing, or unit testing.

Reward Structure

This is a multi-turn, sandbox-based environment. The agent implements code iteratively using file editing and bash tools, then calls submit_answer to trigger the project's test suite.

  • 1.0: All tests pass.
  • 0.0: Any test fails or code does not compile.
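The binary reward above can be sketched as a simple check over the build and test outcomes. This is an illustrative sketch only; `compute_reward`, `test_results`, and `compiled` are hypothetical names, not the environment's actual grading API.

```python
def compute_reward(test_results: list[bool], compiled: bool = True) -> float:
    """Binary reward: 1.0 only if the build succeeded and every test passed.

    `test_results` holds one pass/fail flag per test; `compiled` is False
    when the project failed to build. Names are illustrative, not the
    environment's real interface.
    """
    if not compiled or not test_results:
        return 0.0
    return 1.0 if all(test_results) else 0.0
```

Partial credit is deliberately absent: a single failing test scores the same as a non-compiling submission.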

Data

Each task directory contains an instruction.md with detailed specifications (including UML diagrams and pre-provided source code) and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
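A task laid out as described above could be read like this. The layout (instruction.md plus a tests/ directory) comes from this README; the function name and returned fields are assumptions, not the platform's real loader.

```python
from pathlib import Path


def load_task(task_dir: str) -> dict:
    """Read a DevEval-style task directory: instruction.md plus tests/.

    Layout assumed from the description above; the OpenReward platform's
    actual loader may differ.
    """
    root = Path(task_dir)
    return {
        "instruction": (root / "instruction.md").read_text(encoding="utf-8"),
        "tests": sorted(p.name for p in (root / "tests").iterdir()),
    }
```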

Tools

  Tool           Description
  bash           Execute shell commands in the sandbox.
  str_replace    Replace a unique string in a file.
  view           View file contents or list directory contents.
  create_file    Create a new file with specified content.
  submit_answer  Submit work for automated test execution.
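The "unique string" contract of str_replace can be illustrated with a minimal sketch over a file's text. The real tool's signature and error handling are not documented here; this only mirrors the uniqueness requirement stated in the table above.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `text`.

    Illustrative sketch of the uniqueness rule, not the actual tool.
    """
    count = text.count(old)
    if count == 0:
        raise ValueError("string not found")
    if count > 1:
        raise ValueError(f"string occurs {count} times; must be unique")
    return text.replace(old, new)
```

Requiring uniqueness forces the agent to include enough surrounding context in `old` to pin down a single edit site, which prevents accidental edits elsewhere in the file.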

Time Horizon

DevEval is a multi-turn environment. Agents read project specifications, explore the existing codebase, implement solutions, build and test, and submit for verification.
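The multi-turn interaction described above can be sketched as a control loop. The `agent` and `sandbox` interfaces here are entirely hypothetical stand-ins; the real OpenReward interfaces are not specified in this README.

```python
def run_episode(agent, sandbox, max_turns: int = 50) -> float:
    """Hypothetical control loop: the agent explores and edits until it submits.

    `agent` and `sandbox` stand in for the real OpenReward interfaces,
    which this README does not specify.
    """
    observation = sandbox.reset()
    for _ in range(max_turns):
        tool_call = agent.act(observation)     # e.g. bash, view, str_replace
        if tool_call.name == "submit_answer":
            return sandbox.grade()             # runs the project's test suite
        observation = sandbox.execute(tool_call)
    return 0.0  # ran out of turns without submitting
```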

Environment Difficulty

The original paper evaluates 8 LLMs on repository-level code generation; selected Pass@1 results:

  Model               Pass@1
  GPT-4-turbo         53.0%
  DeepSeek Coder 33B  46.3%
  GPT-3.5-turbo       44.5%
  CodeLLaMa 13B       41.9%
  StarCoder 2 15B     37.8%

GPT-4-turbo achieves only 53% on DevEval compared to 80% on HumanEval, demonstrating the gap between isolated function generation and real-world repository-level development.

Other Environment Requirements

There are no further environment requirements; DevEval works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in DevEval write and execute code in a sandboxed environment. The environment does not present direct safety risks.

Citations

@inproceedings{li2024deveval,
  author    = {Jia Li and Ge Li and Yunfei Zhao and Yongmin Li and Huanyu Liu and Hao Zhu and Lecheng Wang and Kaibo Liu and Zheng Fang and Lanshen Wang and Jiazheng Ding and Xuanming Zhang and Yuqi Zhu and Yihong Dong and Zhi Jin and Binhua Li and Fei Huang and Yongbin Li and Bin Gu and Mengfei Yang},
  title     = {DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
  year      = {2024},
  url       = {https://arxiv.org/abs/2405.19856}
}