DevEval
Description
DevEval is an environment for evaluating repository-level code generation aligned with real-world software development. Tasks require agents to implement code within realistic project structures, including proper dependency management, adherence to architectural designs (UML diagrams), and passing test suites. Tasks span multiple programming languages including C++, Java, JavaScript, and Python.
This OpenReward implementation is ported from the Harbor Framework implementation, originally written by Michael Yang.
Capabilities
- Repository-level code generation across multiple languages
- Implementing software from architectural specifications (UML diagrams)
- Writing unit tests and acceptance tests
- Working with build systems (Make, npm, Maven, pip)
Compute Requirements
Agents are given a sandboxed environment with bash access, file editing tools, and language-specific toolchains. Default sandbox size is 1 CPU and 2 GB RAM.
License
Tasks
There is one split in this environment:
- Test: 63 development tasks
Tasks cover multiple languages and phases: C++ implementation and testing, Java projects, JavaScript applications, and Python packages. Each task may focus on implementation, acceptance testing, or unit testing.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent implements code iteratively using file editing and bash tools, then calls submit_answer to trigger the project's test suite.
- 1.0: All tests pass.
- 0.0: Any test fails or code does not compile.
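As a rough sketch of this scoring rule (assuming the task's verification reduces to running a single shell command; the function name, arguments, and timeout below are illustrative, not the platform's actual harness):

```python
import subprocess

def compute_reward(test_command: str, workdir: str, timeout: int = 600) -> float:
    """Illustrative binary reward: 1.0 only if the project's test suite exits cleanly.

    Hypothetical sketch -- the real OpenReward harness runs the task's own
    verification scripts; the command and timeout here are placeholders.
    """
    try:
        result = subprocess.run(
            test_command, shell=True, cwd=workdir,
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung builds or tests count as failure
    # Any non-zero exit code (compile error, failed test) yields zero reward.
    return 1.0 if result.returncode == 0 else 0.0
```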
Data
Each task directory contains an instruction.md with detailed specifications (including UML diagrams and pre-provided source code) and a tests/ directory with verification scripts. Task data is stored on the OpenReward platform.
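For intuition, a task directory could be read roughly as follows. Only instruction.md and tests/ are documented above; the loader name and the returned dictionary shape are assumptions, not the OpenReward data schema:

```python
from pathlib import Path

def load_task(task_dir: str) -> dict:
    """Illustrative loader for one DevEval task directory (hypothetical helper)."""
    root = Path(task_dir)
    instruction = (root / "instruction.md").read_text()  # specification, UML, provided code
    test_scripts = sorted(p.name for p in (root / "tests").iterdir())  # verification scripts
    return {"instruction": instruction, "tests": test_scripts}
```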
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated test execution. |
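To make the interface concrete, a single agent turn might issue tool calls shaped like the following. The argument names and payload structure are illustrative assumptions; the exact schema is defined by the OpenReward platform:

```python
# Hypothetical tool-call payloads for one agent turn; argument names are
# illustrative and may differ from the platform's actual schema.
calls = [
    {"tool": "view", "args": {"path": "src/"}},                    # inspect the repo layout
    {"tool": "str_replace", "args": {
        "path": "src/parser.py",
        "old_str": "raise NotImplementedError",
        "new_str": "return build_node(kind, children)",
    }},                                                            # patch one unique snippet
    {"tool": "bash", "args": {"command": "python -m pytest -q"}},  # run the tests locally
    {"tool": "submit_answer", "args": {}},                         # trigger official verification
]
```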
Time Horizon
DevEval is a multi-turn environment. Agents read project specifications, explore the existing codebase, implement solutions, build and test, and submit for verification.
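A minimal sketch of that loop, assuming generic agent and sandbox interfaces (both hypothetical stand-ins, not the OpenReward harness):

```python
def run_episode(agent, sandbox, max_turns: int = 50) -> float:
    """Illustrative multi-turn episode: explore, edit, build and test, then submit.

    `agent` and `sandbox` are hypothetical objects standing in for the model
    and the sandboxed environment respectively.
    """
    observation = sandbox.read("instruction.md")
    for _ in range(max_turns):
        action = agent.next_action(observation)     # e.g. bash, view, str_replace, create_file
        if action.tool == "submit_answer":
            return sandbox.run_test_suite()         # binary reward, as described above
        observation = sandbox.execute(action)       # feed tool output back to the agent
    return 0.0  # ran out of turns without submitting
```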
Environment Difficulty
The original paper evaluates 8 LLMs on repository-level code generation; selected Pass@1 results are shown below:
| Model | Pass@1 |
|---|---|
| GPT-4-turbo | 53.0% |
| DeepSeek Coder 33B | 46.3% |
| GPT-3.5-turbo | 44.5% |
| CodeLLaMa 13B | 41.9% |
| StarCoder 2 15B | 37.8% |
GPT-4-turbo achieves only 53% on DevEval compared to 80% on HumanEval, demonstrating the gap between isolated function generation and real-world repository-level development.
Other Environment Requirements
There are no further environment requirements; DevEval works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in DevEval write and execute code in a sandboxed environment. The environment does not present direct safety risks.
Citations
@inproceedings{li2024deveval,
author = {Jia Li and Ge Li and Yunfei Zhao and Yongmin Li and Huanyu Liu and Hao Zhu and Lecheng Wang and Kaibo Liu and Zheng Fang and Lanshen Wang and Jiazheng Ding and Xuanming Zhang and Yuqi Zhu and Yihong Dong and Zhi Jin and Binhua Li and Fei Huang and Yongbin Li and Bin Gu and Mengfei Yang},
title = {DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
year = {2024},
url = {https://arxiv.org/abs/2405.19856}
}