EvoEval
Description
EvoEval is an environment for evaluating code generation using evolved programming benchmarks. Created by transforming existing benchmarks (like HumanEval) into novel variants through targeted evolutions, it tests whether models can generalize to new problem formulations rather than relying on memorized solutions. Problems include rewording, subtle specification changes, and composition of multiple sub-problems.
This OpenReward implementation is ported from the Harbor Framework version originally authored by digitsisyph.
Capabilities
- Solving Python programming problems with evolved specifications
- Generalizing beyond memorized benchmark solutions
- Handling rewording, compositional, and creative problem variants
- Writing correct, tested Python code
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.
License
Tasks
There is one split in this environment:
- Test: 100 evolved programming tasks
Tasks are Python function implementation problems evolved from standard benchmarks to test genuine coding proficiency rather than benchmark memorization.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent implements a Python function in /app/solution.py, then calls submit_answer to trigger pytest execution against hidden test cases.
- 1.0: All test cases pass.
- 0.0: Any test case fails or the solution is missing.
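As a minimal sketch of this flow (the task, function name, and test values below are hypothetical, not drawn from an actual EvoEval problem), the agent might write:

```python
# /app/solution.py -- hypothetical example task: sum the even integers in a list.
# The real function signature and behavior come from each task's instruction.md.
def sum_evens(nums):
    """Return the sum of the even integers in nums."""
    return sum(n for n in nums if n % 2 == 0)
```

Conceptually, submit_answer reduces the pytest exit status to the binary reward; a sketch of that mapping (not the actual grading code) is:

```python
# Sketch of the reward mapping: pytest exits with code 0 only if every test passes.
import subprocess

result = subprocess.run(["python", "-m", "pytest", "tests/", "-q"])
reward = 1.0 if result.returncode == 0 else 0.0
```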
Data
Each task directory contains an instruction.md with the problem specification and a tests/ directory with test scripts. Task data is stored on the OpenReward platform.
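An illustrative task layout (the directory and test file names are assumptions; only instruction.md and tests/ are specified above):

```
task_0001/
├── instruction.md        # problem specification
└── tests/
    └── test_solution.py  # hidden pytest cases executed on submission
```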
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated test execution. |
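For example, before calling submit_answer, an agent can combine create_file and bash to run a quick self-check (the file name and assertions are illustrative and assume the sum_evens sketch above):

```python
# /app/check.py -- hypothetical pre-submission sanity check, run via the bash tool
# with a command such as `python /app/check.py`.
from solution import sum_evens

assert sum_evens([1, 2, 3, 4]) == 6  # 2 + 4
assert sum_evens([]) == 0            # empty input
print("local checks passed")
```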
Time Horizon
EvoEval is a multi-turn environment: agents read the problem specification, implement a solution, test it, and submit it for verification.
Environment Difficulty
The original paper evaluates 51 LLMs and finds an average 39.4% performance drop compared to HumanEval:
| Model | EvoEval Avg | Difficult | Creative | Combine |
|---|---|---|---|---|
| GPT-4 | 66.2% | 52% | 66% | 53% |
| GPT-4-Turbo | 65.1% | 50% | 61% | 45% |
| Claude-3 | 62.9% | 50% | - | 42% |
Performance drops range from 19.6% to 47.7% across models, revealing potential overfitting to existing benchmarks. Claude-3.5 (not shown in the table above) excels on difficult and combine problems, while GPT-4-Turbo performs better on tool use and creative tasks.
Other Environment Requirements
There are no further environment requirements; EvoEval works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in EvoEval write and execute Python code in a sandboxed environment. The environment does not present direct safety risks.
Citations
```bibtex
@article{xia2024evoeval,
  author  = {Chunqiu Steven Xia and Yinlin Deng and Lingming Zhang},
  title   = {Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM},
  journal = {arXiv preprint arXiv:2403.19114},
  year    = {2024},
  url     = {https://arxiv.org/abs/2403.19114}
}
```