EvoEval

⭐ OpenReward Environment

Description

EvoEval is an environment for evaluating code generation using evolved programming benchmarks. Created by transforming existing benchmarks (like HumanEval) into novel variants through targeted evolutions, it tests whether models can generalize to new problem formulations rather than relying on memorized solutions. Problems include rewording, subtle specification changes, and composition of multiple sub-problems.

This OpenReward implementation is ported from the original Harbor Framework implementation by digitsisyph.

Capabilities

  • Solving Python programming problems with evolved specifications
  • Generalizing beyond memorized benchmark solutions
  • Handling rewording, compositional, and creative problem variants
  • Writing correct, tested Python code

Compute Requirements

Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.

License

Apache 2.0.

Tasks

There is one split in this environment:

  • Test: 100 evolved programming tasks

Tasks are Python function implementation problems evolved from standard benchmarks to test genuine coding proficiency rather than benchmark memorization.

Reward Structure

This is a multi-turn, sandbox-based environment. The agent implements a Python function in /app/solution.py, then calls submit_answer to trigger pytest execution against hidden test cases.

  • 1.0: All test cases pass.
  • 0.0: Any test case fails or solution is missing.
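This pass/fail scheme can be sketched as follows. This is a minimal illustration, not the platform's actual grader; the pytest invocation and the `/app` path are assumptions based on the description above.

```python
import subprocess

def reward_from_exit_code(code: int) -> float:
    """Map a pytest exit code to the binary reward:
    0 means every test passed, anything else scores 0.0."""
    return 1.0 if code == 0 else 0.0

def grade(solution_dir: str = "/app") -> float:
    """Illustrative sketch: run the hidden tests against the
    submitted solution and return the binary reward."""
    result = subprocess.run(
        ["pytest", "tests/", "-q"],
        cwd=solution_dir,
        capture_output=True,
    )
    return reward_from_exit_code(result.returncode)
```

Because the reward is all-or-nothing, a solution that passes most but not all hidden tests still scores 0.0.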

Data

Each task directory contains an instruction.md with the problem specification and a tests/ directory with test scripts. Task data is stored on the OpenReward platform.

Tools

| Tool | Description |
| --- | --- |
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated test execution. |

Time Horizon

EvoEval is a multi-turn environment. Agents read the problem specification, implement a solution, test it, and submit for verification.
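As a concrete illustration of that loop, consider a hypothetical task whose instruction.md asks for a function `add` (real EvoEval problems are evolved, harder variants). The agent would write the file to /app/solution.py (e.g. with the create_file tool), self-test it, then call submit_answer:

```python
# Hypothetical task: instruction.md asks for add(a, b) returning the sum.
# In the sandbox this code would be saved as /app/solution.py.

def add(a, b):
    """Return the sum of a and b."""
    return a + b

# Quick self-check before calling submit_answer; the hidden
# pytest suite runs its own, stricter cases after submission.
assert add(2, 3) == 5
assert add(-1, 1) == 0
```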

Environment Difficulty

The original paper evaluates 51 LLMs and finds an average 39.4% performance drop compared to HumanEval:

| Model | EvoEval Avg | Difficult | Creative | Combine |
| --- | --- | --- | --- | --- |
| GPT-4 | 66.2% | 52% | 66% | 53% |
| GPT-4-Turbo | 65.1% | 50% | 61% | 45% |
| Claude-3 | 62.9% | 50% | - | 42% |

Performance drops range from 19.6% to 47.7% across models, revealing potential overfitting to existing benchmarks. Claude-3.5 excels on difficult and combine problems, while GPT-4-Turbo performs better on tool use and creative tasks.

Other Environment Requirements

There are no further environment requirements; EvoEval works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in EvoEval write and execute Python code in a sandboxed environment. The environment does not present direct safety risks.

Citations

@article{xia2024evoeval,
  author    = {Chunqiu Steven Xia and Yinlin Deng and Lingming Zhang},
  title     = {Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM},
  journal   = {arXiv preprint arXiv:2403.19114},
  year      = {2024},
  url       = {https://arxiv.org/abs/2403.19114}
}