EvoEval
Description
EvoEval is an environment for evaluating code generation using evolved programming benchmarks. Created by transforming existing benchmarks (like HumanEval) into novel variants through targeted evolutions, it tests whether models can generalize to new problem formulations rather than relying on memorized solutions. Problems include rewording, subtle specification changes, and composition of multiple sub-problems.
This OpenReward implementation is ported from the Harbor Framework version originally authored by digitsisyph.
Capabilities
- Solving Python programming problems with evolved specifications
- Generalizing beyond memorized benchmark solutions
- Handling rewording, compositional, and creative problem variants
- Writing correct, tested Python code
Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM.
License
Tasks
There is one split in this environment:
- Test: 100 evolved programming tasks
Tasks are Python function implementation problems evolved from standard benchmarks to test genuine coding proficiency rather than benchmark memorization.
Reward Structure
This is a multi-turn, sandbox-based environment. The agent implements a Python function in /app/solution.py, then calls submit_answer to trigger pytest execution against hidden test cases.
- 1.0: All test cases pass.
- 0.0: Any test case fails or the solution is missing.
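As a minimal sketch of this flow (the task, function name, and test values below are hypothetical, not drawn from an actual EvoEval problem), the agent might write:

```python
# /app/solution.py -- hypothetical example task: sum the even integers in a list.
# The real function signature and behavior come from each task's instruction.md.
def sum_evens(nums):
    """Return the sum of the even integers in nums."""
    return sum(n for n in nums if n % 2 == 0)
```

Conceptually, submit_answer reduces the pytest exit status to the binary reward; a sketch of that mapping (not the actual grading code) is:

```python
# Sketch of the reward mapping: pytest exits with code 0 only if every test passes.
import subprocess

result = subprocess.run(["python", "-m", "pytest", "tests/", "-q"])
reward = 1.0 if result.returncode == 0 else 0.0
```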
Data
Each task directory contains an instruction.md with the problem specification and a tests/ directory with test scripts. Task data is stored on the OpenReward platform.
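An illustrative task layout (the directory and test file names are assumptions; only instruction.md and tests/ are specified above):

```
task_0001/
├── instruction.md        # problem specification
└── tests/
    └── test_solution.py  # hidden pytest cases executed on submission
```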
Tools
| Tool | Description |
|---|---|
| bash | Execute shell commands in the sandbox. |
| str_replace | Replace a unique string in a file. |
| view | View file contents or list directory contents. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for automated test execution. |
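For example, before calling submit_answer, an agent can combine create_file and bash to run a quick self-check (the file name and assertions are illustrative and assume the sum_evens sketch above):

```python
# /app/check.py -- hypothetical pre-submission sanity check, run via the bash tool
# with a command such as `python /app/check.py`.
from solution import sum_evens

assert sum_evens([1, 2, 3, 4]) == 6  # 2 + 4
assert sum_evens([]) == 0            # empty input
print("local checks passed")
```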
Time Horizon
EvoEval is a multi-turn environment: agents read the problem specification, implement a solution, test it, and submit it for verification.
Environment Difficulty
The original paper evaluates 51 LLMs and finds an average 39.4% performance drop compared to HumanEval:
| Model | EvoEval Avg | Difficult | Creative | Combine |
|---|---|---|---|---|
| GPT-4 | 66.2% | 52% | 66% | 53% |
| GPT-4-Turbo | 65.1% | 50% | 61% | 45% |
| Claude-3 | 62.9% | 50% | - | 42% |
Performance drops range from 19.6% to 47.7% across models, revealing potential overfitting to existing benchmarks. Claude-3.5 (not shown in the table above) excels on difficult and combine problems, while GPT-4-Turbo performs better on tool use and creative tasks.
Other Environment Requirements
There are no further environment requirements; EvoEval works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in EvoEval write and execute Python code in a sandboxed environment. The environment does not present direct safety risks.
Citations
```bibtex
@article{xia2024evoeval,
  author  = {Chunqiu Steven Xia and Yinlin Deng and Lingming Zhang},
  title   = {Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM},
  journal = {arXiv preprint arXiv:2403.19114},
  year    = {2024},
  url     = {https://arxiv.org/abs/2403.19114}
}
```