CoderEval

Description

CoderEval is a benchmark for evaluating code generation models. It consists of 230 Python and 230 Java tasks curated from real-world open-source projects, together with a self-contained execution platform that automatically assesses functional correctness. Tasks are grouped into six levels of context dependency, where context means code elements defined outside the target function, such as types, APIs, variables, and constants. This lets the benchmark measure how well models generate non-standalone functions, not only standalone ones.
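To make the idea of execution-based checking with outside context concrete, here is a minimal Python sketch. It is not CoderEval's actual platform or data schema: the `TASK` record, its field names, and the `passes_tests` helper are all hypothetical. It only illustrates how a generated function that depends on a constant defined outside its body can be judged by running it against test cases.

```python
from typing import Any, Dict

# Hypothetical task record; the field names are illustrative assumptions,
# not CoderEval's real schema.
TASK: Dict[str, Any] = {
    "entry_point": "scale",
    # Context defined outside the target function (here, a constant).
    "context": "FACTOR = 3\n",
    # Test cases as (args, expected_result) pairs.
    "tests": [((2,), 6), ((0,), 0), ((-1,), -3)],
}


def passes_tests(candidate_src: str, task: Dict[str, Any]) -> bool:
    """Return True if the candidate passes every test when run with the task context.

    A real evaluation platform would run untrusted code in a sandbox;
    plain exec() is used here only to keep the sketch short.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(task["context"], namespace)  # load the surrounding context
        exec(candidate_src, namespace)    # define the generated function
        func = namespace[task["entry_point"]]
        return all(func(*args) == expected for args, expected in task["tests"])
    except Exception:
        # Any error (missing name, wrong signature, crash) counts as a failure.
        return False


# Two hypothetical model outputs for the same task.
GOOD = "def scale(x):\n    return x * FACTOR\n"
BAD = "def scale(x):\n    return x + FACTOR\n"

if __name__ == "__main__":
    print(passes_tests(GOOD, TASK))
    print(passes_tests(BAD, TASK))
```

In this sketch, the same candidate fails if the context is withheld, because `FACTOR` is then undefined. That is the kind of dependency a non-standalone task exercises.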

