SWE-Perf

Description

SWE-Perf is a code performance optimization benchmark comprising 140 tasks derived from real-world performance-improving pull requests in popular GitHub repositories. An agent is placed in a repository sandbox and tasked with modifying code to improve execution speed. Performance is scored by measuring statistically significant speedup using the Mann-Whitney U test.

Capabilities

  • Code performance optimization in real-world repositories
  • Repository-level code exploration and editing
  • Identifying and resolving performance bottlenecks
  • Statistical performance measurement and verification

Compute Requirements

Each agent is given an isolated Docker sandbox with 4 CPUs and 16GB RAM. Per-instance Docker images provide pre-configured environments with the target repository and its dependencies already installed.
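For reference, comparable limits could be set with the Docker SDK for Python; the image tag below is a placeholder, not a real SWE-Perf image name:

```python
import docker  # pip install docker

client = docker.from_env()
container = client.containers.run(
    "swe-perf/instance-example:latest",  # placeholder image tag
    command="sleep infinity",
    nano_cpus=4 * 10**9,     # 4 CPUs
    mem_limit="16g",         # 16 GB RAM
    network_disabled=True,   # no external network access (see Safety)
    detach=True,
)
```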

License

Apache 2.0.

Tasks

There are 140 tasks in a single test split. Each task is derived from a performance-improving pull request in one of 9 popular open-source GitHub repositories. The agent receives a problem statement describing a performance issue and must modify the codebase in /testbed to improve execution speed.

Each task includes:

  • A repository checked out at a specific base commit
  • A problem statement describing the performance issue
  • Performance test cases that measure execution time before and after the agent's changes

Reward Structure

Rewards are continuous in the range [0.0, 1.0] and are computed deterministically without an LLM grader.

When the agent calls the answer tool, the evaluation procedure is:

  1. Performance tests are run 20 times on the agent's modified code (model version).
  2. The code is reverted to the original base commit and the same performance tests are run 20 times (base version).
  3. For each test case, the Mann-Whitney U test (alpha=0.1) is used to compute the minimum statistically significant speedup by iteratively weakening the observed improvement until statistical significance is lost.
  4. The final reward is the minimum gain across all test cases, providing a conservative estimate of performance improvement.

A reward of 0.0 means no statistically significant speedup was detected. Higher values indicate greater verified speedup.
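A minimal sketch of steps 3 and 4, assuming that "weakening" means handicapping the model's timings by a growing factor until the one-sided test loses significance; the actual evaluator may differ in its search strategy:

```python
from scipy.stats import mannwhitneyu

def min_significant_speedup(base_times, model_times, alpha=0.1, step=0.01):
    """Largest relative gain g (capped at 1.0) such that model timings
    inflated by (1 + g) are still significantly faster than base timings."""
    gain = 0.0
    while gain + step <= 1.0:
        handicapped = [t * (1.0 + gain + step) for t in model_times]
        # One-sided Mann-Whitney U: are the handicapped model timings
        # still stochastically smaller than the base timings?
        _, p_value = mannwhitneyu(handicapped, base_times, alternative="less")
        if p_value >= alpha:
            break  # significance lost; stop weakening
        gain += step
    return gain

def reward(base_runs, model_runs):
    """Minimum verified gain across test cases (base_runs / model_runs
    map test name -> list of 20 wall-clock timings)."""
    return min(
        min_significant_speedup(base_runs[t], model_runs[t])
        for t in base_runs
    )
```

With this construction, a task where even a 1% handicap destroys significance scores 0.0, which matches the interpretation above.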

Data

Task data is loaded at runtime from the SWE-Perf/SWE-Perf dataset on Hugging Face. No local data files are stored in this repository.
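
For example, the tasks can be browsed directly with the datasets library (the field name below is an assumption, not a guaranteed column name):

```python
from datasets import load_dataset

ds = load_dataset("SWE-Perf/SWE-Perf", split="test")
print(len(ds))                     # 140 tasks
print(ds[0]["problem_statement"])  # assumed field name
```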

Tools

| Tool | Parameters | Description |
| --- | --- | --- |
| bash | command: str | Execute bash commands in the sandbox (600s timeout) |
| view | path: str, start: int?, end: int? | View file contents with an optional 1-indexed line range |
| str_replace | path: str, old_str: str, new_str: str | Replace a string in a file |
| insert | path: str, start: int, content: str | Insert content at a 1-indexed line number |
| create | path: str, content: str | Create a new file with the given content |
| answer | (none) | Submit work for evaluation; runs performance benchmarks and ends the episode |
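
For illustration, here is a plausible episode fragment expressed as Python dicts; tool names and parameter keys follow the table above, while the path, code strings, and the wire format itself are hypothetical:

```python
# Hypothetical tool-call sequence: profile, inspect, patch, submit.
calls = [
    # Establish a baseline and surface the slowest tests.
    {"tool": "bash",
     "command": "cd /testbed && python -m pytest tests/ -q --durations=10"},
    # Inspect the suspect module (path is made up).
    {"tool": "view", "path": "/testbed/pkg/core.py", "start": 1, "end": 40},
    # Replace an O(n log n) idiom with an O(n) one.
    {"tool": "str_replace",
     "path": "/testbed/pkg/core.py",
     "old_str": "smallest = sorted(data)[0]",
     "new_str": "smallest = min(data)"},
    # Submit for benchmarking; this ends the episode.
    {"tool": "answer"},
]
```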

Time Horizon

SWE-Perf is a multi-turn environment. The agent iteratively explores the repository codebase, identifies performance bottlenecks, makes code edits, and optionally verifies changes before calling answer to submit for evaluation.

Environment Difficulty

The original paper evaluates models in file-level (oracle) and repo-level settings. Results show significant gaps between LLMs and expert-level optimization:

File-Level Setting:

| Model | Apply | Correctness | Performance |
| --- | --- | --- | --- |
| Expert | 100.0% | 100.0% | 10.85% |
| Gemini-2.5-Pro | 95.0% | 83.6% | 1.48% |
| Claude-4-opus | 85.7% | 78.6% | 1.28% |
| OpenAI-o3 | 78.6% | 76.4% | 1.37% |
| Claude-4-sonnet | 73.6% | 70.0% | 1.76% |
| Claude-3.7-sonnet | 66.4% | 61.4% | 1.24% |
| OpenAI-o1 | 66.4% | 63.6% | 0.41% |
| GPT-4o | 63.6% | 56.4% | 0.60% |
| DeepSeek-R1 | 55.7% | 51.4% | 0.90% |

Repo-Level Setting:

| Method | Apply | Correctness | Performance |
| --- | --- | --- | --- |
| Claude-3.7-sonnet (OpenHands) | 87.9% | 77.9% | 2.26% |
| Claude-3.7-sonnet (Agentless) | 88.6% | 70.7% | 0.41% |

Other Environment Requirements

No external API keys or secrets are required beyond access to the OpenReward platform.

Safety

Agents operate within an isolated Docker sandbox with a per-instance container image. The sandbox provides no network access to external systems. All code execution is confined to the container environment.

Citations

@article{he2025sweperf,
  title={SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?},
  author={He, Xinyi and Liu, Qian and Du, Mingzhe and Yan, Lin and Fan, Zhijie and Huang, Yiming and Yuan, Zejian and Ma, Zejun},
  journal={arXiv preprint arXiv:2507.12415},
  year={2025}
}