# SWE-Perf
## Description
SWE-Perf is a code performance optimization benchmark comprising 140 tasks derived from real-world performance-improving pull requests in popular GitHub repositories. An agent is placed in a repository sandbox and tasked with modifying code to improve execution speed. Performance is scored by measuring statistically significant speedup using the Mann-Whitney U test.
## Capabilities
- Code performance optimization in real-world repositories
- Repository-level code exploration and editing
- Identifying and resolving performance bottlenecks
- Statistical performance measurement and verification
## Compute Requirements
Each agent is given an isolated Docker sandbox with 4 CPUs and 16GB RAM. Per-instance Docker images provide pre-configured environments with the target repository and its dependencies already installed.
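The stated resource limits map onto standard Docker flags. The following is an illustrative configuration sketch only: the image tag `sweperf-instance:latest` is a placeholder, not a real published image, and the actual orchestration is platform-specific.

```shell
# Hypothetical invocation mirroring the limits described above.
# --network=none reflects the sandbox's lack of external network access.
docker run --rm \
  --cpus=4 \
  --memory=16g \
  --network=none \
  sweperf-instance:latest \
  bash
```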
## License
## Tasks
There are 140 tasks in a single test split. Each task is derived from a performance-improving pull request in one of 9 popular open-source GitHub repositories. The agent receives a problem statement describing a performance issue and must modify the codebase in `/testbed` to improve execution speed.
Each task includes:
- A repository checked out at a specific base commit
- A problem statement describing the performance issue
- Performance test cases that measure execution time before and after the agent's changes
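The timing side of such test cases can be pictured with a minimal harness. This is an illustrative sketch only; SWE-Perf's actual tests ship pre-configured inside each instance image, and `time_workload` is a hypothetical helper name.

```python
import time

def time_workload(fn, repeats=20):
    """Run `fn` `repeats` times and return per-run wall-clock times.

    Illustrative sketch only: the benchmark's real harness is defined
    per instance, but the principle is the same -- collect a sample of
    timings so base and modified code can be compared statistically.
    """
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times
```

Collecting a sample of timings (rather than a single measurement) is what makes the statistical comparison in the reward computation possible.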
## Reward Structure
Rewards are continuous in the range [0.0, 1.0] and are computed deterministically without an LLM grader.
When the agent calls the answer tool, the evaluation procedure is:
- Performance tests are run 20 times on the agent's modified code (model version).
- The code is reverted to the original base commit and the same performance tests are run 20 times (base version).
- For each test case, the Mann-Whitney U test (alpha=0.1) is used to compute the minimum statistically significant speedup by iteratively weakening the observed improvement until statistical significance is lost.
- The final reward is the minimum gain across all test cases, providing a conservative estimate of performance improvement.
A reward of 0.0 means no statistically significant speedup was detected. Higher values indicate greater verified speedup.
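The "iteratively weakening" step can be sketched as follows. This is a hypothetical reconstruction, not the grader's actual code: it assumes `scipy.stats.mannwhitneyu` for the test, models the weakening as inflating the modified code's timings by a growing factor, and the step size and safety cap are invented.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def min_significant_speedup(base_times, model_times, alpha=0.1, step=0.01):
    """Conservative speedup estimate (hypothetical reconstruction).

    Inflate the model timings by a growing factor until a one-sided
    Mann-Whitney U test no longer finds them significantly faster than
    the base timings; the last factor that stayed significant bounds
    the verified speedup from below.
    """
    base = np.asarray(base_times, dtype=float)
    model = np.asarray(model_times, dtype=float)
    # No statistically significant improvement at all -> reward 0.0.
    _, p = mannwhitneyu(model, base, alternative="less")
    if p >= alpha:
        return 0.0
    scale = 1.0
    while scale < 100.0:  # safety cap on the search
        nxt = scale + step
        _, p = mannwhitneyu(model * nxt, base, alternative="less")
        if p >= alpha:
            break  # significance lost; keep the last significant scale
        scale = nxt
    return scale - 1.0  # minimum verified relative speedup
```

Taking the minimum of this quantity across all test cases, as the grader does, keeps the final reward a conservative lower bound on the improvement.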
## Data
Task data is loaded at runtime from the SWE-Perf/SWE-Perf HuggingFace dataset. No local data files are stored in this repository.
## Tools
| Tool | Parameters | Description |
|---|---|---|
| `bash` | `command: str` | Execute bash commands in the sandbox (600s timeout) |
| `view` | `path: str`, `start: int?`, `end: int?` | View file contents with optional 1-indexed line range |
| `str_replace` | `path: str`, `old_str: str`, `new_str: str` | Replace a string in a file |
| `insert` | `path: str`, `start: int`, `content: str` | Insert content at a 1-indexed line number |
| `create` | `path: str`, `content: str` | Create a new file with the given content |
| `answer` | (none) | Submit work for evaluation; runs performance benchmarks and ends the episode |
## Time Horizon
SWE-Perf is a multi-turn environment. The agent iteratively explores the repository codebase, identifies performance bottlenecks, makes code edits, and optionally verifies changes before calling `answer` to submit for evaluation.
## Environment Difficulty
The original paper evaluates models in file-level (oracle) and repo-level settings. Results show significant gaps between LLMs and expert-level optimization:
File-Level Setting:
| Model | Apply | Correctness | Performance |
|---|---|---|---|
| Expert | 100.0% | 100.0% | 10.85% |
| Gemini-2.5-Pro | 95.0% | 83.6% | 1.48% |
| Claude-4-opus | 85.7% | 78.6% | 1.28% |
| OpenAI-o3 | 78.6% | 76.4% | 1.37% |
| Claude-4-sonnet | 73.6% | 70.0% | 1.76% |
| Claude-3.7-sonnet | 66.4% | 61.4% | 1.24% |
| OpenAI-o1 | 66.4% | 63.6% | 0.41% |
| GPT-4o | 63.6% | 56.4% | 0.60% |
| DeepSeek-R1 | 55.7% | 51.4% | 0.90% |
Repo-Level Setting:
| Method | Apply | Correctness | Performance |
|---|---|---|---|
| Claude-3.7-sonnet (OpenHands) | 87.9% | 77.9% | 2.26% |
| Claude-3.7-sonnet (Agentless) | 88.6% | 70.7% | 0.41% |
## Other Environment Requirements
There are no external API keys or secrets required beyond access to the OpenReward platform.
## Safety
Agents operate within an isolated Docker sandbox with a per-instance container image. The sandbox provides no network access to external systems. All code execution is confined to the container environment.
## Citations
```bibtex
@article{he2025sweperf,
  title={SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?},
  author={He, Xinyi and Liu, Qian and Du, Mingzhe and Yan, Lin and Fan, Zhijie and Huang, Yiming and Yuan, Zejian and Ma, Zejun},
  journal={arXiv preprint arXiv:2507.12415},
  year={2025}
}
```