# SWE-Perf
## Description
SWE-Perf is a code performance optimization benchmark comprising 140 tasks derived from real-world performance-improving pull requests in popular GitHub repositories. An agent is placed in a repository sandbox and tasked with modifying code to improve execution speed. Performance is scored by measuring statistically significant speedup using the Mann-Whitney U test.
## Capabilities
- Code performance optimization in real-world repositories
- Repository-level code exploration and editing
- Identifying and resolving performance bottlenecks
- Statistical performance measurement and verification
## Compute Requirements
Each agent is given an isolated Docker sandbox with 4 CPUs and 16GB RAM. Per-instance Docker images provide pre-configured environments with the target repository and its dependencies already installed.
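The stated resource limits map onto standard Docker flags. The following is an illustrative configuration sketch only: the image tag `sweperf-instance:latest` is a placeholder, not a real published image, and the actual orchestration is platform-specific.

```shell
# Hypothetical invocation mirroring the limits described above.
# --network=none reflects the sandbox's lack of external network access.
docker run --rm \
  --cpus=4 \
  --memory=16g \
  --network=none \
  sweperf-instance:latest \
  bash
```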
## License
## Tasks
There are 140 tasks in a single test split. Each task is derived from a performance-improving pull request in one of 9 popular open-source GitHub repositories. The agent receives a problem statement describing a performance issue and must modify the codebase in `/testbed` to improve execution speed.
Each task includes:
- A repository checked out at a specific base commit
- A problem statement describing the performance issue
- Performance test cases that measure execution time before and after the agent's changes
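The timing side of such test cases can be pictured with a minimal harness. This is an illustrative sketch only; SWE-Perf's actual tests ship pre-configured inside each instance image, and `time_workload` is a hypothetical helper name.

```python
import time

def time_workload(fn, repeats=20):
    """Run `fn` `repeats` times and return per-run wall-clock times.

    Illustrative sketch only: the benchmark's real harness is defined
    per instance, but the principle is the same -- collect a sample of
    timings so base and modified code can be compared statistically.
    """
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times
```

Collecting a sample of timings (rather than a single measurement) is what makes the statistical comparison in the reward computation possible.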
## Reward Structure
Rewards are continuous in the range [0.0, 1.0] and are computed deterministically without an LLM grader.
When the agent calls the answer tool, the evaluation procedure is:
- Performance tests are run 20 times on the agent's modified code (model version).
- The code is reverted to the original base commit and the same performance tests are run 20 times (base version).
- For each test case, the Mann-Whitney U test (alpha=0.1) is used to compute the minimum statistically significant speedup by iteratively weakening the observed improvement until statistical significance is lost.
- The final reward is the minimum gain across all test cases, providing a conservative estimate of performance improvement.
A reward of 0.0 means no statistically significant speedup was detected. Higher values indicate greater verified speedup.
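The "iteratively weakening" step can be sketched as follows. This is a hypothetical reconstruction, not the grader's actual code: it assumes `scipy.stats.mannwhitneyu` for the test, models the weakening as inflating the modified code's timings by a growing factor, and the step size and safety cap are invented.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def min_significant_speedup(base_times, model_times, alpha=0.1, step=0.01):
    """Conservative speedup estimate (hypothetical reconstruction).

    Inflate the model timings by a growing factor until a one-sided
    Mann-Whitney U test no longer finds them significantly faster than
    the base timings; the last factor that stayed significant bounds
    the verified speedup from below.
    """
    base = np.asarray(base_times, dtype=float)
    model = np.asarray(model_times, dtype=float)
    # No statistically significant improvement at all -> reward 0.0.
    _, p = mannwhitneyu(model, base, alternative="less")
    if p >= alpha:
        return 0.0
    scale = 1.0
    while scale < 100.0:  # safety cap on the search
        nxt = scale + step
        _, p = mannwhitneyu(model * nxt, base, alternative="less")
        if p >= alpha:
            break  # significance lost; keep the last significant scale
        scale = nxt
    return scale - 1.0  # minimum verified relative speedup
```

Taking the minimum of this quantity across all test cases, as the grader does, keeps the final reward a conservative lower bound on the improvement.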
## Data
Task data is loaded at runtime from the SWE-Perf/SWE-Perf HuggingFace dataset. No local data files are stored in this repository.
## Tools
| Tool | Parameters | Description |
|---|---|---|
| `bash` | `command: str` | Execute bash commands in the sandbox (600s timeout) |
| `view` | `path: str`, `start: int?`, `end: int?` | View file contents with optional 1-indexed line range |
| `str_replace` | `path: str`, `old_str: str`, `new_str: str` | Replace a string in a file |
| `insert` | `path: str`, `start: int`, `content: str` | Insert content at a 1-indexed line number |
| `create` | `path: str`, `content: str` | Create a new file with the given content |
| `answer` | (none) | Submit work for evaluation; runs performance benchmarks and ends the episode |
## Time Horizon
SWE-Perf is a multi-turn environment. The agent iteratively explores the repository codebase, identifies performance bottlenecks, makes code edits, and optionally verifies changes before calling `answer` to submit for evaluation.
## Environment Difficulty
The original paper evaluates models in file-level (oracle) and repo-level settings. Results show significant gaps between LLMs and expert-level optimization:
File-Level Setting:
| Model | Apply | Correctness | Performance |
|---|---|---|---|
| Expert | 100.0% | 100.0% | 10.85% |
| Gemini-2.5-Pro | 95.0% | 83.6% | 1.48% |
| Claude-4-opus | 85.7% | 78.6% | 1.28% |
| OpenAI-o3 | 78.6% | 76.4% | 1.37% |
| Claude-4-sonnet | 73.6% | 70.0% | 1.76% |
| Claude-3.7-sonnet | 66.4% | 61.4% | 1.24% |
| OpenAI-o1 | 66.4% | 63.6% | 0.41% |
| GPT-4o | 63.6% | 56.4% | 0.60% |
| DeepSeek-R1 | 55.7% | 51.4% | 0.90% |
Repo-Level Setting:
| Method | Apply | Correctness | Performance |
|---|---|---|---|
| Claude-3.7-sonnet (OpenHands) | 87.9% | 77.9% | 2.26% |
| Claude-3.7-sonnet (Agentless) | 88.6% | 70.7% | 0.41% |
## Other Environment Requirements
There are no external API keys or secrets required beyond access to the OpenReward platform.
## Safety
Agents operate within an isolated Docker sandbox with a per-instance container image. The sandbox provides no network access to external systems. All code execution is confined to the container environment.
## Citations
```bibtex
@article{he2025sweperf,
  title={SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?},
  author={He, Xinyi and Liu, Qian and Du, Mingzhe and Yan, Lin and Fan, Zhijie and Huang, Yiming and Yuan, Zejian and Ma, Zejun},
  journal={arXiv preprint arXiv:2507.12415},
  year={2025}
}
```