SWE-smith
Description
SWE-smith is a multi-language synthetic software engineering benchmark. It contains 40,942 tasks drawn from 128+ GitHub repositories spanning Python, Java, Go, Rust, C++, C, C#, JavaScript, and PHP. Tasks are automatically synthesized by breaking existing tests in real-world repositories, creating realistic bug-fixing scenarios where agents must diagnose failures and produce correct patches.
Capabilities
- Multi-language software engineering across 9 programming languages
- Bug fixing in real-world open-source repositories
- Test-based verification with deterministic grading
- Cross-language code editing and debugging
- Codebase exploration and understanding in isolated sandboxes
Compute Requirements
Agents in SWE-smith are given a sandbox with 4 CPUs and 8GB of RAM. Each task uses a per-task Docker image containing the repository and its dependencies pre-installed.
License
MIT.
Tasks
There are three splits in this environment, all drawn from the train split of the HuggingFace dataset:
- all: 40,942 tasks across all supported languages and repositories.
- python: 39,310 tasks from Python repositories (131 repos).
- non-python: 1,632 tasks from non-Python repositories (Java, Go, Rust, C++, C, C#, JavaScript, PHP).
Tasks that have an empty problem statement in the HuggingFace dataset are excluded. Tasks using the project-monai_1776_monai image are also excluded due to an oversized Docker layer.
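The split construction described above can be sketched as a simple filter over dataset rows. This is a minimal illustration, not the environment's actual loader; the field names (`problem_statement`, `image_name`, `repo`) are assumptions based on the task fields listed later in this document.

```python
# Hypothetical sketch of how the all/python/non-python splits could be
# derived from dataset rows. Field names are assumptions.

EXCLUDED_IMAGE = "project-monai_1776_monai"

def is_eligible(row):
    """Keep a task only if it has a non-empty problem statement and
    does not use the excluded oversized Docker image."""
    if not row.get("problem_statement", "").strip():
        return False
    if EXCLUDED_IMAGE in row.get("image_name", ""):
        return False
    return True

def build_splits(rows, python_repos):
    """Partition eligible rows into the three splits."""
    eligible = [r for r in rows if is_eligible(r)]
    python = [r for r in eligible if r["repo"] in python_repos]
    non_python = [r for r in eligible if r["repo"] not in python_repos]
    return {"all": eligible, "python": python, "non-python": non_python}
```

Note that the `python` and `non-python` splits are disjoint and together cover the `all` split exactly.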
Each task presents the agent with a problem statement describing a failing test or broken behavior. The agent must explore the repository, identify the root cause, and produce a fix.
Reward Structure
SWE-smith uses binary rewards (1.0 or 0.0) with deterministic test-based grading. When the agent calls the answer tool, the environment extracts the agent's patch, reverts any changes to test files, and runs the full test suite. The resolution status is computed as follows:
- FULL: fail-to-pass = 100% AND pass-to-pass = 100%. Reward = 1.0.
- PARTIAL: 0% < fail-to-pass < 100% AND pass-to-pass = 100%. Reward = 0.0.
- NO: All other cases. Reward = 0.0.
Only FULL resolution receives a reward of 1.0. No LLM graders are used.
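The resolution rules above reduce to a small pure function. This is a sketch of the stated logic, not the environment's actual grading code; the function names are illustrative.

```python
def resolution_status(f2p_passed, f2p_total, p2p_passed, p2p_total):
    """Map fail-to-pass and pass-to-pass test results to a resolution
    status, following the FULL / PARTIAL / NO rules stated above."""
    p2p_clean = p2p_passed == p2p_total
    if f2p_passed == f2p_total and p2p_clean:
        return "FULL"
    if 0 < f2p_passed < f2p_total and p2p_clean:
        return "PARTIAL"
    return "NO"

def reward(status):
    """Binary reward: only FULL resolution pays out."""
    return 1.0 if status == "FULL" else 0.0
```

In particular, a patch that fixes every failing test but breaks even one previously passing test falls through to NO and earns 0.0.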
Data
Task data is loaded at runtime from the SWE-bench/SWE-smith HuggingFace dataset. Each task includes the instance ID, repository name, problem statement, gold patch, Docker image name, and lists of fail-to-pass and pass-to-pass tests.
Tools
Agents are given two tools:
| Tool | Parameters | Description |
|---|---|---|
| bash | command: str | Execute bash commands in the sandbox. For Python tasks, commands run inside the testbed conda environment. 600-second timeout per command. |
| answer | (none) | Extracts the agent's patch via git diff, reverts test file changes, runs the full test suite, and computes the resolution status. Ends the episode. |
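The test-file revert performed by the answer tool can be illustrated by filtering test-file chunks out of a unified diff. This is a minimal sketch of one way to do it, not the environment's implementation; the test-directory prefixes are assumptions.

```python
def strip_test_changes(patch, test_prefixes=("tests/", "test/")):
    """Split a unified diff into per-file chunks and drop chunks that
    touch test files, mirroring the revert of test edits before grading.
    The prefixes are illustrative assumptions."""
    chunks, current = [], []
    for line in patch.splitlines(keepends=True):
        if line.startswith("diff --git "):
            if current:
                chunks.append(current)
            current = [line]
        elif current:
            current.append(line)
    if current:
        chunks.append(current)

    kept = []
    for chunk in chunks:
        # Header looks like: "diff --git a/path b/path" -> extract "path".
        path = chunk[0].split(" b/", 1)[-1].strip()
        if not path.startswith(test_prefixes):
            kept.append("".join(chunk))
    return "".join(kept)
```

After stripping test edits, the grader applies the remaining patch and runs the original fail-to-pass and pass-to-pass tests, so the agent cannot pass by editing the tests themselves.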
Time Horizon
SWE-smith is a multi-turn environment. The agent iteratively explores the codebase using bash, makes edits to fix the bug, and then calls answer to submit its solution and receive a final reward.
Environment Difficulty
[Put environment difficulty here]
Other Environment Requirements
No external API keys are needed beyond OpenReward platform access.
Safety
Agents in SWE-smith operate in isolated Docker sandboxes. Each task runs in its own container with limited resources (4 CPUs, 8GB RAM), preventing interference between tasks and containing any potentially destructive commands.
Citation
@article{yang2025swesmith,
title={SWE-smith: Scaling Data for Software Engineering Agents},
author={Yang, John and Lieret, Kilian and Jimenez, Carlos E. and Wettig, Alexander and Khandpur, Kabir and Zhang, Yanzhe and Hui, Binyuan and Press, Ofir and Schmidt, Ludwig and Yang, Diyi},
journal={arXiv preprint arXiv:2504.21798},
year={2025}
}