SWE-smith
Description
SWE-smith is a multi-language synthetic software engineering benchmark. It contains 40,942 tasks drawn from 128+ GitHub repositories spanning Python, Java, Go, Rust, C++, C, C#, JavaScript, and PHP. Tasks are automatically synthesized by breaking existing tests in real-world repositories, creating realistic bug-fixing scenarios where agents must diagnose failures and produce correct patches.
Capabilities
- Multi-language software engineering across 9 programming languages
- Bug fixing in real-world open-source repositories
- Test-based verification with deterministic grading
- Cross-language code editing and debugging
- Codebase exploration and understanding in isolated sandboxes
Compute Requirements
Agents in SWE-smith are given a sandbox with 4 CPUs and 8GB of RAM. Each task uses a per-task Docker image containing the repository and its dependencies pre-installed.
License
MIT.
Tasks
There are three splits in this environment, all drawn from the train split of the HuggingFace dataset:
- all: 40,942 tasks across all supported languages and repositories.
- python: 39,310 tasks from Python repositories (131 repos).
- non-python: 1,632 tasks from non-Python repositories (Java, Go, Rust, C++, C, C#, JavaScript, PHP).
Tasks that have an empty problem statement in the HuggingFace dataset are excluded. Tasks using the project-monai_1776_monai image are also excluded due to an oversized Docker layer.
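The split construction described above can be sketched as a simple filter over dataset rows. This is a minimal illustration, not the environment's actual loader; the field names (`problem_statement`, `image_name`, `repo`) are assumptions based on the task fields listed later in this document.

```python
# Hypothetical sketch of how the all/python/non-python splits could be
# derived from dataset rows. Field names are assumptions.

EXCLUDED_IMAGE = "project-monai_1776_monai"

def is_eligible(row):
    """Keep a task only if it has a non-empty problem statement and
    does not use the excluded oversized Docker image."""
    if not row.get("problem_statement", "").strip():
        return False
    if EXCLUDED_IMAGE in row.get("image_name", ""):
        return False
    return True

def build_splits(rows, python_repos):
    """Partition eligible rows into the three splits."""
    eligible = [r for r in rows if is_eligible(r)]
    python = [r for r in eligible if r["repo"] in python_repos]
    non_python = [r for r in eligible if r["repo"] not in python_repos]
    return {"all": eligible, "python": python, "non-python": non_python}
```

Note that the `python` and `non-python` splits are disjoint and together cover the `all` split exactly.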
Each task presents the agent with a problem statement describing a failing test or broken behavior. The agent must explore the repository, identify the root cause, and produce a fix.
Reward Structure
SWE-smith uses binary rewards (1.0 or 0.0) with deterministic test-based grading. When the agent calls the answer tool, the environment extracts the agent's patch, reverts any changes to test files, and runs the full test suite. The resolution status is computed as follows:
- FULL: fail-to-pass = 100% AND pass-to-pass = 100%. Reward = 1.0.
- PARTIAL: 0% < fail-to-pass < 100% AND pass-to-pass = 100%. Reward = 0.0.
- NO: All other cases. Reward = 0.0.
Only FULL resolution receives a reward of 1.0. No LLM graders are used.
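The resolution rules above reduce to a small pure function. This is a sketch of the stated logic, not the environment's actual grading code; the function names are illustrative.

```python
def resolution_status(f2p_passed, f2p_total, p2p_passed, p2p_total):
    """Map fail-to-pass and pass-to-pass test results to a resolution
    status, following the FULL / PARTIAL / NO rules stated above."""
    p2p_clean = p2p_passed == p2p_total
    if f2p_passed == f2p_total and p2p_clean:
        return "FULL"
    if 0 < f2p_passed < f2p_total and p2p_clean:
        return "PARTIAL"
    return "NO"

def reward(status):
    """Binary reward: only FULL resolution pays out."""
    return 1.0 if status == "FULL" else 0.0
```

In particular, a patch that fixes every failing test but breaks even one previously passing test falls through to NO and earns 0.0.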
Data
Task data is loaded at runtime from the SWE-bench/SWE-smith HuggingFace dataset. Each task includes the instance ID, repository name, problem statement, gold patch, Docker image name, and lists of fail-to-pass and pass-to-pass tests.
Tools
Agents are given two tools:
| Tool | Parameters | Description |
|---|---|---|
| bash | command: str | Execute bash commands in the sandbox. For Python tasks, commands run inside the testbed conda environment. 600-second timeout per command. |
| answer | (none) | Extracts the agent's patch via git diff, reverts test file changes, runs the full test suite, and computes the resolution status. Ends the episode. |
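The test-file revert performed by the answer tool can be illustrated by filtering test-file chunks out of a unified diff. This is a minimal sketch of one way to do it, not the environment's implementation; the test-directory prefixes are assumptions.

```python
def strip_test_changes(patch, test_prefixes=("tests/", "test/")):
    """Split a unified diff into per-file chunks and drop chunks that
    touch test files, mirroring the revert of test edits before grading.
    The prefixes are illustrative assumptions."""
    chunks, current = [], []
    for line in patch.splitlines(keepends=True):
        if line.startswith("diff --git "):
            if current:
                chunks.append(current)
            current = [line]
        elif current:
            current.append(line)
    if current:
        chunks.append(current)

    kept = []
    for chunk in chunks:
        # Header looks like: "diff --git a/path b/path" -> extract "path".
        path = chunk[0].split(" b/", 1)[-1].strip()
        if not path.startswith(test_prefixes):
            kept.append("".join(chunk))
    return "".join(kept)
```

After stripping test edits, the grader applies the remaining patch and runs the original fail-to-pass and pass-to-pass tests, so the agent cannot pass by editing the tests themselves.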
Time Horizon
SWE-smith is a multi-turn environment. The agent iteratively explores the codebase using bash, makes edits to fix the bug, and then calls answer to submit its solution and receive a final reward.
Environment Difficulty
[Put environment difficulty here]
Other Environment Requirements
No external API keys are needed beyond OpenReward platform access.
Safety
Agents in SWE-smith operate in isolated Docker sandboxes. Each task runs in its own container with limited resources (4 CPUs, 8GB RAM), preventing interference between tasks and containing any potentially destructive commands.
Citation
@article{yang2025swesmith,
title={SWE-smith: Scaling Data for Software Engineering Agents},
author={Yang, John and Lieret, Kilian and Jimenez, Carlos E. and Wettig, Alexander and Khandpur, Kabir and Zhang, Yanzhe and Hui, Binyuan and Press, Ofir and Schmidt, Ludwig and Yang, Diyi},
journal={arXiv preprint arXiv:2504.21798},
year={2025}
}