SWE-Gym

Description

SWE-Gym is a training and evaluation environment for software engineering agents. It contains 2,438 real-world Python task instances sourced from 11 popular open-source repositories (including django, flask, sympy, pandas, and others). Each task provides a codebase with an executable runtime, a natural language problem statement describing an issue, and unit tests that verify whether the issue has been resolved.

Capabilities

Real-world software engineering tasks on production Python codebases
Python bug fixing and feature implementation
Test-based verification of code changes
File viewing, editing, and creation within a repository sandbox
Bash command execution for codebase exploration and testing

Compute Requirements

Each agent is given an isolated Docker sandbox with 1 CPU and 2 GB of RAM. Every task uses its own Docker image, with the dependencies for that repository and version pre-installed.

License

MIT.

Tasks

There are two splits in this environment:

all: 2,438 task instances spanning 11 Python repositories. This is the full SWE-Gym training set, excluding instances with missing Docker images.
lite: 230 curated task instances, a subset of the full set selected for higher quality and diversity. Like the full set, it excludes instances with missing Docker images.

Each task provides:

A problem statement describing the issue to be fixed (from the original GitHub issue or pull request).
A codebase checked out at the relevant base commit in /testbed.
Unit tests (FAIL_TO_PASS and PASS_TO_PASS) that determine whether the fix is correct.

Reward Structure

Rewards are binary (1.0 or 0.0) and deterministic. When the agent calls the answer tool, the environment:

Extracts the git diff of all changes made to the codebase.
Runs the evaluation test suite using swebench.harness.grading.
Returns a reward of 1.0 if the issue is resolved (all FAIL_TO_PASS tests now pass and all PASS_TO_PASS tests still pass), and 0.0 otherwise.

No LLM graders are used for this environment.
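The resolution rule can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not the code in swebench.harness.grading; the function name `binary_reward` and the `test_passed` mapping are hypothetical, and a test missing from the results is assumed to count as a failure.

```python
from typing import Iterable, Mapping


def binary_reward(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    test_passed: Mapping[str, bool],
) -> float:
    """Return 1.0 only if every required test passed, else 0.0.

    `test_passed` maps a test name to whether it passed after the
    agent's patch was applied (hypothetical input format).
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # Assumption: a required test with no recorded result counts as failed.
    resolved = all(test_passed.get(name, False) for name in required)
    return 1.0 if resolved else 0.0
```

Because the reward is all-or-nothing, a patch that fixes the target tests but breaks a single PASS_TO_PASS test still scores 0.0.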

Data

Task data is loaded at runtime from HuggingFace:

SWE-Gym/SWE-Gym for the full dataset.
SWE-Gym/SWE-Gym-Lite for the curated lite subset.

The 36 instances whose Docker images are unavailable are excluded automatically.
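A minimal sketch of loading a split and dropping unavailable instances, assuming the Hugging Face `datasets` library. The `train` split name and the `instance_id` field are assumptions based on the SWE-bench data format, and `drop_unavailable` and `load_split` are hypothetical helpers, not part of this environment's code.

```python
from typing import Iterable, Mapping

# Dataset identifiers as listed above.
DATASET_IDS = {"all": "SWE-Gym/SWE-Gym", "lite": "SWE-Gym/SWE-Gym-Lite"}


def drop_unavailable(
    instances: Iterable[Mapping], unavailable_ids: Iterable[str]
) -> list:
    """Keep only instances whose ID is not in the unavailable set."""
    blocked = set(unavailable_ids)
    # Assumption: each row carries an "instance_id" field.
    return [row for row in instances if row["instance_id"] not in blocked]


def load_split(name: str, unavailable_ids: Iterable[str] = ()) -> list:
    """Load the 'all' or 'lite' split and filter out unavailable instances."""
    # Imported lazily so the filter above works without the library installed.
    from datasets import load_dataset

    # Assumption: the task instances live in the "train" split.
    rows = load_dataset(DATASET_IDS[name], split="train")
    return drop_unavailable(rows, unavailable_ids)
```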

Tools

Tool	Parameters	Description
`bash`	`command: str`	Execute bash commands in the sandbox (600s timeout). Runs within the testbed conda environment.
`view`	`path: str`, `start: int?`, `end: int?`	View file contents or a specific line range (1-indexed, inclusive).
`str_replace`	`path: str`, `old_str: str`, `new_str: str`	Replace all occurrences of a string in a file. Shows the resulting diff.
`insert`	`path: str`, `start: int`, `content: str`	Insert content at a given 1-indexed line number. Shows the resulting diff.
`create`	`path: str`, `content: str`	Create a new file with the given content.
`answer`	(none)	Extract the patch, run the test suite, and return the resolved status. Ends the episode.
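The line-numbering and replacement semantics in the table can be illustrated on plain strings. This is a sketch, not the environment's implementation: the real tools take a `path` and operate on files in the sandbox. It also assumes that `insert` places the new content so that it begins at line `start`, which the table does not spell out.

```python
import difflib
from typing import Optional, Tuple


def view(text: str, start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Return lines start..end of text, 1-indexed and inclusive."""
    lines = text.splitlines()
    first = 1 if start is None else start
    last = len(lines) if end is None else end
    return "\n".join(lines[first - 1 : last])


def str_replace(text: str, old_str: str, new_str: str) -> Tuple[str, str]:
    """Replace every occurrence of old_str and return (new_text, diff)."""
    updated = text.replace(old_str, new_str)
    diff = "\n".join(
        difflib.unified_diff(
            text.splitlines(),
            updated.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
    return updated, diff


def insert(text: str, start: int, content: str) -> str:
    """Insert content so that it begins at 1-indexed line `start`."""
    lines = text.splitlines()
    # Assumption: existing lines from `start` onward shift down.
    lines[start - 1 : start - 1] = content.splitlines()
    return "\n".join(lines)
```

Since `str_replace` touches all occurrences, an `old_str` with enough surrounding context to be unique keeps an edit from changing unintended lines.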

Time Horizon

SWE-Gym is a multi-turn environment. The agent iteratively explores the codebase, identifies the root cause of the issue, implements a fix, and verifies it before submitting via the answer tool. The episode ends when answer is called.

Environment Difficulty

[Put environment difficulty here]

Other Environment Requirements

No external API keys are required beyond OpenReward platform access. The per-task Docker images are managed by the OpenReward sandbox infrastructure.

Safety

Agents operate in isolated Docker sandboxes provisioned per task. Each sandbox is resource-limited (1 CPU, 2 GB RAM) and network-restricted. The agent cannot affect the host system or other running environments.

Citation

@inproceedings{pan2025swegym,
  title={Training Software Engineering Agents and Verifiers with SWE-Gym},
  author={Pan, Jiayi and Wang, Xingyao and Neubig, Graham and Jaitly, Navdeep and Ji, Heng and Suhr, Alane and Zhang, Yizhe},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year={2025}
}

Repository

Source repository

GeneralReasoning/env-swe-gym

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	2 vCPUs / 4 GB RAM
Sandbox Machine	1 vCPU / 2 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000460
Sandbox	$0.0000230
Total	$0.0000690

Examples

5-minute session: $0.0207

1-hour session: $0.2484
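The example figures follow from multiplying the combined per-second rate by the session length. A small sketch of that arithmetic, with the rates copied from the table above; the function name `session_cost` is illustrative only.

```python
# Per-second rates in USD, taken from the cost table.
ENVIRONMENT_RATE = 0.0000460
SANDBOX_RATE = 0.0000230


def session_cost(seconds: float) -> float:
    """Estimated cost in USD of an active session lasting `seconds`."""
    total_rate = ENVIRONMENT_RATE + SANDBOX_RATE  # 0.0000690 per second
    return round(total_rate * seconds, 4)


print(f"{session_cost(5 * 60):.4f}")   # 0.0207
print(f"{session_cost(60 * 60):.4f}")  # 0.2484
```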
