# SWE-Gym-Lite
## Description
SWE-Gym-Lite is an environment for evaluating agents on real-world software engineering tasks. Based on SWE-Gym, it presents agents with GitHub issues from popular Python repositories; the agent must modify the codebase to resolve each issue. Each task includes a sandboxed repository with pre-installed dependencies and executable test verification.
This OpenReward implementation is ported from the Harbor Framework implementation originally written by tangken333.
## Capabilities
- Resolving real-world GitHub issues
- Understanding and navigating large codebases
- Writing and modifying Python code
- Debugging and test-driven development
## Compute Requirements
Agents are given a sandboxed Docker environment. Default sandbox size is 1 CPU and 2 GB RAM.
## License
MIT.
## Tasks
There is one split in this environment:
- Test: 185 GitHub issues across 10 Python repositories
Tasks span the following repositories:
- mypy (41 tasks): Python static type checker
- moto (36 tasks): AWS service mocking library
- dvc (29 tasks): Data version control
- monai (27 tasks): Medical imaging deep learning
- pydantic (19 tasks): Data validation library
- conan (11 tasks): C/C++ package manager
- dask (10 tasks): Parallel computing library
- hydra (9 tasks): Configuration framework
- pandas (2 tasks): Data analysis library
- bokeh (1 task): Interactive visualization library
Each task presents a GitHub issue with a bug report or feature request. The agent must understand the issue, locate the relevant code, and submit a patch that passes the repository's test suite.
## Reward Structure
This is a multi-turn environment with binary reward:
- 1.0 — All relevant tests pass after applying the agent's patch
- 0.0 — Tests fail or patch cannot be applied
Verification follows the SWE-Bench evaluation protocol: the test harness applies the agent's patch, runs the affected tests from the repository's test suite, and checks that:
- Tests that were failing before the fix now pass (FAIL_TO_PASS)
- Tests that were passing before remain passing (PASS_TO_PASS)
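The FAIL_TO_PASS / PASS_TO_PASS rule above can be sketched as a simple predicate. This is an illustrative sketch only; the function and field names are assumptions, not the environment's actual harness API.

```python
# Sketch of the SWE-Bench-style binary reward decision.
# `results` maps test name -> whether it passed after the patch.

def compute_reward(results: dict,
                   fail_to_pass: list,
                   pass_to_pass: list) -> float:
    """Return 1.0 only if every FAIL_TO_PASS test now passes
    and every PASS_TO_PASS test still passes; otherwise 0.0."""
    resolved = all(results.get(t, False) for t in fail_to_pass)
    unbroken = all(results.get(t, False) for t in pass_to_pass)
    return 1.0 if resolved and unbroken else 0.0

# Example: the fix resolves the issue without breaking existing tests.
results = {"test_bugfix": True, "test_existing": True}
print(compute_reward(results, ["test_bugfix"], ["test_existing"]))  # 1.0
```

A missing test is treated as a failure, matching the all-or-nothing reward described above.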
## Data
Data consists of 185 task directories, each containing an instruction file describing the GitHub issue, solution files for oracle verification, and a test harness. Tasks are derived from the SWE-Gym Lite split.
## Tools
| Tool | Description |
|---|---|
| bash | Run bash commands in the sandbox container. |
| str_replace | Replace a unique string in a file with another string. |
| view | View file contents or directory listings. |
| create_file | Create a new file with specified content. |
| submit_answer | Submit work for verification. Runs the test harness and returns reward. |
## Time Horizon
SWE-Gym-Lite is a multi-turn environment. Agents explore the repository, understand the bug, locate relevant code, implement a fix, and verify with tests before submitting.
## Environment Difficulty
The SWE-Gym paper (ICML 2025) evaluates agents on SWE-Bench Verified and Lite:
| Agent | SWE-Bench Verified | SWE-Bench Lite |
|---|---|---|
| SWE-Gym Fine-tuned (32B) | 32.0% | 26.0% |
| GPT-4o (OpenHands) | 21.8% | 18.4% |
| Claude 3.5 Sonnet (OpenHands) | 30.2% | 24.8% |
Fine-tuning on SWE-Gym trajectories yields absolute gains of up to 14 percentage points over base agent performance.
## Other Environment Requirements
There are no external API key requirements; SWE-Gym-Lite works out of the box with the OpenReward endpoint.
## Safety
Agents in SWE-Gym-Lite modify code within isolated Docker containers. The environment does not involve production systems or external network access beyond the sandbox.
## Citations
```bibtex
@inproceedings{pan2025swegym,
  author    = {Jiayi Pan and Xingyao Wang and Graham Neubig and Navdeep Jaitly and Heng Ji and Alane Suhr and Yizhe Zhang},
  title     = {Training Software Engineering Agents and Verifiers with SWE-Gym},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2412.21139}
}
```