
SWE-Gym-Lite

OpenReward Environment

Description

SWE-Gym-Lite is an environment for evaluating agents on real-world software engineering tasks. Based on SWE-Gym, it presents agents with GitHub issues from popular Python repositories and requires them to modify the codebase to resolve each issue. Each task includes a sandboxed repository with pre-installed dependencies and executable test verification.

This OpenReward implementation is ported from the Harbor Framework version originally written by tangken333.

Capabilities

  • Resolving real-world GitHub issues
  • Understanding and navigating large codebases
  • Writing and modifying Python code
  • Debugging and test-driven development

Compute Requirements

Agents are given a sandboxed Docker environment. Default sandbox size is 1 CPU and 2 GB RAM.

License

MIT.

Tasks

There is one split in this environment:

  • Test: 185 GitHub issues across 10 Python repositories

Tasks span the following repositories:

  • mypy (41 tasks): Python static type checker
  • moto (36 tasks): AWS service mocking library
  • dvc (29 tasks): Data version control
  • monai (27 tasks): Medical imaging deep learning
  • pydantic (19 tasks): Data validation library
  • conan (11 tasks): C/C++ package manager
  • dask (10 tasks): Parallel computing library
  • hydra (9 tasks): Configuration framework
  • pandas (2 tasks): Data analysis library
  • bokeh (1 task): Interactive visualization library

Each task presents a GitHub issue with a bug report or feature request. The agent must understand the issue, locate the relevant code, and submit a patch that passes the repository's test suite.

Reward Structure

This is a multi-turn environment with binary reward:

  • 1.0 — All relevant tests pass after applying the agent's patch
  • 0.0 — Tests fail or patch cannot be applied

Verification follows the SWE-Bench evaluation protocol. The test harness applies the agent's solution patch, runs the repository's test suite on the affected tests, and checks that:

  1. Tests that were failing before the fix now pass (FAIL_TO_PASS)
  2. Tests that were passing before remain passing (PASS_TO_PASS)
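The two checks above can be sketched as a small function. This is an illustration of the verification contract, not the harness's actual API; `run_test` and the test-ID lists are hypothetical stand-ins:

```python
# Sketch of the SWE-Bench-style binary reward, assuming a run_test
# callable that returns True when a single named test passes.
# All names here are illustrative, not the harness's real interface.

def compute_reward(fail_to_pass, pass_to_pass, run_test):
    # FAIL_TO_PASS: tests that reproduced the bug must now pass.
    resolved = all(run_test(t) for t in fail_to_pass)
    # PASS_TO_PASS: previously passing tests must not regress.
    no_regressions = all(run_test(t) for t in pass_to_pass)
    return 1.0 if (resolved and no_regressions) else 0.0
```

Because the reward is binary, a patch that fixes the issue but breaks even one previously passing test still scores 0.0.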

Data

Data consists of 185 task directories, each containing an instruction file describing the GitHub issue, solution files for oracle verification, and a test harness. Tasks are derived from the SWE-Gym Lite split.
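A loader for such a task directory might look like the sketch below. The file names (`instruction.md`, `solution.patch`, `tests/`) are assumptions for illustration, not the dataset's documented layout:

```python
from pathlib import Path

# Hypothetical task-directory layout (file names are assumptions):
#   task_dir/
#     instruction.md   - GitHub issue text shown to the agent
#     solution.patch   - oracle solution used for verification
#     tests/           - test harness files

def load_task(task_dir: str) -> dict:
    """Read one task directory into a plain dict."""
    root = Path(task_dir)
    return {
        "instruction": (root / "instruction.md").read_text(),
        "solution": (root / "solution.patch").read_text(),
        "tests": sorted(p.name for p in (root / "tests").iterdir()),
    }
```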

Tools

  • bash: Run bash commands in the sandbox container.
  • str_replace: Replace a unique string in a file with another string.
  • view: View file contents or directory listings.
  • create_file: Create a new file with specified content.
  • submit_answer: Submit work for verification. Runs the test harness and returns the reward.
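The str_replace tool's key constraint is uniqueness: the old string must occur exactly once in the file, so the edit site is unambiguous. A minimal sketch of that contract (an illustration, not the tool's actual implementation):

```python
# Sketch of str_replace semantics: the old string must occur exactly
# once, otherwise the edit is rejected. Illustrative only.

def str_replace(text: str, old: str, new: str) -> str:
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}, found {count}")
    return text.replace(old, new)
```

In practice this pushes agents to quote enough surrounding context to make the target string unique before editing.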

Time Horizon

SWE-Gym-Lite is a multi-turn environment. Agents explore the repository, understand the bug, locate relevant code, implement a fix, and verify with tests before submitting.

Environment Difficulty

The SWE-Gym paper (ICML 2025) reports the following resolve rates on SWE-Bench Verified and SWE-Bench Lite:

  • SWE-Gym fine-tuned (32B): 32.0% Verified, 26.0% Lite
  • GPT-4o (OpenHands): 21.8% Verified, 18.4% Lite
  • Claude 3.5 Sonnet (OpenHands): 30.2% Verified, 24.8% Lite

Fine-tuning on SWE-Gym trajectories yields up to +14% absolute gains over base agent performance.

Other Environment Requirements

There are no external API key requirements; SWE-Gym-Lite works out of the box with the OpenReward endpoint.

Safety

Agents in SWE-Gym-Lite modify code within isolated Docker containers. The environment does not involve production systems or external network access beyond the sandbox.

Citations

@inproceedings{pan2025swegym,
  author    = {Jiayi Pan and Xingyao Wang and Graham Neubig and Navdeep Jaitly and Heng Ji and Alane Suhr and Yizhe Zhang},
  title     = {Training Software Engineering Agents and Verifiers with SWE-Gym},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2412.21139}
}