SWE-Bench-Verified
Description
SWE-Bench Verified is an environment for evaluating software engineering capabilities on real-world GitHub issues. It is built on the SWE-bench Verified dataset, a human-validated subset of SWE-bench. In each task, an agent is given a problem statement describing a real GitHub issue and must navigate the codebase, understand the bug, and produce a patch that resolves the issue and passes the test suite.
Capabilities
- Software engineering and bug fixing
- Navigating real-world Python codebases
- Understanding and resolving GitHub issues
- Writing patches that pass existing and withheld test suites
Compute Requirements
Agents are given a sandbox with 4 CPUs and 8 GB RAM, with repository-specific Docker images and conda environments.
License
Tasks
Three splits are available:
- all: 500 tasks
- mini: 50 tasks (curated subset)
- hard_subset: 45 tasks (each estimated to require more than 1 hour of software-engineering work)
Tasks span popular Python repositories including Django, scikit-learn, sympy, matplotlib, and more.
Reward Structure
SWE-Bench Verified uses a multi-turn reward structure. Agents edit code using bash commands and call answer when they consider the fix complete. The environment then runs the full test suite, including tests withheld from the agent. The reward is binary:
- 1.0: all FAIL_TO_PASS tests now pass and all PASS_TO_PASS tests continue to pass
- 0.0: otherwise
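The rule above can be sketched as a small function. This is an illustrative sketch, not the environment's actual grading code; the two dict-of-booleans inputs (test name to pass/fail outcome) are an assumed representation of the test-run results.

```python
def compute_reward(fail_to_pass: dict, pass_to_pass: dict) -> float:
    """Binary SWE-bench-style reward (illustrative sketch).

    fail_to_pass: outcomes for tests that failed before the patch
                  and must now pass.
    pass_to_pass: outcomes for tests that passed before the patch
                  and must still pass.
    """
    # Reward is all-or-nothing: every test in both sets must pass.
    if all(fail_to_pass.values()) and all(pass_to_pass.values()):
        return 1.0
    return 0.0

# One broken regression test is enough to zero out the reward.
print(compute_reward({"test_fix": True}, {"test_old": True}))   # 1.0
print(compute_reward({"test_fix": True}, {"test_old": False}))  # 0.0
```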
Data
Task specifications are sourced from HuggingFace princeton-nlp/SWE-bench_Verified. Repository snapshots are pre-loaded in Docker images and stored on the OpenReward platform.
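For orientation, a task record in the princeton-nlp/SWE-bench_Verified dataset looks roughly like the sketch below. The field names match the public dataset schema; the values here are stand-ins, and note that the test lists are stored as JSON-encoded strings.

```python
import json

# Illustrative record shaped like a SWE-bench_Verified row
# (field names from the public dataset; values are placeholders).
record = {
    "instance_id": "django__django-11099",
    "repo": "django/django",
    "problem_statement": "Description of the GitHub issue ...",
    "FAIL_TO_PASS": '["test_that_must_now_pass"]',
    "PASS_TO_PASS": '["test_that_must_keep_passing"]',
}

# Decode the JSON-encoded test lists before use.
fail_to_pass = json.loads(record["FAIL_TO_PASS"])
pass_to_pass = json.loads(record["PASS_TO_PASS"])
```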
Tools
- bash: Execute bash commands in the testbed conda environment
- answer: Submit work for test execution and grading
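A minimal sketch of what the bash tool does, assuming it is a thin wrapper over a sandboxed shell: run the command, capture output, and return it to the agent. The real environment executes inside the testbed conda environment in the Docker sandbox; this stand-in just uses the local shell.

```python
import subprocess

def bash(command: str, timeout: int = 60) -> dict:
    """Hypothetical bash-tool wrapper: run a shell command and
    return its output and exit code to the agent."""
    proc = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
    }

result = bash("echo hello")
```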
Time Horizon
Multi-turn. Agents explore the codebase over many tool calls, identify the root cause, write a fix, and submit their work for evaluation.
Environment Difficulty
| Model | Accuracy |
|---|---|
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
| MiniMax M2.5 | 80.2% |
| GPT-5.2 Thinking | 80.0% |
Other Environment Requirements
There are no further environment requirements; SWE-Bench Verified works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in SWE-Bench Verified edit code in sandboxed Docker containers. The environment does not present direct safety risks.
Citation
@inproceedings{jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
author={Jimenez, Carlos E and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik R},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}