SWE-Bench-Verified
Description
SWE-Bench Verified is an environment for evaluating software engineering capabilities on real-world GitHub issues. It is built on the SWE-bench Verified dataset, a human-validated subset of SWE-bench. In each task, an agent is given a problem statement describing a real GitHub issue and must navigate the codebase, understand the bug, and produce a patch that resolves the issue and passes the test suite.
Capabilities
- Software engineering and bug fixing
- Navigating real-world Python codebases
- Understanding and resolving GitHub issues
- Writing patches that pass existing and withheld test suites
Compute Requirements
Agents are given a sandbox with 4 CPUs and 8 GB RAM, with repository-specific Docker images and conda environments.
License
Tasks
Three splits are available:
- all: 500 tasks
- mini: 50 tasks (curated subset)
- hard_subset: 45 tasks (each estimated to require more than 1 hour of software-engineering work)
Tasks span popular Python repositories including Django, scikit-learn, sympy, matplotlib, and more.
Reward Structure
SWE-Bench Verified uses a multi-turn reward structure. Agents edit code using bash commands and call answer when they consider the fix complete. The environment then runs the full test suite, including tests withheld from the agent. The reward is binary:
- 1.0: all FAIL_TO_PASS tests now pass and all PASS_TO_PASS tests continue to pass
- 0.0: otherwise
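The rule above can be sketched as a small function. This is an illustrative sketch, not the environment's actual grading code; the two dict-of-booleans inputs (test name to pass/fail outcome) are an assumed representation of the test-run results.

```python
def compute_reward(fail_to_pass: dict, pass_to_pass: dict) -> float:
    """Binary SWE-bench-style reward (illustrative sketch).

    fail_to_pass: outcomes for tests that failed before the patch
                  and must now pass.
    pass_to_pass: outcomes for tests that passed before the patch
                  and must still pass.
    """
    # Reward is all-or-nothing: every test in both sets must pass.
    if all(fail_to_pass.values()) and all(pass_to_pass.values()):
        return 1.0
    return 0.0

# One broken regression test is enough to zero out the reward.
print(compute_reward({"test_fix": True}, {"test_old": True}))   # 1.0
print(compute_reward({"test_fix": True}, {"test_old": False}))  # 0.0
```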
Data
Task specifications are sourced from HuggingFace princeton-nlp/SWE-bench_Verified. Repository snapshots are pre-loaded in Docker images and stored on the OpenReward platform.
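For orientation, a task record in the princeton-nlp/SWE-bench_Verified dataset looks roughly like the sketch below. The field names match the public dataset schema; the values here are stand-ins, and note that the test lists are stored as JSON-encoded strings.

```python
import json

# Illustrative record shaped like a SWE-bench_Verified row
# (field names from the public dataset; values are placeholders).
record = {
    "instance_id": "django__django-11099",
    "repo": "django/django",
    "problem_statement": "Description of the GitHub issue ...",
    "FAIL_TO_PASS": '["test_that_must_now_pass"]',
    "PASS_TO_PASS": '["test_that_must_keep_passing"]',
}

# Decode the JSON-encoded test lists before use.
fail_to_pass = json.loads(record["FAIL_TO_PASS"])
pass_to_pass = json.loads(record["PASS_TO_PASS"])
```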
Tools
- bash: Execute bash commands in the testbed conda environment
- answer: Submit work for test execution and grading
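A minimal sketch of what the bash tool does, assuming it is a thin wrapper over a sandboxed shell: run the command, capture output, and return it to the agent. The real environment executes inside the testbed conda environment in the Docker sandbox; this stand-in just uses the local shell.

```python
import subprocess

def bash(command: str, timeout: int = 60) -> dict:
    """Hypothetical bash-tool wrapper: run a shell command and
    return its output and exit code to the agent."""
    proc = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
    }

result = bash("echo hello")
```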
Time Horizon
Multi-turn. Agents explore the codebase over many tool calls, identify the root cause, write a fix, and submit their work for evaluation.
Environment Difficulty
| Model | Accuracy |
|---|---|
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
| MiniMax M2.5 | 80.2% |
| GPT-5.2 Thinking | 80.0% |
Other Environment Requirements
There are no further environment requirements; SWE-Bench Verified works out of the box with the OpenReward endpoint without any external API keys.
Safety
Agents in SWE-Bench Verified edit code in sandboxed Docker containers. The environment does not present direct safety risks.
Citation
@inproceedings{jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
author={Jimenez, Carlos E and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik R},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}