SWE-Bench Pro

Description

SWE-Bench Pro is a contamination-resistant benchmark for evaluating autonomous software engineering agents on realistic, long-horizon, enterprise-level development tasks. It builds on SWE-Bench best practices and comprises 1,865 human-verified problems drawn from 41 actively maintained repositories, partitioned into public (11 repositories), held-out (12) and commercial (18) sets. Tasks call for multi-file patches representing hours to days of engineering work; the benchmark also features contextual augmentation of the problems and an analysis of agent failures clustered by failure mode.

Implementations (1)

Environment                      Stars   Last Updated
GeneralReasoning/SWE-Bench-Pro   0       1 month ago