SWE-Bench-Multilingual


Description

SWE-Bench-Multilingual is an environment for evaluating code repair capabilities across multiple programming languages. Following the SWE-bench methodology, agents are given real GitHub issues and must produce patches that resolve them while keeping existing tests passing. The environment extends SWE-bench beyond Python to repositories in a range of languages.

Capabilities

  • Multi-language code understanding and repair
  • GitHub issue resolution
  • Test-driven development
  • Codebase navigation and modification
  • Patch generation and validation

Compute Requirements

Agents are given a sandboxed environment with 4 CPUs and 8 GB RAM. Each task runs in a Docker container with the target repository pre-installed.

License

MIT.

Tasks

There is one split in this environment:

  • test: Validated multilingual SWE-bench instances

Tasks span multiple programming languages from real GitHub repositories.

Reward Structure

This is a multi-turn environment. The agent explores the codebase, makes code modifications, and calls answer to submit. The environment runs the SWE-bench evaluation harness to check if:

  1. All fail-to-pass tests now pass
  2. All pass-to-pass tests still pass

Reward is binary: 1.0 if the issue is resolved (all tests pass), 0.0 otherwise.
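The resolution check above can be sketched as a small function, assuming the harness reports per-test outcomes as booleans (the dict shape here is illustrative, not the harness's actual API):

```python
# Sketch of the binary reward, assuming per-test results are reported
# as {test_name: passed} mappings (field shapes are illustrative).
def compute_reward(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> float:
    """Return 1.0 only if every fail-to-pass test now passes and
    every pass-to-pass test still passes; otherwise 0.0."""
    resolved = all(fail_to_pass.values()) and all(pass_to_pass.values())
    return 1.0 if resolved else 0.0
```

Because the reward is all-or-nothing, a patch that fixes the issue but breaks even one previously passing test scores 0.0.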

Data

Data consists of SWE-bench instances sourced from the Hugging Face dataset SWE-bench/SWE-bench_Multilingual. Each task includes a problem statement, repository information, base commit, and test specifications.
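As a sketch of how those fields might be assembled into a task prompt, the function below uses the public SWE-bench schema field names (repo, base_commit, problem_statement); the prompt wording itself is hypothetical:

```python
# Illustrative: turn one SWE-bench instance into an agent prompt.
# Field names follow the public SWE-bench schema; the phrasing is
# an assumption, not the environment's actual prompt.
def build_prompt(instance: dict) -> str:
    return (
        f"Repository: {instance['repo']} @ {instance['base_commit']}\n\n"
        f"Issue:\n{instance['problem_statement']}\n\n"
        "Produce a patch that resolves the issue without breaking existing tests."
    )

example = {
    "repo": "example-org/example-repo",
    "base_commit": "abc123",
    "problem_statement": "Crash when parsing empty input.",
}
print(build_prompt(example))
```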

Tools

Tool         Description
bash         Execute shell commands in the sandbox
view         View file contents with optional line range
str_replace  Replace strings in files
insert       Insert content at a specific line
create       Create new files
answer       Submit final patch for evaluation. Ends the episode.
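A minimal sketch of what a str_replace-style edit tool might do, assuming (as many such tools do) that it rejects ambiguous edits where the target string occurs more than once — this is an illustration, not the environment's actual implementation:

```python
from pathlib import Path

# Hypothetical str_replace tool: replace a unique occurrence of `old`
# with `new` in the file at `path`, refusing ambiguous edits.
def str_replace(path: str, old: str, new: str) -> str:
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return f"error: string not found in {path}"
    if count > 1:
        return f"error: string occurs {count} times; edit is ambiguous"
    Path(path).write_text(text.replace(old, new, 1))
    return f"edited {path}"
```

Requiring uniqueness forces the agent to quote enough surrounding context to pin down exactly one edit site.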

Time Horizon

Multi-turn. The agent reads the problem statement, explores the codebase, implements fixes, and submits for evaluation.

Environment Difficulty

SWE-Bench-Multilingual evaluates real-world software engineering capabilities across multiple programming languages. The resolve rates below give a sense of current model performance:

Model                          Resolve Rate
MiniMax M2.5                   74.1%
GLM-5                          73.3%
Kimi K2.5                      73.0%
Gemini 3.1 Pro                 72.0%
Qwen 3 Coder Next (OpenHands)  64.3%

Other Environment Requirements

No secrets are required other than an OpenReward API key.

Safety

Agents in SWE-Bench-Multilingual work within sandboxed Docker containers. Code execution is isolated and the environment does not present direct safety risks.

Citation

@article{yang2025swesmith,
  title={SWE-smith: Scaling Data for Software Engineering Agents},
  author={Yang, John and Lieret, Kilian and Jimenez, Carlos E. and Wettig, Alexander and Khandpur, Kabir and Zhang, Yanzhe and Hui, Binyuan and Press, Ofir and Schmidt, Ludwig and Yang, Diyi},
  journal={arXiv preprint arXiv:2504.21798},
  year={2025}
}