R2E-Gym

Description

R2E-Gym is a real-world software engineering ORS training environment built on 8,100+ executable coding problem environments with procedurally generated and hybrid-verified test suites. Each task presents the agent with a real GitHub repository containing a bug or missing feature, and the agent must explore the codebase, diagnose the issue, and produce a working patch. Tasks are drawn from real open-source repositories and cover bug fixes, feature implementations, and other software maintenance activities.

Capabilities

Real-world software engineering across diverse open-source repositories
Bug diagnosis and fixing in production codebases
Feature implementation from natural language specifications
Codebase exploration and navigation
Test-based verification of solutions

Compute Requirements

Each task runs in an isolated Docker sandbox provisioned with 4 CPUs and 8GB of RAM. Per-task Docker images are pulled at runtime, each pre-configured with the target repository and its dependencies.

License

Apache 2.0.

Tasks

There are two splits in this environment:

all: ~8,100 tasks sourced from the R2E-Gym-V1 dataset. Each task corresponds to a real commit in an open-source repository, with a problem statement describing the issue and withheld tests for grading.
subset: ~4,578 tasks sourced from the R2E-Gym-Subset dataset, a curated subset of the full dataset.

Both splits are typed as train splits. Each task specifies a repo_name, docker_image, commit_hash, problem_statement, and expected_output_json for grading.
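As a rough illustration of the task record described above, the following sketch models the five listed fields as a Python dataclass. The field names come from this README; the class name `Task`, the helper `expected_statuses`, the sample values, and the assumption that expected_output_json decodes to a test-name-to-status object are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One task record; field names follow the README, the class itself is illustrative."""

    repo_name: str
    docker_image: str
    commit_hash: str
    problem_statement: str
    # Assumption: a JSON object mapping test name -> status string.
    expected_output_json: str

    def expected_statuses(self) -> dict[str, str]:
        # Decode the grading target into a test-name -> status mapping.
        return json.loads(self.expected_output_json)


# Made-up values for illustration; not taken from the dataset.
example_task = Task(
    repo_name="example-repo",
    docker_image="example/image:tag",
    commit_hash="0123abc",
    problem_statement="foo() raises on empty input.",
    expected_output_json='{"test_foo_empty": "PASSED", "test_foo_basic": "PASSED"}',
)
print(example_task.expected_statuses())
```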

Reward Structure

Rewards are binary and deterministic. The answer tool runs withheld pytest tests inside the sandbox and parses the test output into a mapping of test names to statuses (PASSED, FAILED, ERROR). This mapping is compared against the expected output via exact match:

1.0 if all test results match the expected output exactly.
0.0 if there is any mismatch in test count or individual test status.

No LLM graders are used; the score depends only on the parsed pytest output.
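The exact-match rule can be sketched in a few lines. This is a minimal illustration, not the environment's grading code: the function name `compute_reward` and the sample test names are hypothetical, and the pytest output is assumed to have already been parsed into a name-to-status dictionary.

```python
def compute_reward(actual: dict[str, str], expected: dict[str, str]) -> float:
    """Return 1.0 only if both mappings have the same tests with the same statuses."""
    # Dictionary equality fails on a missing test, an extra test, or any
    # differing status, which covers both mismatch cases described above.
    return 1.0 if actual == expected else 0.0


expected = {"test_a": "PASSED", "test_b": "FAILED"}

# Identical mapping: full reward.
print(compute_reward({"test_a": "PASSED", "test_b": "FAILED"}, expected))
# One status differs: no reward.
print(compute_reward({"test_a": "PASSED", "test_b": "PASSED"}, expected))
# One test is missing (count mismatch): no reward.
print(compute_reward({"test_a": "PASSED"}, expected))
```

Note that the expected mapping may itself contain FAILED or ERROR entries, so a reward of 1.0 means the results match the reference, not necessarily that every test passes.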

Data

Task data is loaded at runtime from HuggingFace datasets:

R2E-Gym/R2E-Gym-V1 for the "all" split
R2E-Gym/R2E-Gym-Subset for the "subset" split

A gold_patches.csv file (42MB) is included in the repository for validation testing. It contains reference patches keyed by commit_hash that, when applied, should produce a reward of 1.0.
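A lookup over the reference patches might look like the sketch below. Only the commit_hash key is documented here; the `patch` column name, the function `load_gold_patches`, and the in-memory sample are assumptions, so check the actual CSV header before relying on this.

```python
import csv
import io
from typing import TextIO

# Diffs can be large; raise the csv module's per-field size limit above its default.
csv.field_size_limit(2**31 - 1)


def load_gold_patches(handle: TextIO, patch_column: str = "patch") -> dict[str, str]:
    """Map commit_hash -> reference patch text.

    `patch_column` is an assumed column name, not confirmed by the README.
    """
    reader = csv.DictReader(handle)
    return {row["commit_hash"]: row[patch_column] for row in reader}


# Tiny in-memory stand-in for gold_patches.csv (made-up content).
sample_csv = io.StringIO(
    'commit_hash,patch\n'
    'abc123,"--- a/f.py\n+++ b/f.py\n"\n'
)
patches = load_gold_patches(sample_csv)
print(patches["abc123"])
```

For the real file, open it with `open("gold_patches.csv", newline="")` and pass the handle to the same function.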

Tools

Tool	Parameters	Description
`bash`	`command: str`	Execute a bash command in the sandbox's `/root/.venv` environment. Commands have a 600-second timeout by default. Returns stdout/stderr and exit code.
`answer`	(none)	Restores withheld tests, runs them via pytest, and computes the final score by comparing test results against expected output. Ends the episode. Can only be called once.

Time Horizon

R2E-Gym is a multi-turn environment. The agent receives a problem statement describing a bug or feature request, then iteratively uses the bash tool to explore the repository, understand the codebase, identify the relevant files, implement a fix, and verify the solution. When the agent is confident in its patch, it calls answer to run the withheld tests and receive a final score.
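The episode flow above can be pictured with a small stand-in. `ScriptedEpisode` and `run_episode` are hypothetical and do not reflect the real client API; the stub only mirrors the documented contract that bash may be called repeatedly and answer ends the episode and can be called once.

```python
class ScriptedEpisode:
    """Hypothetical stand-in for a session; returns canned output instead of running commands."""

    def __init__(self, final_reward: float) -> None:
        self._final_reward = final_reward
        self.finished = False
        self.history: list[str] = []

    def bash(self, command: str) -> str:
        if self.finished:
            raise RuntimeError("episode already ended by answer()")
        self.history.append(command)
        return f"(stub output for: {command})"

    def answer(self) -> float:
        if self.finished:
            raise RuntimeError("answer() can only be called once")
        self.finished = True
        return self._final_reward


def run_episode(env: ScriptedEpisode, commands: list[str]) -> float:
    """Issue each exploration or edit command in order, then submit for grading."""
    for command in commands:
        env.bash(command)
    return env.answer()


env = ScriptedEpisode(final_reward=1.0)
reward = run_episode(env, ["ls /testbed", "git -C /testbed status"])
print(reward, env.history)
```

A real agent would choose each command from the previous output rather than follow a fixed script; the fixed list only keeps the sketch self-contained.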

Environment Difficulty

[Put environment difficulty here]

Other Environment Requirements

No external API keys required.

Safety

Agents operate in isolated Docker sandboxes with no access to the host system or external network beyond what is required for the task. Each sandbox is provisioned per-task and destroyed after the episode completes. The agent's actions are confined to the /testbed directory containing the target repository.

Citation

@article{jain2025r2egym,
  title={R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents},
  author={Jain, Naman and Singh, Jaskirat and Shetty, Manish and Zheng, Liang and Sen, Koushik and Stoica, Ion},
  journal={arXiv preprint arXiv:2504.07164},
  year={2025}
}

Repository

Source repository

GeneralReasoning/env-r2e-gym

Compute Configuration

Resource allocation for this environment.

Component	Configuration
Environment Server	2 vCPUs / 4 GB RAM
Sandbox Machine	4 vCPUs / 8 GB RAM

Estimated Cost

Pay per second of active session usage. Billing starts when your session begins and stops when it ends.

Component	Cost / second
Environment	$0.0000460
Sandbox	$0.0000920
Total	$0.0001380

Examples

5-minute session: $0.0414

1-hour session: $0.4968
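The example figures follow from multiplying the total per-second rate by the session length. The sketch below reproduces that arithmetic with the rates from the cost table; the constant and function names are illustrative, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Per-second rates copied from the cost table above.
ENVIRONMENT_RATE = Decimal("0.0000460")
SANDBOX_RATE = Decimal("0.0000920")


def session_cost(seconds: int) -> Decimal:
    """Cost in dollars for an active session of the given length."""
    return (ENVIRONMENT_RATE + SANDBOX_RATE) * seconds


print(session_cost(5 * 60))   # 5-minute session
print(session_cost(60 * 60))  # 1-hour session
```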

R2E-Gym

Naman/R2E-Gym
