# GSO

## Description
GSO (General Software Optimization) is an environment for evaluating an agent's ability to optimize the runtime performance of real-world software. Each task provides a codebase and a performance test, and the agent must improve runtime efficiency. Tasks are derived from expert developer optimizations in commit histories of popular open-source libraries.
This OpenReward implementation is ported from the original Harbor Framework implementation by Ruofan Lu.
## Capabilities
- Profiling and identifying performance bottlenecks in codebases
- Implementing algorithmic and systems-level optimizations
- Working across multiple programming languages and domains
- Writing performance-correct code that passes existing test suites
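As a toy illustration of the kind of algorithmic change these tasks reward (not drawn from any actual GSO task), here is a classic data-structure swap that preserves correctness while cutting runtime:

```python
import timeit

data = list(range(5_000))
targets = list(range(4_500, 5_500))  # half hit, half miss

def count_hits_list(items, targets):
    # O(len(items)) per membership check: quadratic overall
    return sum(1 for t in targets if t in items)

def count_hits_set(items, targets):
    lookup = set(items)  # O(1) average membership check
    return sum(1 for t in targets if t in lookup)

# Correctness must be preserved before any speedup counts
assert count_hits_list(data, targets) == count_hits_set(data, targets)

t_list = timeit.timeit(lambda: count_hits_list(data, targets), number=3)
t_set = timeit.timeit(lambda: count_hits_set(data, targets), number=3)
print(f"list: {t_list:.3f}s  set: {t_set:.3f}s  speedup: {t_list / t_set:.0f}x")
```

Real GSO tasks demand far larger and more invasive edits than this, but the shape of the work is the same: find the hot path, change the algorithm or data structure, and keep the test suite green.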
## Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM, configurable per task.
## License
MIT.
## Tasks
There is one split in this environment:
- Test: 90 software optimization tasks
Tasks span 10 codebases across diverse domains and programming languages, including HuggingFace (datasets, tokenizers, transformers), NumPy, pandas, pydantic, Pillow, and Tornado.
## Reward Structure
This is a multi-turn, sandbox-based environment. The agent profiles, modifies, and tests code, then calls `submit_answer` for verification. The verifier measures runtime before and after the agent's patch and compares the resulting speedup against the expert developer's optimization.
- 1.0: Agent's optimization achieves ≥95% of the expert's speedup and passes correctness tests.
- 0.0: Optimization is insufficient or breaks correctness.
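The pass/fail rule above can be sketched as a simple ratio check. This is a minimal illustration; the function and variable names are hypothetical, not the verifier's actual API:

```python
def score(base_time: float, agent_time: float, expert_time: float,
          tests_pass: bool, threshold: float = 0.95) -> float:
    """Binary reward: 1.0 iff the agent's speedup reaches >=95% of the
    expert's speedup and correctness tests pass (illustrative sketch)."""
    if not tests_pass or agent_time <= 0:
        return 0.0
    agent_speedup = base_time / agent_time
    expert_speedup = base_time / expert_time
    return 1.0 if agent_speedup >= threshold * expert_speedup else 0.0

# Example: baseline 10s; expert reaches 2s (5.0x); agent reaches 2.1s (~4.76x).
# 4.76 >= 0.95 * 5.0 = 4.75, so the agent earns full reward.
print(score(10.0, 2.1, 2.0, True))
```

The binary threshold means a near-miss optimization that passes all tests still scores 0.0, which is what makes the benchmark hard to partially game.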
## Data
Each task directory contains an `instruction.md` with the optimization target and a `tests/` directory with performance benchmarks. Task data is stored on the OpenReward platform.
## Tools
| Tool | Description |
|---|---|
| `bash` | Execute shell commands in the sandbox. |
| `str_replace` | Replace a unique string in a file. |
| `view` | View file contents or list directory contents. |
| `create_file` | Create a new file with specified content. |
| `submit_answer` | Submit work for automated performance verification. |
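The `str_replace` semantics — a replacement that only succeeds when the target string is unique in the file — can be sketched as follows. This is an illustrative reimplementation, not the actual tool; the real tool's error handling may differ:

```python
import tempfile
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in the file at `path`, but only when `old`
    occurs exactly once, mirroring the tool's uniqueness requirement."""
    text = Path(path).read_text()
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly 1 occurrence of {old!r}, found {count}")
    Path(path).write_text(text.replace(old, new, 1))

# Demo on a throwaway file (hypothetical usage; the real tool is
# invoked by the agent through the environment, not called directly).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("x = 1\ny = 2\n")
    demo_path = f.name

str_replace(demo_path, "x = 1", "x = 10")
result = Path(demo_path).read_text()
print(result)
```

Requiring a unique match keeps edits unambiguous: an agent that tries to replace a string appearing in several places gets an error instead of a silent wrong edit.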
## Time Horizon
GSO episodes are multi-turn and open-ended: agents analyze the codebase, identify bottlenecks, implement optimizations, test correctness, and submit for verification.
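A first profiling pass inside the sandbox might look like the following minimal sketch. The profiled function is a made-up stand-in, not actual task code:

```python
import cProfile
import io
import pstats

def pairwise_products(n):
    # Deliberately quadratic hot loop, standing in for real task code
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

profiler = cProfile.Profile()
profiler.enable()
pairwise_products(400)
profiler.disable()

# Summarize the top entries by cumulative time to locate the bottleneck
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

In practice an agent would run `cProfile` (or the task's own performance test) against the repository's hot path rather than a toy function, then iterate on the patch until the benchmark improves.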
## Environment Difficulty
GSO is challenging. Opt@1 (single-attempt success rate) for the top performers on the GSO leaderboard:
| Model | Opt@1 |
|---|---|
| Claude-4.6-Opus | 33.3% |
| GPT-5.2 (high) | 27.5% |
| Claude-4.5-Opus | 26.5% |
| Gemini-3-Pro | 18.6% |
| Claude-4.5-Sonnet | 14.7% |
GSO solutions require 4-15x larger edits than existing benchmarks. Agents frequently resort to superficial "lazy optimizations" like compiler flags rather than genuine algorithmic improvements.
## Other Environment Requirements
There are no further environment requirements; GSO works out of the box with the OpenReward endpoint without any external API keys.
## Safety
Agents in GSO optimize open-source software in a sandboxed environment. The environment does not present direct safety risks.
## Citations
```bibtex
@inproceedings{shetty2025gso,
  author    = {Manish Shetty and Naman Jain and Jinjian Liu and Vijay Kethanaboyina and Koushik Sen and Ion Stoica},
  title     = {GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents},
  booktitle = {NeurIPS 2025 Datasets and Benchmarks Track},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.23671}
}
```