# GSO

## Description
GSO (General Software Optimization) is an environment for evaluating an agent's ability to optimize the runtime performance of real-world software. Each task provides a codebase and a performance test, and the agent must improve runtime efficiency. Tasks are derived from expert developer optimizations in commit histories of popular open-source libraries.
This OpenReward implementation is ported from the original Harbor Framework implementation by Ruofan Lu.
## Capabilities
- Profiling and identifying performance bottlenecks in codebases
- Implementing algorithmic and systems-level optimizations
- Working across multiple programming languages and domains
- Writing performance-correct code that passes existing test suites
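As a toy illustration of the kind of algorithmic change these tasks reward (not drawn from any actual GSO task), here is a classic data-structure swap that preserves correctness while cutting runtime:

```python
import timeit

data = list(range(5_000))
targets = list(range(4_500, 5_500))  # half hit, half miss

def count_hits_list(items, targets):
    # O(len(items)) per membership check: quadratic overall
    return sum(1 for t in targets if t in items)

def count_hits_set(items, targets):
    lookup = set(items)  # O(1) average membership check
    return sum(1 for t in targets if t in lookup)

# Correctness must be preserved before any speedup counts
assert count_hits_list(data, targets) == count_hits_set(data, targets)

t_list = timeit.timeit(lambda: count_hits_list(data, targets), number=3)
t_set = timeit.timeit(lambda: count_hits_set(data, targets), number=3)
print(f"list: {t_list:.3f}s  set: {t_set:.3f}s  speedup: {t_list / t_set:.0f}x")
```

Real GSO tasks demand far larger and more invasive edits than this, but the shape of the work is the same: find the hot path, change the algorithm or data structure, and keep the test suite green.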
## Compute Requirements
Agents are given a sandboxed environment with bash access and file editing tools. Default sandbox size is 1 CPU and 2 GB RAM, configurable per task.
## License
MIT.
## Tasks
There is one split in this environment:
- Test: 90 software optimization tasks
Tasks span 10 codebases across diverse domains and programming languages, including HuggingFace (datasets, tokenizers, transformers), NumPy, pandas, pydantic, Pillow, and Tornado.
## Reward Structure
This is a multi-turn, sandbox-based environment. The agent profiles, modifies, and tests code, then calls `submit_answer` for verification. The verifier measures runtime before and after the agent's patch and compares the resulting speedup against the expert developer's optimization.
- 1.0: Agent's optimization achieves ≥95% of the expert's speedup and passes correctness tests.
- 0.0: Optimization is insufficient or breaks correctness.
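The pass/fail rule above can be sketched as a simple ratio check. This is a minimal illustration; the function and variable names are hypothetical, not the verifier's actual API:

```python
def score(base_time: float, agent_time: float, expert_time: float,
          tests_pass: bool, threshold: float = 0.95) -> float:
    """Binary reward: 1.0 iff the agent's speedup reaches >=95% of the
    expert's speedup and correctness tests pass (illustrative sketch)."""
    if not tests_pass or agent_time <= 0:
        return 0.0
    agent_speedup = base_time / agent_time
    expert_speedup = base_time / expert_time
    return 1.0 if agent_speedup >= threshold * expert_speedup else 0.0

# Example: baseline 10s; expert reaches 2s (5.0x); agent reaches 2.1s (~4.76x).
# 4.76 >= 0.95 * 5.0 = 4.75, so the agent earns full reward.
print(score(10.0, 2.1, 2.0, True))
```

The binary threshold means a near-miss optimization that passes all tests still scores 0.0, which is what makes the benchmark hard to partially game.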
## Data
Each task directory contains an `instruction.md` with the optimization target and a `tests/` directory with performance benchmarks. Task data is stored on the OpenReward platform.
## Tools
| Tool | Description |
|---|---|
| `bash` | Execute shell commands in the sandbox. |
| `str_replace` | Replace a unique string in a file. |
| `view` | View file contents or list directory contents. |
| `create_file` | Create a new file with specified content. |
| `submit_answer` | Submit work for automated performance verification. |
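The `str_replace` semantics — a replacement that only succeeds when the target string is unique in the file — can be sketched as follows. This is an illustrative reimplementation, not the actual tool; the real tool's error handling may differ:

```python
import tempfile
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in the file at `path`, but only when `old`
    occurs exactly once, mirroring the tool's uniqueness requirement."""
    text = Path(path).read_text()
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly 1 occurrence of {old!r}, found {count}")
    Path(path).write_text(text.replace(old, new, 1))

# Demo on a throwaway file (hypothetical usage; the real tool is
# invoked by the agent through the environment, not called directly).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("x = 1\ny = 2\n")
    demo_path = f.name

str_replace(demo_path, "x = 1", "x = 10")
result = Path(demo_path).read_text()
print(result)
```

Requiring a unique match keeps edits unambiguous: an agent that tries to replace a string appearing in several places gets an error instead of a silent wrong edit.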
## Time Horizon
GSO episodes are multi-turn and open-ended: agents analyze the codebase, identify bottlenecks, implement optimizations, test correctness, and submit for verification.
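A first profiling pass inside the sandbox might look like the following minimal sketch. The profiled function is a made-up stand-in, not actual task code:

```python
import cProfile
import io
import pstats

def pairwise_products(n):
    # Deliberately quadratic hot loop, standing in for real task code
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

profiler = cProfile.Profile()
profiler.enable()
pairwise_products(400)
profiler.disable()

# Summarize the top entries by cumulative time to locate the bottleneck
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

In practice an agent would run `cProfile` (or the task's own performance test) against the repository's hot path rather than a toy function, then iterate on the patch until the benchmark improves.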
## Environment Difficulty
GSO is challenging. Opt@1 (single-attempt success rate) for the top performers on the GSO leaderboard:
| Model | Opt@1 |
|---|---|
| Claude-4.6-Opus | 33.3% |
| GPT-5.2 (high) | 27.5% |
| Claude-4.5-Opus | 26.5% |
| Gemini-3-Pro | 18.6% |
| Claude-4.5-Sonnet | 14.7% |
GSO solutions require 4-15x larger edits than existing benchmarks. Agents frequently resort to superficial "lazy optimizations" like compiler flags rather than genuine algorithmic improvements.
## Other Environment Requirements
There are no further environment requirements; GSO works out of the box with the OpenReward endpoint without any external API keys.
## Safety
Agents in GSO optimize open-source software in a sandboxed environment. The environment does not present direct safety risks.
## Citations
```bibtex
@inproceedings{shetty2025gso,
  author    = {Manish Shetty and Naman Jain and Jinjian Liu and Vijay Kethanaboyina and Koushik Sen and Ion Stoica},
  title     = {GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents},
  booktitle = {NeurIPS 2025 Datasets and Benchmarks Track},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.23671}
}
```