GBAGym
GBA-Emu-Gym
OpenReward gym port of mechanize-work/gba-eval — agents iteratively submit GBA emulator wasm builds and earn reward equal to the high-water mark of the grader's score.
Upstream is a one-shot 24-hour eval: the agent gets one Linux container with a Rust toolchain, writes a GBA emulator from scratch, and at the deadline a held-out grader runs the wasm against ~27 testcases (homebrew gameplay replays + procedural CPU/memory/DMA test ROMs + audio diffs against Mesen2). One submission, one number.
This is a gym: same shape, but the agent can submit any number of times per episode. Each submit call earns
reward = max(0, new_overall - previous_best_overall)
so an episode's cumulative reward equals the high-water mark of overall ∈ [0, 1]. A worse submission earns zero, no penalty. The agent's incentive is simple: ship something that compiles, get a baseline, then improve.
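The reward rule above can be sketched in a few lines of Python. This is an illustration only, not the env's actual code: the `HighWaterMark` class and its `reward` method are hypothetical names, and only the formula itself comes from the text.

```python
class HighWaterMark:
    """Tracks the best overall score seen so far in one episode (hypothetical helper)."""

    def __init__(self) -> None:
        self.best = 0.0

    def reward(self, new_overall: float) -> float:
        # reward = max(0, new_overall - previous_best_overall)
        delta = max(0.0, new_overall - self.best)
        self.best = max(self.best, new_overall)
        return delta


tracker = HighWaterMark()
scores = [0.25, 0.125, 0.5]  # the second submission regresses
rewards = [tracker.reward(s) for s in scores]
print(rewards)       # [0.25, 0.0, 0.25]
print(sum(rewards))  # 0.5
```

The regressing submission earns 0.0, and the rewards sum to the best score seen, which is the high-water-mark property described above.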
Status
End-to-end pipeline is verified live on OpenReward as GeneralReasoning/GBAGym. A no-op stub submission scores overall ≈ 0.0085 in roughly 3 minutes of grader wall-clock.
How the agent interacts with the env
Inside the agent's sandbox:
| Path | What it is |
|---|---|
| /task/ | Agent's working directory. An empty cargo workspace is expected; the agent creates Cargo.toml + src/lib.rs. |
| /task/spec/ABI.md | 10 C-ABI functions the agent's wasm must export (emu_init, emu_load_rom, emu_set_keys, emu_run_frame, emu_framebuffer, emu_audio_buffer, …). |
| /task/spec/gba_bios_stub.bin | 16 KiB ARM BIOS stub mapped at 0x00000000. |
| /task/spec/gbatek.htm | 5 MB GBATEK hardware reference. |
| /task/dev-roms/ | 5 visible GBA ROMs (a strict subset of the grader's full ROM set; see ROM split below). |
Available tools (per session):
| Tool | What it does |
|---|---|
| bash, read, write, edit, glob, grep, todo_write | Standard Claude-Code-style file/shell tools, scoped to the agent's sandbox. |
| oracle_run(rom_path, frames, replay_text?) | Run a ROM through the reference Mesen2 emulator in a separate sandbox; per-frame PPMs + WAV audio land back as a tarball at /task/.oracle-out/<run-id>.tar. Mesen2 itself is never visible to the agent. |
| submit | Grade whatever wasm is currently at /task/target/wasm32-unknown-unknown/release/gba_emu.wasm. Returns overall + section subscores. Per-testcase scores are deliberately not surfaced. |
| give_up | End the episode early. |
The agent is responsible for running cargo build itself: submit does not build anything, it only grades the artifact at the canonical path.
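Because submit only grades what is already on disk, an agent may want a quick preflight check before calling it. The sketch below is a hypothetical agent-side helper, not part of the env: `artifact_ready` is an invented name, and the magic-number check is an extra sanity test that relies only on the standard WebAssembly binary preamble.

```python
from pathlib import Path

# Canonical artifact path that submit grades, per the tool table above.
WASM_PATH = Path("/task/target/wasm32-unknown-unknown/release/gba_emu.wasm")
WASM_MAGIC = b"\x00asm"  # first four bytes of every WebAssembly binary


def artifact_ready(path: Path = WASM_PATH) -> bool:
    """Return True if a file exists at `path` and starts with the wasm magic."""
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(4) == WASM_MAGIC
```

Calling `artifact_ready()` after `cargo build --release --target wasm32-unknown-unknown` and before submit catches the "forgot to build" case without spending a grader run.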
ROM split
To preserve anti-shimming pressure while still letting the agent develop against real ROMs:
| Bucket | ROMs |
|---|---|
| Visible (in dev-roms/, also graded) | spout, waimanu (gameplay replays); armwrestler (procedural test) |
| Pure dev (in dev-roms/, never graded) | trogdor, another-world |
| Held-out (grader only) | celeste-classic, varooom-3d, bulletgba, chip-advance, collie-defense, goodboy-advance, heartwrench-advance, piugba, mgba-suite, jsmolka/memory, destoer/dma-priority, several nba-hw/* ROMs, tonc/snd1-demo, audio test ROMs |
The agent never sees the held-out names and submit only returns section aggregates (replay, procedural, audio), so they cannot reverse-engineer the held-out set from feedback.
Architecture
Each session has two sandboxes, both owned by env code (the agent never touches the eval sandbox):
┌──────────────────────────────────────────────────────────────────────┐
│ OpenReward env server (FastAPI, deployed from this repo) │
│ │
│ ┌────────────────────────┐ ┌────────────────────────────┐ │
│ │ agent_sandbox 4:16 │ │ eval_sandbox 4:16 │ │
│ │ ──────────────────── │ │ ────────────────────── │ │
│ │ rust 1.87 + wasmtime │ │ oracle binary │ │
│ │ clang / cmake / py3 │ │ grader binary │ │
│ │ spec/, dev-roms/ │ │ mesen.wasm │ │
│ │ │ │ full corpus (held-out) │ │
│ │ agent edits Rust here │ │ ref-cache (~230 MB LFS) │ │
│ │ ▲ │ │ ▲ │ │
│ └──┼─────────────────────┘ └──┼─────────────────────────┘ │
│ │ Claude-Code-style tools │ shell-in via env code │
│ │ + oracle_run + submit + give_up │ for oracle / grader runs │
│ └──────────── env code ────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Both sandboxes are pinned by GHCR digest (images/{task,eval}.sha, written by CI). Network is blocked on both pods.
oracle_run flow
- Env downloads ROM bytes from agent_sandbox at rom_path.
- Stages ROM (and optional replay file) in eval_sandbox at /eval/scratch/<run-id>/.
- Runs oracle run rom.gba <frames> --replay … --dump-frames frames/ --dump-audio audio.wav in eval_sandbox.
- Tars frames/ audio.wav together, downloads the tarball, uploads it to the agent's /task/.oracle-out/<run-id>.tar.
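On the agent side, the returned tarball can be inspected with Python's standard tarfile module. This is a minimal sketch under stated assumptions: the member layout (a frames/ directory plus a top-level audio.wav) is inferred from the tar step above, the individual frame file names are not documented here, and `list_oracle_outputs` is a hypothetical helper name.

```python
import tarfile


def list_oracle_outputs(tar_path):
    """Return (sorted frame member names, audio member name or None)."""
    frames, audio = [], None
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # Some tar invocations prefix members with "./"; normalise that.
            name = member.name.removeprefix("./")
            if name.startswith("frames/"):
                frames.append(name)
            elif name == "audio.wav":
                audio = name
    return sorted(frames), audio
```

Sorting by name only gives frame order if the oracle zero-pads frame numbers, which is also an assumption worth checking against a real run.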
submit flow
- Env downloads /task/target/wasm32-unknown-unknown/release/gba_emu.wasm from agent_sandbox. Errors clearly if absent (agent forgot to build).
- Uploads it to eval_sandbox at /eval/scratch/grade-<run-id>/candidate.wasm.
- Runs /eval/bin/grader --reference /opt/gba-eval/mesen.wasm candidate.wasm /eval/corpus <out> (15-min timeout).
- Reads <out>/summary.json → returns scalar overall + section scores. Cleans the scratch dir.
- Computes delta = max(0, overall - best_score), updates best_score, returns delta as reward.
Why this isolation matters
Mirrors upstream's services/task split: mesen.wasm and the oracle binary live only in the eval sandbox, never reachable from the agent's container. Stronger than upstream's HTTP-sidecar design — every interaction is a Pydantic-schema-validated tool call, no wire protocol exposed, and the agent has no network path to the reference at all.
Wasmtime memory patch
The default wasmtime::Config reserves ~4 GiB of virtual address space per linear memory + a 2 GiB guard region. The grader instantiates two wasms (reference + candidate) so a single submit would want ~12 GiB of VAS — the OpenReward sandbox's kernel/cgroup vm limits reject the mmap even at 4:16 (the largest non-GPU machine size). docker/eval.Dockerfile applies a small in-place patch to upstream/harness/grader/src/wasm_candidate.rs that switches every memory to dynamic mode with a 64 KiB guard, eliminating the giant mmap. Without it, every submit errors at instantiation with mmap failed to reserve 0x200000000 bytes.
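The address-space arithmetic behind the patch can be checked with a short calculation. The default figures are the ones quoted above; the 64 MiB linear-memory size in the patched case is an assumed value for illustration only, since the real size depends on each module.

```python
GIB = 1 << 30
MIB = 1 << 20
KIB = 1 << 10


def reserved_vas(instances: int, reservation: int, guard: int) -> int:
    """Virtual address space reserved for `instances` linear memories, in bytes."""
    return instances * (reservation + guard)


# Default config as described above: ~4 GiB reservation + 2 GiB guard, two wasms.
default_total = reserved_vas(2, 4 * GIB, 2 * GIB)

# Patched config: reservation tracks the memory's real size, 64 KiB guard.
# The 64 MiB linear-memory size is an ASSUMED figure for illustration only.
patched_total = reserved_vas(2, 64 * MIB, 64 * KIB)

print(default_total / GIB)  # 12.0
print(patched_total / MIB)  # 128.125
```

Under these assumptions, dynamic mode cuts the reservation by roughly two orders of magnitude, which is why the mmap succeeds within the sandbox limits.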
Repo layout
server.py OpenReward env server entry point
env.py GBAEmuGym Environment — dual-sandbox flow + tools
TASK.md Agent-facing prompt (baked into the task image)
Dockerfile Env server image (FastAPI, deployed by OpenReward)
docker/task.Dockerfile Agent sandbox image
docker/eval.Dockerfile Eval sandbox image (oracle + grader + corpus)
images/{task,eval}.sha Digest-pinned GHCR image refs (CI writes these)
upstream/ Git submodule → mechanize-work/gba-eval @ pinned SHA
.github/workflows/
  build-images.yml Builds task + eval images on every push, pushes to GHCR, commits digests back to images/
requirements.txt openreward, pydantic (env server runtime deps)
pyproject.toml Project metadata
runner/ (gitignored) Local dev tooling: interactive step_through script, snapshot extractor
Reward shape
| Property | Value |
|---|---|
| Range | overall ∈ [0, 1], cumulative episode reward ≤ 1.0 |
| Sign | Monotone non-decreasing — worse submissions earn 0, no penalty |
| Termination | No auto-finish from submit. Episode ends when the agent calls give_up or the harness enforces a step/wall-clock cap |
| Scoring | overall = 0.60 × replay + 0.20 × procedural + 0.20 × audio (configurable in corpus/grader.yaml, but agents can't see it) |
If you want larger reward magnitudes for training, scale at the trainer.
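The Scoring row in the table above amounts to a weighted sum, sketched here. The weights are the ones listed in that row; the dictionary keys and the `overall_score` function are hypothetical names chosen for the example, and the authoritative values live in corpus/grader.yaml.

```python
# Weights from the Scoring row; the real values live in corpus/grader.yaml.
WEIGHTS = {"replay": 0.60, "procedural": 0.20, "audio": 0.20}


def overall_score(sections: dict) -> float:
    """Weighted sum of section scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * sections[name] for name in WEIGHTS)


# Hypothetical section scores for illustration.
example = {"replay": 0.5, "procedural": 1.0, "audio": 0.0}
print(round(overall_score(example), 3))  # 0.5
```

Because the weights sum to 1, perfect section scores give overall = 1.0, matching the stated range.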
Local setup
git clone --recurse-submodules <this-repo>
cd GBA-Emu-Gym
git -C upstream lfs install && git -C upstream lfs pull # ~230 MB ref cache
uv pip install -r requirements.txt
Build the sandbox images locally:
docker build -f docker/task.Dockerfile -t gba-emu-gym-task:dev .
docker build -f docker/eval.Dockerfile -t gba-emu-gym-eval:dev .
CI does this automatically on every push to main and writes the resulting GHCR digests back to images/{task,eval}.sha. The env code reads those pins via env.py:_read_image_pin(...) and falls back to :latest tags for local development.
License
Inherits upstream's per-file licensing — see upstream/LEGAL.md. Briefly:
- Original work in this repo (env.py, server.py, Dockerfiles, README, TASK.md): MIT
- Upstream harness/spec/corpus (non-ROM): MIT
- corpus/roms/: per-ROM licenses (homebrew + open test ROMs)
- Mesen2 wasm + build glue: GPL-3.0
- spec/gba_bios_stub.bin: clean-room MIT (not a Nintendo dump)
Known limitations
- oracle_run is slow per call: a 600-frame run shuttles a ~70 MB tarball through env code (agent ↔ env ↔ eval, base64-over-HTTP). Workable, but agents that want many high-frame queries will see latency. Mitigations on the table: cap frames lower, add a session-style tool that holds state across many small steps, or ship a one-shot tool that returns a similarity score directly (no frame bytes cross the boundary).
- Each submit is 1-5 minutes of grader CPU. Episodes with hundreds of submits get expensive. Trainers should bound submit frequency.
- Visible ROMs are also graded, so the agent gets some direct signal from the visible set. This is intentional: it gives the agent a tractable iteration loop without leaking the bulk of the corpus.
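The ~70 MB figure for a 600-frame oracle_run is consistent with a back-of-the-envelope estimate. The sketch assumes the frames are binary PPMs with 8-bit RGB channels at the GBA's 240×160 resolution; it ignores the small PPM headers, tar overhead, and the WAV audio, so it is a lower bound rather than an exact size.

```python
GBA_WIDTH, GBA_HEIGHT = 240, 160  # GBA LCD resolution in pixels
BYTES_PER_PIXEL = 3               # ASSUMPTION: binary PPM with 8-bit RGB


def frame_dump_bytes(frames: int) -> int:
    """Raw pixel payload for `frames` PPM frames, ignoring per-file headers."""
    return frames * GBA_WIDTH * GBA_HEIGHT * BYTES_PER_PIXEL


def base64_bytes(raw: int) -> int:
    """Size after base64 encoding: 4 output bytes per 3 input bytes, padded."""
    return 4 * ((raw + 2) // 3)


raw = frame_dump_bytes(600)
print(raw)                # 69120000
print(base64_bytes(raw))  # 92160000
```

Under these assumptions the pixel data alone is about 69 MB, and base64 inflates each hop to about 92 MB, which is why capping frames or returning only a similarity score are the natural mitigations.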
Citations
@misc{gbaeval2026,
title = {GBA Eval},
author = {Mechanize Inc.},
year = {2026},
url = {https://gbaeval.com/},
note = {Upstream eval; this repo ports it to a multi-submit gym.},
}