complex_worlds_hack
RoboCerebra Reward Lab
OpenReward benchmark for long-horizon physical-AI
planning with dense, VLM-style subgoal rewards. The repo implements two
sides of the same story: a single-task breakfast-tray slice with symbolic
dense reward and tabular baselines (scripts/run_demo.py), and Shift
Mode — full hospitality shifts on the host where each episode is a
deterministic chain of 12–30 manipulation jobs (inventory, memory, clock,
tickets, scheduled non-stationary events) wired through 18 OpenReward tools.
On test shifts, the expert policy needs ~1660 tool calls (median) to succeed; the
strongest hand-written baseline, the reactive script, makes ~1130 and still fails.
Paper (methods, splits, leaderboard schema): RoboCerebra Reward Lab on OSF
Winning claim
Short-horizon benchmarks reward one good guess. This environment targets the
opposite: success requires long-horizon planning, credit assignment across
macro tool calls, and (in Shift Mode) memory recall, inventory tracking, event
handling, and eventual self-summarization — for shifts, that means composing
18 OpenReward tools over hundreds to thousands of calls before the
terminal success bit. The OSF paper above spells out reward decomposition,
splits (train / validation / test), and the evaluation schema for
judge-style write-ups and leaderboard entries.
Quick Start
python3 -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"
.venv/bin/python scripts/run_demo.py

The demo writes:
- artifacts/metrics/leaderboard.json: breakfast-tray randomized held-out results, smoke baselines, and headline lifts (see Metrics).
- artifacts/metrics/randomized_policy_report.json: per-policy breakdown for the 80-episode held-out run.
- artifacts/plots/training_curve.png
- artifacts/replays/baseline_random.gif
- artifacts/replays/dense_trained.gif
Build a browser report from those artifacts:
.venv/bin/python scripts/build_visual_report.py
.venv/bin/python scripts/build_side_by_side.py
open artifacts/visual_report/index.html

Live Gemini for score_progress: set GEMINI_API_KEY or GOOGLE_API_KEY, and
optionally GEMINI_MODEL (default gemini-2.5-flash). The OpenReward server and
run_demo sample VLM call use the API whenever a key is present. Set
ROBOCEREBRA_FORCE_SYMBOLIC_VLM=1 to force the deterministic symbolic scorer (e.g. CI, offline
repro) even if a key exists.
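
The selection rule described above can be sketched as a small function. This is an illustration of the documented behavior, not the repository's code: the function name select_scorer and the returned dict shape are hypothetical, and checking GEMINI_API_KEY before GOOGLE_API_KEY is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_GEMINI_MODEL = "gemini-2.5-flash"  # default named in this README


def select_scorer(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick the score_progress backend from environment variables (illustrative)."""
    env = os.environ if env is None else env
    # Assumption: GEMINI_API_KEY is consulted before GOOGLE_API_KEY.
    api_key = env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")
    # The override wins even when a key is present.
    force_symbolic = env.get("ROBOCEREBRA_FORCE_SYMBOLIC_VLM") == "1"
    if force_symbolic or not api_key:
        return {"backend": "symbolic", "model": None}
    return {
        "backend": "gemini",
        "model": env.get("GEMINI_MODEL", DEFAULT_GEMINI_MODEL),
    }


if __name__ == "__main__":
    print(select_scorer())
```

Passing a plain dict instead of os.environ makes the rule easy to check in CI without touching real credentials.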
OpenReward Environment
The environment class is robocerebra_rl.env.RoboCerebraRewardLabEnv.
Run a local OpenReward-compatible server:
.venv/bin/python -m robocerebra_rl.env

This is an API server, not a browser UI. See
docs/openreward_deploy.md for the correct session-header flow and hosted
OpenReward deployment commands.
Core tools:
- observe: returns current task state plus a rendered frame.
- choose_subgoal: records the agent's intended semantic subgoal.
- execute_skill: advances the world by one macro-action and returns dense reward.
- score_progress: returns cached Gemini-style progress scoring.
- submit_done: terminates the episode with final success reward.
Benchmark Task — Long-Horizon Shift Mode
Each OpenReward task is a full hospitality shift: a chain of 12–30 jobs
(breakfast tray, spill recovery, countertop cleanup) running on persistent
shift state — inventory, memory, clock, ticket queue, deterministic
non-stationary events — until the agent can submit a verified shift
summary. This is what gives us hundreds-to-thousands of tool calls per
episode.
| Split | Shifts | Jobs / shift | Events / shift | Median expert tool calls |
|---|---|---|---|---|
| train | 76 | 12 | 3 | ~634 |
| validation | 16 | 22 | 6 | ~1189 |
| test | 16 | 30 | 9 | ~1660 |
(test shifts can grow past 1700 calls when scheduled spill events insert
extra recovery jobs.)
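
As a quick sanity check on the table, the median call counts can be divided by the nominal job counts to get a rough per-job tool budget. The numbers below are copied from the table; the SPLITS dict and calls_per_job helper are illustrative names, and the ratio is only approximate because spill events can insert extra jobs.

```python
# Values copied from the split table above (median calls are approximate).
SPLITS = {
    "train": {"shifts": 76, "jobs": 12, "events": 3, "median_calls": 634},
    "validation": {"shifts": 16, "jobs": 22, "events": 6, "median_calls": 1189},
    "test": {"shifts": 16, "jobs": 30, "events": 9, "median_calls": 1660},
}


def calls_per_job(split: str) -> float:
    """Median expert tool calls divided by the nominal jobs per shift."""
    row = SPLITS[split]
    return row["median_calls"] / row["jobs"]


if __name__ == "__main__":
    for name in SPLITS:
        print(f"{name}: {calls_per_job(name):.1f} calls per job")
```

The ratio stays in the low-to-mid 50s across splits, so the growth in episode length comes mainly from more jobs per shift rather than from costlier jobs.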
Tool surface (18 tools):
- Per-job loop: observe, choose_subgoal, execute_skill, score_progress.
- Plan & memory: read_ticket, plan_create, plan_revise, memory_write, memory_read, memory_search, memory_summarize.
- Resource & time: inventory_check, inventory_consume, inventory_restock, clock_get.
- Disturbance lifecycle: acknowledge_event, log_job, submit_done.
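
For agent code that needs to reason about which family a tool belongs to, the grouping above can be written down as a plain mapping. The tool names come from this README; the TOOL_GROUPS dict, its keys, and the group_of helper are hypothetical names for illustration and are not part of the environment's API.

```python
# Tool names as listed in this README, grouped the same way.
TOOL_GROUPS = {
    "per_job_loop": ["observe", "choose_subgoal", "execute_skill", "score_progress"],
    "plan_and_memory": [
        "read_ticket", "plan_create", "plan_revise",
        "memory_write", "memory_read", "memory_search", "memory_summarize",
    ],
    "resource_and_time": [
        "inventory_check", "inventory_consume", "inventory_restock", "clock_get",
    ],
    "disturbance_lifecycle": ["acknowledge_event", "log_job", "submit_done"],
}

ALL_TOOLS = [tool for group in TOOL_GROUPS.values() for tool in group]


def group_of(tool: str) -> str:
    """Return the group a tool belongs to, or raise KeyError if unknown."""
    for group, tools in TOOL_GROUPS.items():
        if tool in tools:
            return group
    raise KeyError(tool)


if __name__ == "__main__":
    print(len(ALL_TOOLS), "tools in", len(TOOL_GROUPS), "groups")
```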
Why this is hard but tractable:
- Long horizon: 1500–1700 expert tool calls on test; the reactive baseline pushes ~1130 calls and still fails because it never acknowledges events or summarizes memory.
- Capability tangent: success requires real planning, recall via memory_search, inventory restocking on stockouts, and plan_revise on VIP / time-pressure events.
- Solvable: the deterministic per-seed expert oracle solves every shift with 100% success within budget, proving tractability.
Metrics
There are two evaluation surfaces; use the one that matches what you are reporting.
1. Single-task breakfast tray (80-episode randomized held-out). Produced by
scripts/run_demo.py. Read artifacts/metrics/leaderboard.json under
randomized_heldout and headline, or artifacts/metrics/randomized_policy_report.json.
Report (per policy): success_rate (with success_rate_ci95 if you show intervals),
mean_progress, disturbance_recovery_rate (with disturbance_recovery_ci95),
mean_reward (symbolic dense reward), mean_ticks, mean_tool_calls, and
episodes (80). Regime label: randomized_heldout (80 episodes per policy).
2. Shift Mode (hosted OpenReward environment). Full hospitality shifts with
the 18-tool surface. Use scripts/benchmark_openreward.py against
vikkash/complex_worlds_hack (or your fork). Output JSON includes split,
episodes, success_rate, success_rate_ci95, mean_reward, mean_tool_calls,
and related shift aggregates. Headline on test: expert 100% success @ ~1660
median calls vs reactive 0% success @ ~1130 calls — same long-horizon work,
only the capability-rich policy wins.
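
This README does not say how success_rate_ci95 is computed. As one common choice for small episode counts, a Wilson score interval can be sketched as below; treat the method and the wilson_ci95 name as assumptions, not as the repository's implementation.

```python
import math


def wilson_ci95(successes: int, episodes: int) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= successes <= episodes:
        raise ValueError("successes must be between 0 and episodes")
    z = 1.959963984540054  # two-sided 95% normal quantile
    p = successes / episodes
    denom = 1 + z * z / episodes
    center = (p + z * z / (2 * episodes)) / denom
    half = z * math.sqrt(
        p * (1 - p) / episodes + z * z / (4 * episodes ** 2)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # 16 test shifts, as in the split table; all-success and all-failure cases.
    print(wilson_ci95(16, 16))
    print(wilson_ci95(0, 16))
```

Unlike the plain normal approximation, this interval does not collapse to zero width at 0% or 100% success, which matters when a split has only 16 shifts.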
artifacts/metrics/leaderboard.json produced by run_demo is breakfast-only
(the structure above). Shift-mode numbers are not in that file; they come
from scripts/benchmark_openreward.py (see Hosted benchmark bundle), with
per-policy expert, reactive_script, and random including:
- success_rate: gated on completing every job, acknowledging every scheduled event, and a passing memory_summarize.
- mean_tool_calls, p50_tool_calls, min_tool_calls / max_tool_calls (hosted JSON from benchmark_openreward.py).
- mean_events_handled, mean_memory_recalls, mean_inventory_restocks, mean_score_progress_calls, mean_tool_diversity.
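
The three-part success gate described in the list above can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the ShiftOutcome dataclass and its field names are hypothetical and do not mirror the environment's internal state.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ShiftOutcome:
    """Hypothetical end-of-shift record; field names are illustrative."""
    jobs_total: int
    jobs_completed: int
    events_scheduled: int
    events_acknowledged: int
    summary_passed: bool  # whether memory_summarize passed verification


def shift_succeeded(outcome: ShiftOutcome) -> bool:
    """All three gates must hold for the shift to count as a success."""
    return (
        outcome.jobs_completed >= outcome.jobs_total
        and outcome.events_acknowledged >= outcome.events_scheduled
        and outcome.summary_passed
    )


def success_rate(outcomes: Iterable[ShiftOutcome]) -> float:
    """Fraction of shifts that pass every gate."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to aggregate")
    return sum(shift_succeeded(o) for o in outcomes) / len(outcomes)
```

Under this gate, a policy that finishes all 30 jobs but never acknowledges an event or never produces a passing summary still scores zero, which matches the reactive baseline's failure mode described earlier.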
OpenReward leaderboard (site)
When you add a leaderboard entry on openreward.ai for this environment:
- Split: Use test for hosted benchmark_openreward.py results. For the 80-episode breakfast held-out numbers from run_demo, the platform may only offer overall; that is fine if you name the regime in the metrics (below).
- Metrics: Add one name/value row per scalar. Prefer these names:
  - Always: success_rate, mean_reward, episodes, mean_tool_calls.
  - Breakfast held-out: also mean_progress, mean_ticks, disturbance_recovery_rate, and optional success_rate_ci95_low / success_rate_ci95_high (and the same for recovery) if the form accepts them.
  - Optional context row: regime=randomized_heldout for breakfast numbers from run_demo; for hosted shift runs, put split=test (or your split) in the form’s Split field and episodes in metrics.
- Provenance: Link the OSF preprint (osf.io/h8rnv) in the form’s Paper field where available, and note commit SHA, whether score_progress used live Gemini or symbolic fallback, and the exact command.
Do not use a generic Accuracy field unless the UI requires a single primary
column; use success_rate for the main success metric.
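
The row-naming convention above can be sketched as a helper that turns one metrics dict into name/value rows. The metric names are taken from this README; the leaderboard_rows function is hypothetical, and storing each *_ci95 value as a (low, high) pair is an assumption about the JSON layout.

```python
from typing import Any, Mapping, Optional

ALWAYS = ("success_rate", "mean_reward", "episodes", "mean_tool_calls")
BREAKFAST_EXTRA = ("mean_progress", "mean_ticks", "disturbance_recovery_rate")
CI_BASES = ("success_rate", "disturbance_recovery")


def leaderboard_rows(
    metrics: Mapping[str, Any], regime: Optional[str] = None
) -> list[tuple[str, Any]]:
    """Flatten a metrics dict into (name, value) rows for the site form."""
    rows = [(name, metrics[name]) for name in ALWAYS]
    rows += [(name, metrics[name]) for name in BREAKFAST_EXTRA if name in metrics]
    for base in CI_BASES:
        # Assumption: intervals are stored as a (low, high) pair.
        interval = metrics.get(f"{base}_ci95")
        if interval is not None:
            low, high = interval
            rows.append((f"{base}_ci95_low", low))
            rows.append((f"{base}_ci95_high", high))
    if regime is not None:
        rows.append(("regime", regime))
    return rows
```

For breakfast numbers, call it with regime="randomized_heldout"; for hosted shift runs, leave regime unset and enter the split in the form's Split field instead.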
Two-minute pitch
“Post-SWE frontier agents need environments that test long-horizon physical
reasoning, not just code tasks. RoboCerebra Reward Lab exposes a manipulation
workflow as an OpenReward environment with semantic tools, dense VLM-style
progress rewards, and verifiable rollouts. Dense rewards improve sample
efficiency over sparse success-only feedback, producing metric curves and replay
videos that show recovery from non-stationary disturbances.” Full setup,
headline tables, and reporting conventions are in the OSF paper.
Stretch Path
Use Brev/Isaac only after the core metrics are generated. The safe stretch is a
visual replay, not training. If Isaac setup takes more than 45 minutes, preserve
the benchmark story and submit the OpenReward environment plus metrics.
For the cloud visualization path and Isaac Sim core replay, see
docs/brev_visualization.md. The Isaac script consumes the same JSONL traces
used by the OpenReward evaluation so the 3D replay is tied to actual tool calls.
Limitations
This MVP is a macro-level physical-AI benchmark slice, not full continuous robot
control. The purpose is to demonstrate a verifiable long-horizon evaluation and
dense reward strategy that can later be connected to real RoboCerebra/LIBERO
rollouts.
License
Released under the MIT License.
Hosted benchmark bundle
When comparing two scripted policies through OpenReward, pass
--compare-policy to scripts/benchmark_openreward.py. Besides the per-policy
JSON files, it also writes artifacts/openreward/submission_benchmark_summary.json
with both metric dicts and the lift summary for judge-facing write-ups. Copy
values from the metrics object in that file (or the per-policy output JSON)
into the site leaderboard form as described under OpenReward leaderboard (site)
above.