GravityBench
Description
GravityBench is an environment for evaluating AI agents on gravitational physics discovery tasks, based on the Gravity-Bench-v1 benchmark (ICML 2025). Agents must iteratively observe a simulated two-body gravitational system within a fixed observation budget, analyze the resulting position data with code, and solve physics problems ranging from orbital mechanics to modified-gravity detection.
Capabilities
- Planning observation strategies within a limited budget (100 observations total)
- Analyzing time-series position data for binary star systems
- Deriving physical quantities (orbital period, mass, eccentricity, etc.) from observations
- Detecting non-standard physics (modified gravity power laws, linear drag coefficients)
- Scientific reasoning and code-based data analysis
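As an illustration of the modified-gravity capability above, here is a minimal sketch (on synthetic data, not the environment's actual observations) of how a power-law exponent can be recovered: if acceleration scales as a ∝ r^(-n), a linear fit in log-log space yields n.

```python
import numpy as np

# Hypothetical sketch: recovering a non-Newtonian exponent n from
# (separation, acceleration) samples. Data here is synthetic.
rng = np.random.default_rng(0)
n_true = 2.3                       # assumed modified-gravity exponent
r = rng.uniform(1.0, 5.0, 200)     # separations (arbitrary units)
a = 4.0 * r ** (-n_true)           # accelerations with a proportional to r^-n

# Fit log(a) = -n * log(r) + const; the slope gives -n
slope, _ = np.polyfit(np.log(r), np.log(a), 1)
n_est = -slope
print(round(n_est, 3))             # recovers ~2.3
```

In the real tasks the agent would first have to estimate accelerations numerically from the observed position time series, which is where the observation-budget planning comes in.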
Compute Requirements
Agents in GravityBench are given a sandbox with 0.5 CPUs and 1 GB of RAM, running a Python data-science image (numpy, pandas, scipy, sklearn, statsmodels).
License
MIT License, following the original GravityBench repository.
Tasks
There is one split (test) with 206 tasks. Each task presents a binary star system scenario where the agent must determine a physical quantity or property. Tasks span 50+ physics scenarios including:
- Orbital mechanics: period, semi-major axis, eccentricity, apoastron, periastron
- Stellar properties: mass determination, mass ratios, reduced mass
- Dynamics: velocities, accelerations, angular velocities, momentum
- Energy: kinetic + potential energy, virial theorem
- Advanced: Roche lobe radius, modified gravity power law, linear drag, Kepler's 3rd law
- Classification: bound/unbound system determination
Some tasks use non-standard unit systems (years/AU or CGS) and include out-of-distribution physics (non-Newtonian gravity) to test genuine scientific generalization.
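For the Kepler's-3rd-law style tasks listed above, the underlying relation is M_total = 4π²a³ / (G P²). A quick sanity check in SI units (using standard physical constants, not environment data):

```python
import math

# Kepler's third law: M_total = 4*pi^2 * a^3 / (G * P^2), SI units.
G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
a = 1.496e11           # semi-major axis: 1 AU in metres
P = 3.156e7            # orbital period: 1 year in seconds

M_total = 4 * math.pi**2 * a**3 / (G * P**2)
print(f"{M_total:.3e}")  # ~1.99e30 kg, one solar mass
```

Tasks posed in years/AU or CGS require the agent to carry out the analogous unit conversions itself.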
Reward Structure
This is a sparse, deterministic reward environment with binary scoring. The agent calls submit_answer once with its answer:
- For numeric answers: reward = 1.0 if percent error <= task-specific threshold, else 0.0
- For boolean answers (e.g., is the system bound?): reward = 1.0 if exact match, else 0.0
Thresholds vary by task (5% to 70%), reflecting the difficulty of each physics problem under observation budget constraints.
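The scoring rule above can be sketched as follows; the function name and signature here are illustrative, not the environment's actual API:

```python
# Minimal sketch of the binary scoring rule: full credit within a
# task-specific percent-error threshold, nothing otherwise.
def grade(answer, truth, threshold):
    if isinstance(truth, bool):
        return 1.0 if answer == truth else 0.0
    percent_error = abs(answer - truth) / abs(truth) * 100.0
    return 1.0 if percent_error <= threshold else 0.0

print(grade(1.02, 1.0, 5.0))   # 2% error at a 5% threshold -> 1.0
print(grade(1.2, 1.0, 5.0))    # 20% error at a 5% threshold -> 0.0
print(grade(False, True, 5.0)) # boolean mismatch -> 0.0
```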
We do not use LLM graders in this environment.
Data
Task data is sourced from the GravityBench HuggingFace dataset. Each task contains simulation data generated using the REBOUND N-body integrator, with time-series position measurements for both stars in a binary system.
Tools
Agents are given access to CLI tools (bash, read, write, edit, glob, grep, ls, multi_edit, todo_write) and two environment-specific tools:
- observe: Request position observations at specific times within the simulation window. Limited to 10 observations per request and 100 total. Returns cubic spline-interpolated positions for both stars.
- submit_answer: Submit a numeric or boolean answer for grading against ground truth with a task-specific percent error threshold.
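A typical workflow with these tools is to request observations, then analyze them in Python. The sketch below uses synthetic data in an assumed (t, x) layout to show one way an agent might extract an orbital period, via the dominant frequency of a relative coordinate:

```python
import numpy as np

# Hypothetical analysis step: given times t and the relative x-separation
# of the two stars (here synthetic, standing in for observe-tool output),
# estimate the orbital period from the dominant Fourier frequency.
P_true = 7.5
t = np.linspace(0.0, 30.0, 100)                 # 100-observation budget
x_rel = 2.0 * np.cos(2 * np.pi * t / P_true)    # relative x-coordinate

freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
power = np.abs(np.fft.rfft(x_rel - x_rel.mean()))
P_est = 1.0 / freqs[np.argmax(power)]
print(round(P_est, 2))  # close to P_true, limited by the frequency grid
```

In practice the agent must also choose *when* to observe: a uniform grid like the one above spends the whole budget up front, whereas adaptive sampling (e.g. densifying near periastron) can resolve eccentric orbits better.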
Time Horizon
GravityBench is a multi-turn environment. Agents iteratively observe the system, write and run Python analysis code via bash, and submit an answer.
Other Environment Requirements
There are no further environment requirements; GravityBench works out of the box with the OpenReward endpoint without any additional secrets.
Safety
Agents in GravityBench interact only with physics simulation data within a sandboxed environment. The environment does not present direct safety risks.
Citations
@inproceedings{koblischke2025gravitybench,
  title={Gravity-Bench-v1: A Benchmark for AI Discovery of Physics from Observations},
  author={Koblischke, Nolan and Jang, Hyunseok and Menou, Kristen and Ali-Dib, Mohamad},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025},
  url={https://arxiv.org/abs/2501.18411}
}