GravityBench
Description
GravityBench is an environment for evaluating AI agents on gravitational physics discovery tasks, based on the Gravity-Bench-v1 benchmark (ICML 2025). Agents must iteratively observe a simulated two-body gravitational system within a fixed observation budget, analyze the resulting position data with code, and solve physics problems ranging from orbital mechanics to modified-gravity detection.
Capabilities
- Planning observation strategies within a limited budget (100 observations total)
- Analyzing time-series position data for binary star systems
- Deriving physical quantities (orbital period, mass, eccentricity, etc.) from observations
- Detecting non-standard physics (modified gravity power laws, linear drag coefficients)
- Scientific reasoning and code-based data analysis
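As an illustration of the modified-gravity capability above, here is a minimal sketch (on synthetic data, not the environment's actual observations) of how a power-law exponent can be recovered: if acceleration scales as a ∝ r^(-n), a linear fit in log-log space yields n.

```python
import numpy as np

# Hypothetical sketch: recovering a non-Newtonian exponent n from
# (separation, acceleration) samples. Data here is synthetic.
rng = np.random.default_rng(0)
n_true = 2.3                       # assumed modified-gravity exponent
r = rng.uniform(1.0, 5.0, 200)     # separations (arbitrary units)
a = 4.0 * r ** (-n_true)           # accelerations with a proportional to r^-n

# Fit log(a) = -n * log(r) + const; the slope gives -n
slope, _ = np.polyfit(np.log(r), np.log(a), 1)
n_est = -slope
print(round(n_est, 3))             # recovers ~2.3
```

In the real tasks the agent would first have to estimate accelerations numerically from the observed position time series, which is where the observation-budget planning comes in.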
Compute Requirements
Agents in GravityBench are given a sandbox with 0.5 CPUs and 1 GB of RAM, running a Python data-science image (numpy, pandas, scipy, sklearn, statsmodels).
License
MIT License, following the original GravityBench repository.
Tasks
There is one split (test) with 206 tasks. Each task presents a binary star system scenario where the agent must determine a physical quantity or property. Tasks span 50+ physics scenarios including:
- Orbital mechanics: period, semi-major axis, eccentricity, apoastron, periastron
- Stellar properties: mass determination, mass ratios, reduced mass
- Dynamics: velocities, accelerations, angular velocities, momentum
- Energy: kinetic + potential energy, virial theorem
- Advanced: Roche lobe radius, modified gravity power law, linear drag, Kepler's 3rd law
- Classification: bound/unbound system determination
Some tasks use non-standard unit systems (years/AU or CGS) and include out-of-distribution physics (non-Newtonian gravity) to test genuine scientific generalization.
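For the Kepler's-3rd-law style tasks listed above, the underlying relation is M_total = 4π²a³ / (G P²). A quick sanity check in SI units (using standard physical constants, not environment data):

```python
import math

# Kepler's third law: M_total = 4*pi^2 * a^3 / (G * P^2), SI units.
G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
a = 1.496e11           # semi-major axis: 1 AU in metres
P = 3.156e7            # orbital period: 1 year in seconds

M_total = 4 * math.pi**2 * a**3 / (G * P**2)
print(f"{M_total:.3e}")  # ~1.99e30 kg, one solar mass
```

Tasks posed in years/AU or CGS require the agent to carry out the analogous unit conversions itself.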
Reward Structure
This is a sparse, deterministic reward environment with binary scoring. The agent calls submit_answer once with its answer:
- For numeric answers: reward = 1.0 if percent error <= task-specific threshold, else 0.0
- For boolean answers (e.g., is the system bound?): reward = 1.0 if exact match, else 0.0
Thresholds vary by task (5% to 70%), reflecting the difficulty of each physics problem under observation budget constraints.
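The scoring rule above can be sketched as follows; the function name and signature here are illustrative, not the environment's actual API:

```python
# Minimal sketch of the binary scoring rule: full credit within a
# task-specific percent-error threshold, nothing otherwise.
def grade(answer, truth, threshold):
    if isinstance(truth, bool):
        return 1.0 if answer == truth else 0.0
    percent_error = abs(answer - truth) / abs(truth) * 100.0
    return 1.0 if percent_error <= threshold else 0.0

print(grade(1.02, 1.0, 5.0))   # 2% error at a 5% threshold -> 1.0
print(grade(1.2, 1.0, 5.0))    # 20% error at a 5% threshold -> 0.0
print(grade(False, True, 5.0)) # boolean mismatch -> 0.0
```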
We do not use LLM graders in this environment.
Data
Task data is sourced from the GravityBench HuggingFace dataset. Each task contains simulation data generated using the REBOUND N-body integrator, with time-series position measurements for both stars in a binary system.
Tools
Agents are given access to CLI tools (bash, read, write, edit, glob, grep, ls, multi_edit, todo_write) and two environment-specific tools:
- observe: Request position observations at specific times within the simulation window. Limited to 10 observations per request and 100 total. Returns cubic spline-interpolated positions for both stars.
- submit_answer: Submit a numeric or boolean answer for grading against ground truth with a task-specific percent error threshold.
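A typical workflow with these tools is to request observations, then analyze them in Python. The sketch below uses synthetic data in an assumed (t, x) layout to show one way an agent might extract an orbital period, via the dominant frequency of a relative coordinate:

```python
import numpy as np

# Hypothetical analysis step: given times t and the relative x-separation
# of the two stars (here synthetic, standing in for observe-tool output),
# estimate the orbital period from the dominant Fourier frequency.
P_true = 7.5
t = np.linspace(0.0, 30.0, 100)                 # 100-observation budget
x_rel = 2.0 * np.cos(2 * np.pi * t / P_true)    # relative x-coordinate

freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
power = np.abs(np.fft.rfft(x_rel - x_rel.mean()))
P_est = 1.0 / freqs[np.argmax(power)]
print(round(P_est, 2))  # close to P_true, limited by the frequency grid
```

In practice the agent must also choose *when* to observe: a uniform grid like the one above spends the whole budget up front, whereas adaptive sampling (e.g. densifying near periastron) can resolve eccentric orbits better.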
Time Horizon
GravityBench is a multi-turn environment. Agents iteratively observe the system, write and run Python analysis code via bash, and submit an answer.
Other Environment Requirements
There are no further environment requirements; GravityBench works out of the box with the OpenReward endpoint without any additional secrets.
Safety
Agents in GravityBench interact only with physics simulation data within a sandboxed environment. The environment does not present direct safety risks.
Citations
@inproceedings{koblischke2025gravitybench,
  title={Gravity-Bench-v1: A Benchmark for AI Discovery of Physics from Observations},
  author={Koblischke, Nolan and Jang, Hyunseok and Menou, Kristen and Ali-Dib, Mohamad},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025},
  url={https://arxiv.org/abs/2501.18411}
}