# VolForecast

## Description
VolForecast is an environment for evaluating language model agents on financial volatility forecasting tasks. Agents analyze historical market data and develop forecasting strategies to predict daily realized volatility (squared log returns) across different economic periods and market conditions, one trading day at a time over a 252-day horizon.
## Capabilities
- Analyzing historical price data and market indicators
- Developing volatility forecasting models and strategies
- Sequential decision-making over a 252-step forecast horizon
- Iterating on predictions based on feedback from prior days
- Long-horizon multi-turn execution
## Compute Requirements
Agents in VolForecast are given a sandbox with 0.5 CPUs and 1GB RAM, network access enabled, and a Python 3.12 data science image.
## Tasks
There are 130 tasks across 4 economic scenarios and 2 splits.
| Split | Scenarios | Tickers | Tasks |
|---|---|---|---|
| train | Pre-Crisis, Financial Crisis, Recovery | 40 | 120 |
| test | Covid-Era | 10 | 10 |
Scenarios:
- Pre-Crisis (train): Historical data 2005-2007, forecasting from 2008
- Financial Crisis (train): Historical data 2006-2008, forecasting from 2009
- Recovery (train): Historical data 2008-2015, forecasting from 2016
- Covid-Era (test): Historical data 2018-2021, forecasting from 2022
Each scenario contains multiple tickers (individual stocks), resulting in one task per ticker per scenario.
Each task requires the agent to make 252 sequential daily volatility predictions (one year of trading days). The agent receives historical price data and market indicators, and must submit one prediction at a time via `make_prediction`. After each prediction, the agent receives the actual realized volatility and a reward signal comparing its prediction to a RiskMetrics-style baseline.
## Reward Structure
This is a dense, verifiable reward environment. After each daily prediction, the reward is the difference between the baseline's combined loss and the agent's combined loss:

$$r_t = L_t^{\text{baseline}} - L_t^{\text{agent}}$$

where the combined loss is the average of Mean Squared Error (MSE) and QLIKE (Quasi-Maximum Likelihood) loss:

$$L_t = \frac{1}{2}\left[(\hat{y}_t - y_t)^2 + \left(\frac{y_t}{\hat{y}_t} - \ln\frac{y_t}{\hat{y}_t} - 1\right)\right]$$

where $y_t$ is the realized volatility on day $t$ and $\hat{y}_t$ is the agent's prediction.
Positive reward means the agent outperformed the baseline; negative means it underperformed. The final reward returned at task completion is the cumulative sum of all 252 daily rewards.
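As a concrete sketch, the per-day reward could be computed as follows. This assumes the standard QLIKE form $y/\hat{y} - \ln(y/\hat{y}) - 1$; the environment's exact variant may differ, and all function names here are illustrative.

```python
import math

def qlike(pred, actual):
    # Standard QLIKE loss (assumed variant): minimized, at 0, when pred == actual.
    ratio = actual / pred
    return ratio - math.log(ratio) - 1.0

def combined_loss(pred, actual):
    # Average of squared error and QLIKE, as described above.
    mse = (pred - actual) ** 2
    return 0.5 * (mse + qlike(pred, actual))

def daily_reward(agent_pred, baseline_pred, actual):
    # Positive when the agent's combined loss is below the baseline's.
    return combined_loss(baseline_pred, actual) - combined_loss(agent_pred, actual)
```

Both loss components are zero for a perfect forecast, so a perfect agent earns exactly the baseline's combined loss as its reward each day.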
We do not use LLM graders in this environment.
## Data
Agents are provided with historical price data and market indicators mounted in the sandbox at `/tmp/gr-datasets/`. Data includes daily closing prices from which the agent must compute realized volatility (squared log returns). The baseline is an exponentially weighted moving average (EWMA) model with λ = 0.94, the standard RiskMetrics setting.
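The two quantities above can be sketched in a few lines: realized volatility as squared log returns of closing prices, and the EWMA recursion used by the baseline. Function names are illustrative, not part of the environment's API.

```python
import math

def realized_vol(prices):
    # Daily realized volatility: squared log return between consecutive closes.
    return [math.log(p1 / p0) ** 2 for p0, p1 in zip(prices, prices[1:])]

def ewma_forecast(sq_returns, lam=0.94):
    # RiskMetrics-style EWMA variance recursion:
    #   sigma2_{t+1} = lam * sigma2_t + (1 - lam) * r_t^2
    sigma2 = sq_returns[0]
    for r2 in sq_returns[1:]:
        sigma2 = lam * sigma2 + (1.0 - lam) * r2
    return sigma2
```

With λ = 0.94, yesterday's variance estimate carries 94% of the weight, so the baseline reacts slowly to volatility spikes, which is where an agent can look for an edge.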
## Tools
Agents are given access to CLI tools for creating, viewing, and searching a filesystem (`bash`, `glob`, `grep`, `read`, `write`, `edit`, `multi_edit`, `todo_write`). They are also given one environment-specific tool:
- `make_prediction`: Submit a single volatility prediction for the current forecast day. Returns the actual realized volatility, the reward versus the baseline, and the cumulative reward.
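The resulting interaction pattern can be sketched as a predict-observe-update loop. Everything below is hypothetical scaffolding: `run_episode` and `toy_tool` are illustrative names, and the real `make_prediction` tool is provided by the environment, not called as a Python function.

```python
def run_episode(make_prediction, initial_forecast, lam=0.94, horizon=252):
    # One prediction per trading day; each call returns the realized
    # volatility, which feeds an EWMA-style update of the next forecast.
    forecast, cumulative = initial_forecast, 0.0
    for _ in range(horizon):
        actual, reward = make_prediction(forecast)
        cumulative += reward
        forecast = lam * forecast + (1.0 - lam) * actual
    return cumulative

def toy_tool(pred, actual=2e-4):
    # Toy stand-in for the environment tool: constant true volatility,
    # reward is negated squared error so better forecasts earn more.
    return actual, -(pred - actual) ** 2

total = run_episode(toy_tool, initial_forecast=1e-4, horizon=10)
```

In the toy setup, an agent that starts at the true volatility accumulates zero penalty, while one that starts off-target converges toward it and accumulates a small negative total.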
## Time Horizon
VolForecast is a long-horizon environment. Each task requires exactly 252 sequential predictions (one per trading day). The agent makes one prediction per tool call, receiving feedback after each.
[Statistics on average tool calls here]
## Environment Difficulty
[Statistics on environment difficulty here]
## Other Environment Requirements
There are no further environment requirements beyond the OpenReward platform.
## Safety
Agents in VolForecast are told to minimize forecasting loss relative to a baseline model. The environment does not present direct safety risks, as agents only interact with historical financial data within a sandboxed environment. No real-world financial transactions are involved.
## Citations

```bibtex
@dataset{GRVolForecast,
  author    = {General Reasoning Inc. Team},
  title     = {VolForecast},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/VolForecast}
}
```