# VolForecast

## Description
VolForecast is an environment for evaluating language model agents on financial volatility forecasting tasks. Agents analyze historical market data and develop forecasting strategies to predict daily realized volatility (squared log returns) across different economic periods and market conditions, one trading day at a time over a 252-day horizon.
## Capabilities
- Analyzing historical price data and market indicators
- Developing volatility forecasting models and strategies
- Sequential decision-making over a 252-step forecast horizon
- Iterating on predictions based on feedback from prior days
- Long-horizon multi-turn execution
## Compute Requirements
Agents in VolForecast are given a sandbox with 0.5 CPUs and 1GB RAM, network access enabled, and a Python 3.12 data science image.
## Tasks
There are 130 tasks across 4 economic scenarios and 2 splits.
| Split | Scenarios | Tickers | Tasks |
|---|---|---|---|
| train | Pre-Crisis, Financial Crisis, Recovery | 40 | 120 |
| test | Covid-Era | 10 | 10 |
Scenarios:
- Pre-Crisis (train): Historical data 2005-2007, forecasting from 2008
- Financial Crisis (train): Historical data 2006-2008, forecasting from 2009
- Recovery (train): Historical data 2008-2015, forecasting from 2016
- Covid-Era (test): Historical data 2018-2021, forecasting from 2022
Each scenario contains multiple tickers (individual stocks), resulting in one task per ticker per scenario.
Each task requires the agent to make 252 sequential daily volatility predictions (one year of trading days). The agent receives historical price data and market indicators, and must submit one prediction at a time via `make_prediction`. After each prediction, the agent receives the actual realized volatility and a reward signal comparing its prediction to a RiskMetrics-style baseline.
## Reward Structure
This is a dense, verifiable reward environment. After each daily prediction, the reward is the difference between the baseline's combined loss and the agent's combined loss:

$$r_t = L_t^{\text{baseline}} - L_t^{\text{agent}}$$

where the combined loss is the average of Mean Squared Error (MSE) and QLIKE (Quasi-Maximum Likelihood) loss:

$$L_t = \frac{1}{2}\left[(\hat{y}_t - y_t)^2 + \left(\frac{y_t}{\hat{y}_t} - \ln\frac{y_t}{\hat{y}_t} - 1\right)\right]$$

where $y_t$ is the realized volatility on day $t$ and $\hat{y}_t$ is the agent's prediction.
Positive reward means the agent outperformed the baseline; negative means it underperformed. The final reward returned at task completion is the cumulative sum of all 252 daily rewards.
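As a concrete sketch, the per-day reward could be computed as follows. This assumes the standard QLIKE form $y/\hat{y} - \ln(y/\hat{y}) - 1$; the environment's exact variant may differ, and all function names here are illustrative.

```python
import math

def qlike(pred, actual):
    # Standard QLIKE loss (assumed variant): minimized, at 0, when pred == actual.
    ratio = actual / pred
    return ratio - math.log(ratio) - 1.0

def combined_loss(pred, actual):
    # Average of squared error and QLIKE, as described above.
    mse = (pred - actual) ** 2
    return 0.5 * (mse + qlike(pred, actual))

def daily_reward(agent_pred, baseline_pred, actual):
    # Positive when the agent's combined loss is below the baseline's.
    return combined_loss(baseline_pred, actual) - combined_loss(agent_pred, actual)
```

Both loss components are zero for a perfect forecast, so a perfect agent earns exactly the baseline's combined loss as its reward each day.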
We do not use LLM graders in this environment.
## Data
Agents are provided with historical price data and market indicators mounted in the sandbox at `/tmp/gr-datasets/`. Data includes daily closing prices from which the agent must compute realized volatility (squared log returns). The baseline is an exponentially weighted moving average (EWMA) model with λ = 0.94, the standard RiskMetrics setting.
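The two quantities above can be sketched in a few lines: realized volatility as squared log returns of closing prices, and the EWMA recursion used by the baseline. Function names are illustrative, not part of the environment's API.

```python
import math

def realized_vol(prices):
    # Daily realized volatility: squared log return between consecutive closes.
    return [math.log(p1 / p0) ** 2 for p0, p1 in zip(prices, prices[1:])]

def ewma_forecast(sq_returns, lam=0.94):
    # RiskMetrics-style EWMA variance recursion:
    #   sigma2_{t+1} = lam * sigma2_t + (1 - lam) * r_t^2
    sigma2 = sq_returns[0]
    for r2 in sq_returns[1:]:
        sigma2 = lam * sigma2 + (1.0 - lam) * r2
    return sigma2
```

With λ = 0.94, yesterday's variance estimate carries 94% of the weight, so the baseline reacts slowly to volatility spikes, which is where an agent can look for an edge.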
## Tools
Agents are given access to CLI tools for creating, viewing, and searching a filesystem (`bash`, `glob`, `grep`, `read`, `write`, `edit`, `multi_edit`, `todo_write`). They are also given one environment-specific tool:
- `make_prediction`: Submit a single volatility prediction for the current forecast day. Returns the actual realized volatility, the reward versus the baseline, and the cumulative reward.
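The resulting interaction pattern can be sketched as a predict-observe-update loop. Everything below is hypothetical scaffolding: `run_episode` and `toy_tool` are illustrative names, and the real `make_prediction` tool is provided by the environment, not called as a Python function.

```python
def run_episode(make_prediction, initial_forecast, lam=0.94, horizon=252):
    # One prediction per trading day; each call returns the realized
    # volatility, which feeds an EWMA-style update of the next forecast.
    forecast, cumulative = initial_forecast, 0.0
    for _ in range(horizon):
        actual, reward = make_prediction(forecast)
        cumulative += reward
        forecast = lam * forecast + (1.0 - lam) * actual
    return cumulative

def toy_tool(pred, actual=2e-4):
    # Toy stand-in for the environment tool: constant true volatility,
    # reward is negated squared error so better forecasts earn more.
    return actual, -(pred - actual) ** 2

total = run_episode(toy_tool, initial_forecast=1e-4, horizon=10)
```

In the toy setup, an agent that starts at the true volatility accumulates zero penalty, while one that starts off-target converges toward it and accumulates a small negative total.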
## Time Horizon
VolForecast is a long-horizon environment. Each task requires exactly 252 sequential predictions (one per trading day). The agent makes one prediction per tool call, receiving feedback after each.
[Statistics on average tool calls here]
## Environment Difficulty
[Statistics on environment difficulty here]
## Other Environment Requirements
There are no further environment requirements beyond the OpenReward platform.
## Safety
Agents in VolForecast are told to minimize forecasting loss relative to a baseline model. The environment does not present direct safety risks, as agents only interact with historical financial data within a sandboxed environment. No real-world financial transactions are involved.
## Citations

```bibtex
@dataset{GRVolForecast,
  author    = {General Reasoning Inc. Team},
  title     = {VolForecast},
  year      = {2026},
  publisher = {OpenReward},
  url       = {https://openreward.ai/GeneralReasoning/VolForecast}
}
```