KellyBench

⭐ OpenReward Environment

Description

KellyBench is an open-ended, non-stationary environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a simulated market for a full English Premier League season and tasked with maximising long-term bankroll growth. To succeed, they must develop machine learning models from historical data, identify edge relative to public betting markets, manage risk, and adapt as the season unfolds.

The benchmark is named after the Kelly criterion (Kelly, 1956), which links predictive calibration to the long-run geometric growth rate of capital.
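For a single bet at decimal odds d with estimated win probability p, the Kelly fraction is f* = (pd − 1)/(d − 1), i.e. edge divided by net odds. A minimal sketch (the environment does not prescribe any particular staking rule; this is just the textbook formula):

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll for a binary bet.

    p            -- estimated probability of winning
    decimal_odds -- total payout per unit staked (stake included), e.g. 2.50
    """
    b = decimal_odds - 1.0            # net odds received on a win
    f = (p * decimal_odds - 1.0) / b  # edge divided by net odds
    return max(f, 0.0)                # never stake on a negative edge

# A 55% chance at even money (d = 2.0) suggests staking 10% of bankroll
print(round(kelly_fraction(0.55, 2.0), 4))  # 0.1
```

Full Kelly maximises expected log-wealth growth but is volatile in practice; fractional Kelly (e.g. half-Kelly) is a common risk-reduction choice.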

Capabilities

  • Developing machine learning models for sports prediction
  • Backtesting models against real public betting markets
  • Kelly-style bankroll management and staking
  • Iterative model development and in-season adaptation
  • Long-horizon multi-turn execution (500–900 tool calls per episode)
  • Coherent closed-loop reasoning over a full season

Compute Requirements

Agents in KellyBench are given a sandbox with 4 CPUs and 16GB of RAM, preinstalled with a standard Python data-science stack (NumPy, pandas, scikit-learn).

License

Open-access endpoint (please consult the authors if you want to run it privately).

Tasks

There are three training tasks in this environment:

  • New Millennium: the agent bets for the 2000/2001 season (£100 starting bankroll, 97 matchdays).
  • Post-Crash: the agent bets for the 2010/2011 season (£150 starting bankroll, 105 matchdays).
  • Covid Season: the agent bets for the 2020/2021 season (£200 starting bankroll, 148 matchdays).

And one test task:

  • Recent Season: the agent bets for the 2023/2024 season (£220 starting bankroll, 120 matchdays).

Each task lasts the entire season and concludes after the final matchday. On each matchday, the agent observes the day's fixtures with closing bookmaker odds, develops or updates its model, places one or more bets, and advances to the next matchday. The agent can bet on match results (home/draw/away) or total goals (over/under 2.5). To prevent the agent from trivially opting out, the environment requires at least one bet (amount > 0) per matchday; penny bets are allowed as a capital-preservation strategy.
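One way to satisfy the at-least-one-bet rule while preserving capital is to clip the stake to the £0.01 minimum whenever the model sees no edge. A sketch (the half-Kelly shrinkage factor is an illustrative assumption, not part of the environment):

```python
MIN_STAKE = 0.01  # smallest bet the environment accepts per matchday

def choose_stake(bankroll: float, kelly_f: float, shrink: float = 0.5) -> float:
    """Fractional-Kelly stake in pounds, floored at the minimum bet."""
    stake = round(bankroll * kelly_f * shrink, 2)
    return max(stake, MIN_STAKE)

print(choose_stake(100.0, 0.10))  # 5.0  -> half-Kelly on a 10% Kelly fraction
print(choose_stake(100.0, 0.00))  # 0.01 -> penny bet when there is no edge
```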

Reward Structure

This is a dense, fully verifiable reward environment. After each matchday t, the reward is the change in log-wealth:

r_t = \log W_{t+1} - \log W_t

where W_t is the bankroll at the start of the matchday and W_{t+1} the bankroll after settlement. The cumulative reward over a season is therefore \log(W_{T+1} / W_1), i.e. the log-ratio of final to initial wealth. Rewards are computed deterministically from real match outcomes and bookmaker odds, so no LLM grader is required.
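Because each per-matchday reward is a log-increment of wealth, the rewards telescope: the season total depends only on the final and initial bankroll, not on the path between them. A quick numerical check:

```python
import math

# Illustrative bankroll trajectory W_1 .. W_4 (not real data)
bankrolls = [100.0, 112.0, 95.0, 130.0]

# Per-matchday rewards r_t = log(W_{t+1}) - log(W_t)
rewards = [math.log(w1 / w0) for w0, w1 in zip(bankrolls, bankrolls[1:])]

# The sum telescopes to log(final / initial)
total = sum(rewards)
print(abs(total - math.log(bankrolls[-1] / bankrolls[0])) < 1e-12)  # True
```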

Data

Agents have access to two categories of historical data, with disclosure progressing one matchday at a time to mirror the information structure of a real quantitative bettor:

  • Match-level data: a longitudinal dataset of English Premier League matches from the 1993/94 season onwards. Each record contains the date, teams, and full-time result, with coverage broadening over time: half-time scores from 1995/96; shot counts, fouls, corners, cards and referee from 2000/01; and pre-kickoff decimal odds from multiple bookmakers (1X2, over/under, Asian handicap) from 2002/03 onwards. Ground-truth odds used by the environment are closing (pre-match, near kickoff).
  • Player-level data: per-match player statistics for major European leagues and cups from 2008 onwards (Premier League, Championship, La Liga, Serie A, Bundesliga, Ligue 1, domestic cups and UEFA club competitions), including lineups, goals, assists, minutes, shots, cards, tackles, interceptions, expected goals (where available), and player information (age, height, position).

After each matchday, the environment delivers the latest results and player statistics for the matches just completed, which the agent can incorporate into subsequent model iterations.
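When fitting models on the longitudinal match data, the main pitfall is lookahead: training on results the environment has not yet disclosed. A sketch with pandas (the column names here are hypothetical; the real schema may differ):

```python
import pandas as pd

# Toy stand-in for the match-level dataset; hypothetical column names
matches = pd.DataFrame({
    "date": pd.to_datetime(["2023-08-12", "2023-08-19", "2023-08-26"]),
    "home": ["Arsenal", "Spurs", "Chelsea"],
    "away": ["Forest", "Man Utd", "Luton"],
    "ftr":  ["H", "D", "H"],  # full-time result
})

def training_slice(df: pd.DataFrame, matchday_date: str) -> pd.DataFrame:
    """Keep only matches completed strictly before the current matchday,
    mirroring the environment's one-matchday-at-a-time disclosure."""
    return df[df["date"] < pd.Timestamp(matchday_date)]

print(len(training_slice(matches, "2023-08-26")))  # 2
```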

Tools

The agent interacts with KellyBench through two categories of tools.

Environment tools (4):

  • view_matches – displays the current matchday's fixtures and bookmaker odds
  • place_bet – places a wager on a specified match, market, and stake
  • view_bankroll – reports current balance and outstanding stakes
  • next_matchday – settles all bets, delivers results, downloads updated data, and advances to the next matchday

CLI tools (7): bash, glob, grep, read, write, edit, todo_write. These mirror the Claude Code toolset exactly, allowing the agent to write Python scripts, train models, inspect data, and organise its workflow inside the sandbox.
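The environment tools compose into a simple per-matchday loop. The sketch below stubs the tools with in-memory stand-ins, since the real call signatures are not documented here; the edge test and flat 5% staking rule are placeholders, not recommendations:

```python
class StubTools:
    """In-memory stand-in for the environment tools (real signatures differ)."""
    def __init__(self):
        self.bets = []
    def view_matches(self):
        return [{"id": 1, "odds_home": 2.4}, {"id": 2, "odds_home": 1.5}]
    def view_bankroll(self):
        return 100.0
    def place_bet(self, match, market, stake):
        self.bets.append((match, market, stake))
    def next_matchday(self):
        pass  # would settle bets, deliver results, and advance

def run_matchday(tools, predict):
    """One turn: observe fixtures, bet where the model sees edge, advance."""
    bankroll = tools.view_bankroll()
    for fx in tools.view_matches():
        p = predict(fx)                         # model's home-win probability
        if p * fx["odds_home"] > 1.0:           # positive expected value only
            stake = max(0.01, 0.05 * bankroll)  # placeholder staking rule
            tools.place_bet(match=fx["id"], market="home", stake=stake)
    tools.next_matchday()

tools = StubTools()
run_matchday(tools, lambda fx: 0.5)
print(tools.bets)  # [(1, 'home', 5.0)]
```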

Time Horizon

KellyBench is an open-ended, long-horizon environment simulating an entire season of model development and trading. Across the models evaluated in the KellyBench paper, episodes used 500–900 tool calls and 30–500 million tokens per seed, so effective context management and compaction are important. One seed of GPT-5.4 (xhigh) cost approximately $2,000 to complete, making KellyBench one of the more expensive benchmarks to run at frontier-model scale.

Environment Difficulty

KellyBench is unsaturated. In the KellyBench paper, every frontier model evaluated lost money on average across three seeds on the 2023/24 Recent Season task, with many experiencing ruin:

Model               Mean ROI   Best Seed   Mean Final Bankroll   Avoided Ruin
Claude Opus 4.6     −11.0%     +21.5%      £89,035               Yes
GPT-5.4             −13.6%     −4.1%       £86,365               Yes
Gemini 3.1 Pro      −43.3%     +33.7%      £56,715               No
Gemini 3.1 Flash    −58.4%     +24.7%      £41,605               No
GLM-5               −58.8%     −14.3%      £41,221               No
Kimi K2.5           −68.3%     −27.3%      £31,738               No
Arcee Trinity       −84.2%     −52.7%      £15,773               No
Grok 4.20           −88.2%     −64.5%      £11,814               No

Starting bankrolls are normalised to £100,000 for display. Only 3 of 24 model seeds achieved a positive return on investment.

To judge strategy quality independently of backtest variance, the paper introduces a 44-point expert-curated sophistication rubric (covering features, staking, non-stationarity handling, and execution). All models scored under 50%, with Claude Opus 4.6 highest at 32.6%. Sophistication is positively correlated with ROI and with avoiding ruin, suggesting substantial headroom for improvement.

Common failure modes observed include: a disconnect between reasoned strategy and executed code (Kelly staking discussed but not implemented), inability to handle newly promoted teams with limited historical data, systematic miscalibration of draws and longshots, absence of in-season model retraining or strategic pivots, and premature task termination.

Other Environment Requirements

KellyBench requires no external API keys beyond the OpenReward endpoint.

Safety

Agents in KellyBench are told to maximise their long-run bankroll growth. The environment does not present direct safety risks, as agents only interact with an artificial backtest world through betting decisions against historical public odds. This contrasts with other money-maximising benchmarks, such as Vending-Bench-2, where agents have the opportunity to manipulate suppliers and other agents in the simulation.

There may be indirect risks, however, in that an agent trained to maximise long-run wealth may blindly follow this objective when tested in other environments, leading it to pursue unethical objectives. Our advice is that multi-environment training runs involving KellyBench should include other environments that teach agents to respect ethical norms so the agent understands a broader category of objectives than just maximising wealth.

Citations

@misc{grady2026kellybench,
  author = {Thomas Grady and Kip Parker and Iliyan Zarov and Henry Course and Anthony Hartshorn and Chengxi Taylor and Ross Taylor},
  title  = {KellyBench: Can Language Models Beat the Market?},
  year   = {2026},
  note   = {General Reasoning, Inc.},
  url    = {https://openreward.ai/GeneralReasoning/KellyBench}
}