KellyBench
Description
KellyBench is an open-ended, non-stationary environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a simulated market for a full English Premier League season and tasked with maximising long-term bankroll growth. To succeed, they must develop machine learning models from historical data, identify edge relative to public betting markets, manage risk, and adapt as the season unfolds.
The benchmark is named after the Kelly criterion (Kelly, 1956), which links predictive calibration to the long-run geometric growth rate of capital.
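For a single bet at decimal odds $o$ with estimated win probability $p$, the Kelly criterion stakes the fraction $f^* = (p(o-1) - (1-p))/(o-1)$ of the bankroll. A minimal sketch (the function name and the fractional-Kelly `scale` parameter are illustrative, not part of the environment's API):

```python
def kelly_fraction(p: float, decimal_odds: float, scale: float = 1.0) -> float:
    """Fraction of bankroll to stake on a single bet.

    p            -- estimated probability the bet wins
    decimal_odds -- bookmaker decimal odds (payout per unit staked, stake included)
    scale        -- fractional-Kelly multiplier (e.g. 0.25 for quarter Kelly)

    Returns 0.0 when there is no positive edge (p * decimal_odds <= 1).
    """
    b = decimal_odds - 1.0        # net winnings per unit staked
    edge = p * b - (1.0 - p)      # expected net profit per unit staked
    if b <= 0 or edge <= 0:
        return 0.0
    return scale * edge / b
```

For example, `kelly_fraction(0.5, 2.2)` returns roughly 0.083, i.e. stake about 8.3% of bankroll; in practice agents typically use a fractional scale to dampen the impact of miscalibrated probabilities.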
Capabilities
- Developing machine learning models for sports prediction
- Backtesting models against real public betting markets
- Kelly-style bankroll management and staking
- Iterative model development and in-season adaptation
- Long-horizon multi-turn execution (500–900 tool calls per episode)
- Coherent closed-loop reasoning over a full season
Compute Requirements
Agents in KellyBench are given a sandbox with 4 CPUs and 16GB of RAM, preinstalled with a standard Python data-science stack (NumPy, pandas, scikit-learn).
License
Open-access endpoint (please consult authors if you want to run privately).
Tasks
There are three training tasks in this environment:
- New Millennium: the agent bets for the 2000/2001 season (£100 starting bankroll, 97 matchdays).
- Post-Crash: the agent bets for the 2010/2011 season (£150 starting bankroll, 105 matchdays).
- Covid Season: the agent bets for the 2020/2021 season (£200 starting bankroll, 148 matchdays).
And one test task:
- Recent Season: the agent bets for the 2023/2024 season (£220 starting bankroll, 120 matchdays).
Each task lasts the entire season and concludes after the final matchday. On each matchday, the agent observes the day's fixtures with closing bookmaker odds, develops or updates its model, places one or more bets, and advances to the next matchday. The agent can bet on match results (home/draw/away) or total goals (over/under 2.5). To prevent the agent from trivially opting out, the environment requires at least one bet (amount > 0) per matchday; penny bets are allowed as a capital-preservation strategy.
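Because at least one bet per matchday is mandatory, a capital-preserving policy needs a fallback stake. A hedged sketch of one possible decision rule (the function name, the quarter-Kelly default, and the £0.01 minimum stake are illustrative assumptions, not environment constants):

```python
def choose_stake(p: float, decimal_odds: float, bankroll: float,
                 kelly_scale: float = 0.25, min_stake: float = 0.01) -> float:
    """Stake for one candidate bet: fractional Kelly when the model sees an
    edge, otherwise a token 'penny bet' to satisfy the one-bet-per-matchday
    rule while preserving capital."""
    b = decimal_odds - 1.0
    edge = p * b - (1.0 - p)
    if b > 0 and edge > 0:
        stake = kelly_scale * (edge / b) * bankroll
        return max(stake, min_stake)
    return min_stake  # no edge: minimum allowed wager
```

With a £100 bankroll, odds of 2.2 and an estimated win probability of 0.5, this stakes about £2.08; with no edge it falls back to the minimum wager.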
Reward Structure
This is a dense, fully verifiable reward environment. After each matchday $t$, the reward is the change in log-wealth:

$$r_t = \log W_t - \log W_{t-1} = \log \frac{W_t}{W_{t-1}}$$

where $W_{t-1}$ is the bankroll at the start of the matchday and $W_t$ the bankroll after settlement. The cumulative reward over a season of $T$ matchdays is therefore $\sum_{t=1}^{T} r_t = \log(W_T / W_0)$, i.e. the log-ratio of final to initial wealth. Rewards are computed deterministically from real match outcomes and bookmaker odds, so no LLM grader is required.
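The reward accounting can be checked in a few lines: per-matchday log-wealth changes telescope to the log-ratio of final to initial bankroll (the example trajectory below is illustrative):

```python
import math

def matchday_reward(w_before: float, w_after: float) -> float:
    """Per-matchday reward: change in log-wealth."""
    return math.log(w_after) - math.log(w_before)

# Example bankroll trajectory across three matchdays (hypothetical values).
bankrolls = [100.0, 112.0, 95.0, 130.0]
rewards = [matchday_reward(a, b) for a, b in zip(bankrolls, bankrolls[1:])]

# The season's cumulative reward equals log(final / initial wealth).
assert abs(sum(rewards) - math.log(bankrolls[-1] / bankrolls[0])) < 1e-12
```

A consequence of log-wealth rewards is that ruin is unboundedly costly: a bankroll approaching zero drives the cumulative reward toward negative infinity, which is exactly the behaviour Kelly staking is designed to avoid.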
Data
Agents have access to two categories of historical data, with disclosure progressing one matchday at a time to mirror the information structure of a real quantitative bettor:
- Match-level data: a longitudinal dataset of English Premier League matches from the 1993/94 season onwards. Each record contains the date, teams, and full-time result, with coverage broadening over time: half-time scores from 1995/96; shot counts, fouls, corners, cards and referee from 2000/01; and pre-kickoff decimal odds from multiple bookmakers (1X2, over/under, Asian handicap) from 2002/03 onwards. Ground-truth odds used by the environment are closing (pre-match, near kickoff).
- Player-level data: per-match player statistics for major European leagues and cups from 2008 onwards (Premier League, Championship, La Liga, Serie A, Bundesliga, Ligue 1, domestic cups and UEFA club competitions), including lineups, goals, assists, minutes, shots, cards, tackles, interceptions, expected goals (where available), and player information (age, height, position).
After each matchday, the environment delivers the latest results and player statistics for the matches just completed, which the agent can incorporate into subsequent model iterations.
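Because data is disclosed one matchday at a time, any model an agent trains must use only matches completed before the current matchday. A minimal pandas sketch of this anti-leakage filter (the function name and the `date` column are assumptions, not the environment's actual schema):

```python
import pandas as pd

def training_frame(matches: pd.DataFrame, current_date: str) -> pd.DataFrame:
    """Return only matches played strictly before `current_date`, so that
    model training never sees future results (no look-ahead leakage)."""
    matches = matches.copy()
    matches["date"] = pd.to_datetime(matches["date"])
    return matches[matches["date"] < pd.Timestamp(current_date)]
```

The same cutoff should be applied to player-level statistics before joining them onto match records, otherwise post-match data can leak into features.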
Tools
The agent interacts with KellyBench through two categories of tools.
Environment tools (4):
- `view_matches` – displays the current matchday's fixtures and bookmaker odds
- `place_bet` – places a wager on a specified match, market, and stake
- `view_bankroll` – reports current balance and outstanding stakes
- `next_matchday` – settles all bets, delivers results, downloads updated data, and advances to the next matchday
CLI tools (7): bash, glob, grep, read, write, edit, todo_write. These mirror the Claude Code toolset exactly, allowing the agent to write Python scripts, train models, inspect data, and organise its workflow inside the sandbox.
Time Horizon
KellyBench is an open-ended, long-horizon environment simulating an entire season of model development and trading. Across the models evaluated in the KellyBench paper, episodes used 500–900 tool calls and 30–500 million tokens per seed, so effective context management and compaction are important. One seed of GPT-5.4 (xhigh) cost approximately $2,000 to complete, making KellyBench one of the more expensive benchmarks to run at frontier-model scale.
Environment Difficulty
KellyBench is unsaturated. In the KellyBench paper, every frontier model evaluated lost money on average across three seeds on the 2023/24 Recent Season task, with many experiencing ruin:
| Model | Mean ROI | Best Seed | Mean Final Bankroll | Avoided Ruin |
|---|---|---|---|---|
| Claude Opus 4.6 | −11.0% | +21.5% | £89,035 | Yes |
| GPT-5.4 | −13.6% | −4.1% | £86,365 | Yes |
| Gemini 3.1 Pro | −43.3% | +33.7% | £56,715 | No |
| Gemini 3.1 Flash | −58.4% | +24.7% | £41,605 | No |
| GLM-5 | −58.8% | −14.3% | £41,221 | No |
| Kimi K2.5 | −68.3% | −27.3% | £31,738 | No |
| Arcee Trinity | −84.2% | −52.7% | £15,773 | No |
| Grok 4.20 | −88.2% | −64.5% | £11,814 | No |
Starting bankrolls are normalised to £100,000 for display. Only 3 of 24 model seeds achieved a positive return on investment.
To judge strategy quality independently of backtest variance, the paper introduces a 44-point expert-curated sophistication rubric (covering features, staking, non-stationarity handling, and execution). All models scored under 50%, with Claude Opus 4.6 highest at 32.6%. Sophistication is positively correlated with ROI and with avoiding ruin, suggesting substantial headroom for improvement.
Common failure modes observed include:
- disconnect between reasoned strategy and executed code (Kelly staking discussed but never implemented)
- inability to handle newly promoted teams with limited historical data
- systematic miscalibration of draws and longshots
- absence of in-season model retraining or strategic pivots
- premature task termination
Other Environment Requirements
KellyBench requires no external API keys beyond the OpenReward endpoint.
Safety
Agents in KellyBench are told to maximise their long-run bankroll growth. The environment does not present direct safety risks, as agents only interact with an artificial backtest world through betting decisions against historical public odds. This contrasts with other money-maximising benchmarks, such as Vending-Bench-2, where agents have the opportunity to manipulate suppliers and other agents in the simulation.
There may be indirect risks, however, in that an agent trained to maximise long-run wealth may blindly follow this objective when tested in other environments, leading it to pursue unethical objectives. Our advice is that multi-environment training runs involving KellyBench should include other environments that teach agents to respect ethical norms so the agent understands a broader category of objectives than just maximising wealth.
Citations
@article{grady2026kellybench,
author = {Thomas Grady and Kip Parker and Iliyan Zarov and Henry Course and Anthony Hartshorn and Chengxi Taylor and Ross Taylor},
title = {KellyBench: Can Language Models Beat the Market?},
year = {2026},
institution = {General Reasoning, Inc.},
url = {https://openreward.ai/GeneralReasoning/KellyBench}
}