KellyBench

⭐ OpenReward Environment

Description

KellyBench is an open-ended, non-stationary environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a simulated market for a full English Premier League season and tasked with maximising long-term bankroll growth. To succeed, they must develop machine learning models from historical data, identify edge relative to public betting markets, manage risk, and adapt as the season unfolds.

The benchmark is named after the Kelly criterion (Kelly, 1956), which links predictive calibration to the long-run geometric growth rate of capital.
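For a single bet at decimal odds d with estimated win probability p, the Kelly fraction is f* = (pd − 1)/(d − 1), i.e. edge divided by net odds. A minimal sketch (the environment does not prescribe any particular staking rule; this is just the textbook formula):

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll for a binary bet.

    p            -- estimated probability of winning
    decimal_odds -- total payout per unit staked (stake included), e.g. 2.50
    """
    b = decimal_odds - 1.0            # net odds received on a win
    f = (p * decimal_odds - 1.0) / b  # edge divided by net odds
    return max(f, 0.0)                # never stake on a negative edge

# A 55% chance at even money (d = 2.0) suggests staking 10% of bankroll
print(round(kelly_fraction(0.55, 2.0), 4))  # 0.1
```

Full Kelly maximises expected log-wealth growth but is volatile in practice; fractional Kelly (e.g. half-Kelly) is a common risk-reduction choice.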

Capabilities

  • Developing machine learning models for sports prediction
  • Backtesting models against real public betting markets
  • Kelly-style bankroll management and staking
  • Iterative model development and in-season adaptation
  • Long-horizon multi-turn execution (500–900 tool calls per episode)
  • Coherent closed-loop reasoning over a full season

Compute Requirements

Agents in KellyBench are given a sandbox with 4 CPUs and 16GB of RAM, preinstalled with a standard Python data-science stack (NumPy, pandas, scikit-learn).

License

Open-access endpoint (please consult the authors if you want to run it privately).

Tasks

There are three training tasks in this environment:

  • New Millennium: the agent bets for the 2000/2001 season (£100 starting bankroll, 97 matchdays).
  • Post-Crash: the agent bets for the 2010/2011 season (£150 starting bankroll, 105 matchdays).
  • Covid Season: the agent bets for the 2020/2021 season (£200 starting bankroll, 148 matchdays).

And one test task:

  • Recent Season: the agent bets for the 2023/2024 season (£220 starting bankroll, 120 matchdays).

Each task lasts the entire season and concludes after the final matchday. On each matchday, the agent observes the day's fixtures with closing bookmaker odds, develops or updates its model, places one or more bets, and advances to the next matchday. The agent can bet on match results (home/draw/away) or total goals (over/under 2.5). To prevent the agent from trivially opting out, the environment requires at least one bet (amount > 0) per matchday; penny bets are allowed as a capital-preservation strategy.
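One way to satisfy the at-least-one-bet rule while preserving capital is to clip the stake to the £0.01 minimum whenever the model sees no edge. A sketch (the half-Kelly shrinkage factor is an illustrative assumption, not part of the environment):

```python
MIN_STAKE = 0.01  # smallest bet the environment accepts per matchday

def choose_stake(bankroll: float, kelly_f: float, shrink: float = 0.5) -> float:
    """Fractional-Kelly stake in pounds, floored at the minimum bet."""
    stake = round(bankroll * kelly_f * shrink, 2)
    return max(stake, MIN_STAKE)

print(choose_stake(100.0, 0.10))  # 5.0  -> half-Kelly on a 10% Kelly fraction
print(choose_stake(100.0, 0.00))  # 0.01 -> penny bet when there is no edge
```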

Reward Structure

This is a dense, fully verifiable reward environment. After each matchday t, the reward is the change in log-wealth:

r_t = \log W_{t+1} - \log W_t

where W_t is the bankroll at the start of the matchday and W_{t+1} the bankroll after settlement. The cumulative reward over a season is therefore \log(W_{T+1} / W_1), i.e. the log-ratio of final to initial wealth. Rewards are computed deterministically from real match outcomes and bookmaker odds, so no LLM grader is required.
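Because each per-matchday reward is a log-increment of wealth, the rewards telescope: the season total depends only on the final and initial bankroll, not on the path between them. A quick numerical check:

```python
import math

# Illustrative bankroll trajectory W_1 .. W_4 (not real data)
bankrolls = [100.0, 112.0, 95.0, 130.0]

# Per-matchday rewards r_t = log(W_{t+1}) - log(W_t)
rewards = [math.log(w1 / w0) for w0, w1 in zip(bankrolls, bankrolls[1:])]

# The sum telescopes to log(final / initial)
total = sum(rewards)
print(abs(total - math.log(bankrolls[-1] / bankrolls[0])) < 1e-12)  # True
```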

Data

Agents have access to two categories of historical data, with disclosure progressing one matchday at a time to mirror the information structure of a real quantitative bettor:

  • Match-level data: a longitudinal dataset of English Premier League matches from the 1993/94 season onwards. Each record contains the date, teams, and full-time result, with coverage broadening over time: half-time scores from 1995/96; shot counts, fouls, corners, cards and referee from 2000/01; and pre-kickoff decimal odds from multiple bookmakers (1X2, over/under, Asian handicap) from 2002/03 onwards. Ground-truth odds used by the environment are closing (pre-match, near kickoff).
  • Player-level data: per-match player statistics for major European leagues and cups from 2008 onwards (Premier League, Championship, La Liga, Serie A, Bundesliga, Ligue 1, domestic cups and UEFA club competitions), including lineups, goals, assists, minutes, shots, cards, tackles, interceptions, expected goals (where available), and player information (age, height, position).

After each matchday, the environment delivers the latest results and player statistics for the matches just completed, which the agent can incorporate into subsequent model iterations.
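When fitting models on the longitudinal match data, the main pitfall is lookahead: training on results the environment has not yet disclosed. A sketch with pandas (the column names here are hypothetical; the real schema may differ):

```python
import pandas as pd

# Toy stand-in for the match-level dataset; hypothetical column names
matches = pd.DataFrame({
    "date": pd.to_datetime(["2023-08-12", "2023-08-19", "2023-08-26"]),
    "home": ["Arsenal", "Spurs", "Chelsea"],
    "away": ["Forest", "Man Utd", "Luton"],
    "ftr":  ["H", "D", "H"],  # full-time result
})

def training_slice(df: pd.DataFrame, matchday_date: str) -> pd.DataFrame:
    """Keep only matches completed strictly before the current matchday,
    mirroring the environment's one-matchday-at-a-time disclosure."""
    return df[df["date"] < pd.Timestamp(matchday_date)]

print(len(training_slice(matches, "2023-08-26")))  # 2
```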

Tools

The agent interacts with KellyBench through two categories of tools.

Environment tools (4):

  • view_matches – displays the current matchday's fixtures and bookmaker odds
  • place_bet – places a wager on a specified match, market, and stake
  • view_bankroll – reports current balance and outstanding stakes
  • next_matchday – settles all bets, delivers results, downloads updated data, and advances to the next matchday

CLI tools (7): bash, glob, grep, read, write, edit, todo_write. These mirror the Claude Code toolset exactly, allowing the agent to write Python scripts, train models, inspect data, and organise its workflow inside the sandbox.
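The environment tools compose into a simple per-matchday loop. The sketch below stubs the tools with in-memory stand-ins, since the real call signatures are not documented here; the edge test and flat 5% staking rule are placeholders, not recommendations:

```python
class StubTools:
    """In-memory stand-in for the environment tools (real signatures differ)."""
    def __init__(self):
        self.bets = []
    def view_matches(self):
        return [{"id": 1, "odds_home": 2.4}, {"id": 2, "odds_home": 1.5}]
    def view_bankroll(self):
        return 100.0
    def place_bet(self, match, market, stake):
        self.bets.append((match, market, stake))
    def next_matchday(self):
        pass  # would settle bets, deliver results, and advance

def run_matchday(tools, predict):
    """One turn: observe fixtures, bet where the model sees edge, advance."""
    bankroll = tools.view_bankroll()
    for fx in tools.view_matches():
        p = predict(fx)                         # model's home-win probability
        if p * fx["odds_home"] > 1.0:           # positive expected value only
            stake = max(0.01, 0.05 * bankroll)  # placeholder staking rule
            tools.place_bet(match=fx["id"], market="home", stake=stake)
    tools.next_matchday()

tools = StubTools()
run_matchday(tools, lambda fx: 0.5)
print(tools.bets)  # [(1, 'home', 5.0)]
```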

Time Horizon

KellyBench is an open-ended, long-horizon environment simulating an entire season of model development and trading. Across the models evaluated in the KellyBench paper, episodes used 500–900 tool calls and 30–500 million tokens per seed, so effective context management and compaction are important. One seed of GPT-5.4 (xhigh) cost approximately $2,000 to complete, making KellyBench one of the more expensive benchmarks to run at frontier-model scale.

Environment Difficulty

KellyBench is unsaturated. In the KellyBench paper, every frontier model evaluated lost money on average across three seeds on the 2023/24 Recent Season task, with many experiencing ruin:

Model               Mean ROI   Best Seed   Mean Final Bankroll   Avoided Ruin
Claude Opus 4.6     −11.0%     +21.5%      £89,035               Yes
GPT-5.4             −13.6%     −4.1%       £86,365               Yes
Gemini 3.1 Pro      −43.3%     +33.7%      £56,715               No
Gemini 3.1 Flash    −58.4%     +24.7%      £41,605               No
GLM-5               −58.8%     −14.3%      £41,221               No
Kimi K2.5           −68.3%     −27.3%      £31,738               No
Arcee Trinity       −84.2%     −52.7%      £15,773               No
Grok 4.20           −88.2%     −64.5%      £11,814               No

Starting bankrolls are normalised to £100,000 for display. Only 3 of 24 model seeds achieved a positive return on investment.

To judge strategy quality independently of backtest variance, the paper introduces a 44-point expert-curated sophistication rubric (covering features, staking, non-stationarity handling, and execution). All models scored under 50%, with Claude Opus 4.6 highest at 32.6%. Sophistication is positively correlated with ROI and with avoiding ruin, suggesting substantial headroom for improvement.

Common failure modes observed include: a disconnect between reasoned strategy and executed code (Kelly staking discussed but not implemented), inability to handle newly promoted teams with limited historical data, systematic miscalibration of draws and longshots, absence of in-season model retraining or strategic pivots, and premature task termination.

Other Environment Requirements

KellyBench requires no external API keys beyond the OpenReward endpoint.

Safety

Agents in KellyBench are told to maximise their long-run bankroll growth. The environment does not present direct safety risks, as agents only interact with an artificial backtest world through betting decisions against historical public odds. This contrasts with other money-maximising benchmarks, such as Vending-Bench-2, where agents have the opportunity to manipulate suppliers and other agents in the simulation.

There may be indirect risks, however, in that an agent trained to maximise long-run wealth may blindly follow this objective when tested in other environments, leading it to pursue unethical objectives. Our advice is that multi-environment training runs involving KellyBench should include other environments that teach agents to respect ethical norms so the agent understands a broader category of objectives than just maximising wealth.

Citations

@misc{grady2026kellybench,
  author = {Thomas Grady and Kip Parker and Iliyan Zarov and Henry Course and Anthony Hartshorn and Chengxi Taylor and Ross Taylor},
  title  = {KellyBench: Can Language Models Beat the Market?},
  year   = {2026},
  note   = {General Reasoning, Inc.},
  url    = {https://openreward.ai/GeneralReasoning/KellyBench}
}