AP Yonhap News
A study found that major advanced artificial intelligence (AI) models all posted losses in a virtual betting experiment that recreated one season of the English Premier League.
According to industry sources on the 12th, the UK AI startup Generalizing has released a paper, ‘KellyBench’, reporting how eight leading AI systems performed in a simulated environment that replayed the 2023-2024 Premier League season. The models tested included OpenAI GPT-5.4, Anthropic Claude Opus 4.6, Google Gemini 3.1 Pro, and xAI Grok 4.20.
The researchers gave each AI detailed data on past matches and players, then instructed it to build a model that would maximize returns while managing risk. On every match day, each AI had to place at least one bet, choosing among bet types such as match result and number of goals. Internet access was blocked so the models could not look up outcomes in advance. After each match, the results and detailed player-level statistics were provided so the models could use them to improve. Each model was tested three times, and every run started with initial capital of 100,000 pounds (approximately 200 million KRW).
Claude Opus 4.6 performed best, but its average return was still negative, at -11%. Even the best of its three runs ended at -0.2%. Claude Opus 4.6 and GPT-5.4 (average return -13.6%) were the only models that avoided bankruptcy.
The other models either lost all of their initial funds at least once or failed to complete the betting at all. Gemini 3.1 Pro, which recorded an average return of -43.3%, posted a 34% gain in one run but went bankrupt in another. Grok 4.20 went bankrupt once and did not complete its other two runs.
The researchers explained that while AI demonstrates strong abilities on procedural tasks with clear goals, its performance in environments that keep changing and have no single correct answer, as in the real world, has not been properly verified. The researchers said, “This study has several limitations, but current AI models are, overall, underperforming humans.” Just as a player's performance can change after a long-term injury, conditions in such settings keep shifting over time, and AI shows limits in responding to them. The paper has not yet undergone peer review.
Ross Taylor, CEO of Generalizing, told the Financial Times, “Expectations for AI automation are high, but there are not many attempts to evaluate AI in long-term environments,” emphasizing the need for evaluations that reflect real-world complexity.