ForesightFlow

Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents

Maksym Nechepurenko, Pavel Shuvalov · 2026 · Working Paper

Abstract

Evaluating the true forecasting ability of AI agents requires environments that are resistant to overfitting, free from centralized trust assumptions, and grounded in incentive-compatible scoring. Existing benchmarks either rely on static datasets susceptible to training-data contamination, or measure trading profit-and-loss (PnL) — a metric that conflates predictive accuracy with market-timing skill, position sizing, and risk appetite. We introduce Foresight Arena, the first permissionless, on-chain benchmark for evaluating AI forecasting agents on real-world prediction markets. Agents submit probabilistic forecasts on binary markets sourced from Polymarket via a commit-reveal protocol enforced by Solidity smart contracts on Polygon PoS. Outcomes are resolved trustlessly through the Gnosis Conditional Token Framework (CTF), eliminating reliance on any centralized arbiter.
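The commit-reveal flow described above can be illustrated with a minimal off-chain sketch. This is not the benchmark's contract code. The function names, the basis-point encoding of forecasts, and the use of SHA-256 are all assumptions made for illustration; a Solidity contract would more typically hash with keccak256.

```python
import hashlib
import secrets


def commit(forecasts_bps, salt):
    """Return a hex commitment to a list of forecasts plus a secret salt.

    forecasts_bps: probabilities in basis points (0..10000), one per market.
    This encoding is hypothetical, not the benchmark's on-chain format.
    """
    for p in forecasts_bps:
        if not 0 <= p <= 10000:
            raise ValueError("forecast must be between 0 and 10000 basis points")
    payload = b"".join(p.to_bytes(2, "big") for p in forecasts_bps) + salt
    # SHA-256 is a stand-in here; it is NOT the same function as keccak256.
    return hashlib.sha256(payload).hexdigest()


def verify_reveal(commitment, forecasts_bps, salt):
    """Check that revealed forecasts and salt reproduce the stored commitment."""
    return secrets.compare_digest(commitment, commit(forecasts_bps, salt))


# Commit phase: only the hash is published, so forecasts stay hidden.
salt = secrets.token_bytes(32)
forecasts = [6200, 1500, 8800]  # illustrative values for three binary markets
commitment = commit(forecasts, salt)

# Reveal phase: the agent discloses forecasts and salt; anyone can verify them.
print(verify_reveal(commitment, forecasts, salt))
print(verify_reveal(commitment, [6300, 1500, 8800], salt))
```

The salt prevents observers from brute-forcing the small space of possible forecasts during the commit phase, and the hash check prevents an agent from changing its forecasts after seeing how others forecast or how the market moves.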

Performance is measured by the Brier Score and a novel Alpha Score — proper scoring rules that incentivize honest probability reporting and isolate predictive edge over market consensus, respectively. We provide a formal analysis of the statistical properties of our scoring rules: we derive closed-form expressions for the variance of per-market Alpha, establish its connection to Murphy's classical decomposition of the Brier Score into uncertainty, reliability, and resolution components, and perform a power analysis characterizing the number of rounds required to reliably distinguish agents of different skill levels. Applying this framework, we show that detecting a true edge of α* = 0.02 over market consensus at 80% power requires approximately 350 resolved binary predictions (50 rounds of 7 markets each), while detecting the finer edge α* = 0.01 requires roughly four times as many rounds.
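The scoring and power-analysis ideas above can be sketched in a few lines. The abstract does not give the exact formula for the Alpha Score, so the definition below (market Brier score minus agent Brier score, per market) is an assumption, as are the function names, the one-sided normal-approximation test, and the per-prediction standard deviation `sigma = 0.15`. That value is hypothetical and chosen only so the output lands in the same order of magnitude as the figure quoted above; it is not taken from the paper.

```python
import math
from statistics import NormalDist


def brier(p, y):
    """Brier score of one binary forecast p against outcome y (0 or 1). Lower is better."""
    return (p - y) ** 2


def alpha(agent_p, market_p, y):
    """Assumed Alpha definition: positive when the agent beats the market price."""
    return brier(market_p, y) - brier(agent_p, y)


def required_predictions(edge, sigma, power=0.80, significance=0.05):
    """Predictions needed to detect a mean Alpha of `edge` (one-sided z-test).

    sigma is the assumed standard deviation of per-prediction Alpha.
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - significance) + z.inv_cdf(power)
    return math.ceil((z_total * sigma / edge) ** 2)


# The agent says 0.9, the market says 0.7, and the event happens.
print(alpha(0.9, 0.7, 1))

# Halving the detectable edge roughly quadruples the sample size.
n_coarse = required_predictions(0.02, sigma=0.15)
n_fine = required_predictions(0.01, sigma=0.15)
print(n_coarse, n_fine, n_fine / n_coarse)
```

Because the required sample size scales with the inverse square of the edge under this approximation, the factor of about four between the two edges holds whatever value of `sigma` is assumed.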

We complement these analytical results with a 50-round live evaluation of five frontier LLM agents (claude-opus-4-5, gpt-5-2, gemini-3-pro, grok-4-1, glm-4-7) plus a random baseline. Murphy decomposition of the observed scores distinguishes well-calibrated agents (low reliability error) from market-tracking agents that fail primarily through reduced resolution. On-chain reputation via ERC-8004 provides agents with a verifiable, persistent track record that cannot be falsified or selectively reported. All smart contracts, agent reference implementations, and evaluation infrastructure are released as open source at https://github.com/foresight-arena/contracts.
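For readers unfamiliar with Murphy's decomposition, the sketch below shows the classical identity: Brier score = reliability - resolution + uncertainty. It groups predictions by distinct forecast value, which makes the identity exact. The function name and the toy data are illustrative assumptions, not results or code from the paper, whose binning scheme may differ.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Forecasts are grouped by exact value, so
    mean Brier == reliability - resolution + uncertainty.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        groups[p].append(y)

    reliability = 0.0  # calibration error: forecast vs. observed frequency
    resolution = 0.0   # how far group frequencies move away from the base rate
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)
        reliability += len(ys) * (p - freq) ** 2
        resolution += len(ys) * (freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Toy data, made up for illustration only.
forecasts = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]

rel, res, unc = murphy_decomposition(forecasts, outcomes)
mean_brier = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)
print(rel, res, unc)
print(abs(mean_brier - (rel - res + unc)) < 1e-12)
```

In these terms, a well-calibrated agent has a small reliability term. An agent that merely tracks a coarse consensus can also have low reliability error yet score poorly, because its resolution term is small.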

Cite this work

@misc{nechepurenko2026foresightarena,
  title  = {Foresight Arena: An On-Chain Benchmark for Evaluating AI Forecasting Agents},
  author = {Nechepurenko, Maksym and Shuvalov, Pavel},
  year   = {2026},
  url    = {https://papers.ssrn.com/abstract=6674059},
  note   = {SSRN Working Paper 6674059}
}