About

Aleatoric is a benchmark for probabilistic forecasting by language models. It evaluates how accurate and how well calibrated frontier models are when they predict real-world events that have not yet been decided. Each question carries a precise resolution criterion, is scored against the realized outcome with a kind-appropriate proper scoring rule, and is re-forecast on a fixed cadence as new information arrives.

The benchmark targets a capability that retrieval and recall benchmarks do not measure: open-ended reasoning under uncertainty over future events. A question’s answer is not present in training data and cannot be looked up. Strong scores require base-rate reasoning, structural evidence aggregation, calibrated uncertainty, and the ability to update beliefs in degrees rather than jumps.

Scope

Aleatoric supports five question kinds:

binary: a probability over a yes/no outcome.
numeric: a CDF over a continuous quantity, expressed as 21 quantile values on a fixed probability grid (0.02, 0.05, 0.10, …, 0.90, 0.95, 0.98).
date: a CDF over the date on which an event resolves, structured identically to numeric.
choice: an independent probability per listed option (the vector need not sum to 1; any residual is the implicit mass on "none of these listed options").
discrete: a probability vector over ordered bins.
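
As a rough illustration of the numeric and date format, the sketch below builds a 21-point probability grid and checks that a forecast's quantile values are non-decreasing. The interior spacing of 0.05 is an assumption (the grid above is only given by its endpoints and a few interior points), and the names PROB_GRID and is_valid_quantile_forecast are hypothetical, not part of the benchmark's code.

```python
import numpy as np

# ASSUMPTION: the elided interior of the grid is evenly spaced at 0.05,
# which yields exactly 21 points: 0.02, 0.05, 0.10, ..., 0.95, 0.98.
PROB_GRID = np.concatenate(([0.02], np.round(np.arange(1, 20) * 0.05, 2), [0.98]))


def is_valid_quantile_forecast(values):
    """Return True if `values` has one entry per grid point and never decreases.

    A CDF must be non-decreasing, so the quantile value at a higher
    probability level can never be smaller than the one before it.
    """
    v = np.asarray(values, dtype=float)
    if v.shape != PROB_GRID.shape:
        return False
    return bool(np.all(np.diff(v) >= 0.0))


# Hypothetical forecast: quantile values rising linearly from 100 to 300.
example_forecast = np.linspace(100.0, 300.0, PROB_GRID.size)
example_is_valid = is_valid_quantile_forecast(example_forecast)
```

A date forecast could use the same check after converting dates to a numeric axis such as days since a reference date; that conversion is likewise an assumption here.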

Questions are curated for decision relevance and genuine uncertainty. Each carries a resolution criterion, an optional resolution deadline, and an update cadence between 1 and 14 days.

Methodology

Each question is forecast by two cohorts in parallel.

Aleatoric a0.1

A deep-research ensemble built on a ReAct agent. Members include Claude Opus, Gemini Pro, GPT-5.5, and Grok. Each member follows a structured superforecaster workflow: operationalize the resolution criterion, anchor on a base rate from a historical reference class, gather structural evidence across causal branches, search for disconfirming evidence, optionally construct a causal Monte Carlo model of the underlying mechanism, then commit a forecast with a 90% credible interval and a written thesis. A chair model synthesizes the members into a single ensemble forecast.

Participant models

A wider set of frontier models running the same scaffold as solo agents (Claude, Gemini, GPT, Grok, Kimi, Qwen, and others). Each model is queried independently through OpenRouter and produces its own forecast. Participants form the leaderboard field.

Tools

Both cohorts have access to:

web_search and extract_page, with the temporal cutoff clamped to the present and prediction-market and forecast-aggregator domains filtered at the search layer.
Subagents for reference-class enumeration, Fermi estimation, dataset compilation, and structured numeric retrieval.
run_code, a Python sandbox with NumPy, SciPy, pandas, and sim (a probabilistic toolkit). sim provides samplers (beta-PERT, lognormal, triangular, Bernoulli, Poisson), a sensitivity tornado that computes the rank correlation between each driver variable and the outcome, Bayesian odds updating, and quantile output aligned to the forecast grid.
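
The kind of workflow the sandbox supports can be sketched with plain NumPy. The example below does not use sim, whose API is not documented here; beta_pert and rank_correlation are hypothetical stand-ins, the two drivers and their parameters are invented for illustration, and the 0.05 grid spacing is an assumption.

```python
import numpy as np

# ASSUMPTION: 21-point grid with 0.05 interior spacing.
PROB_GRID = np.concatenate(([0.02], np.round(np.arange(1, 20) * 0.05, 2), [0.98]))


def beta_pert(rng, low, mode, high, size, lam=4.0):
    """Sample a beta-PERT distribution via a rescaled Beta (standard lam = 4)."""
    alpha = 1.0 + lam * (mode - low) / (high - low)
    beta = 1.0 + lam * (high - mode) / (high - low)
    return low + (high - low) * rng.beta(alpha, beta, size=size)


def rank_correlation(x, y):
    """Spearman-style rank correlation (valid for continuous samples without ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


rng = np.random.default_rng(0)
n = 20_000

# Hypothetical causal model: outcome = demand * price.
demand = beta_pert(rng, low=80.0, mode=100.0, high=150.0, size=n)
price = rng.lognormal(mean=np.log(2.0), sigma=0.2, size=n)
outcome = demand * price

# "Tornado" input: which driver moves the outcome most, by rank correlation.
sensitivity = {
    "demand": rank_correlation(demand, outcome),
    "price": rank_correlation(price, outcome),
}

# Quantile output aligned to the forecast grid.
quantiles = np.quantile(outcome, PROB_GRID)
```

Sorting sensitivity by absolute value gives the ordering a tornado chart would display.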

Belief updates

On each due cycle, the agent reads its prior forecast, identifies what has materially changed since then, and updates in proportion to the strength of the new evidence. Updates are forward-looking: the agent forecasts the probability over the remaining window between the current date and the resolution deadline, not over events already known to have occurred.
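
One way to make a proportional update concrete for a binary question is the odds form of Bayes' rule. This is only an illustrative sketch: the text does not state that agents update this way, and update_probability and the likelihood ratio used below are hypothetical.

```python
def update_probability(prior_p, likelihood_ratio):
    """Odds-form Bayes update for a binary forecast.

    likelihood_ratio = P(evidence | yes) / P(evidence | no).
    A ratio of 1 leaves the forecast unchanged; ratios near 1 move it slightly.
    """
    if not 0.0 < prior_p < 1.0:
        raise ValueError("prior_p must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior_p / (1.0 - prior_p)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical cycle: prior of 0.30, new evidence judged twice as likely under "yes".
updated = update_probability(0.30, 2.0)
```

Weak evidence (a ratio close to 1) produces a small move, which matches the idea of updating in degrees rather than jumps.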

Scoring

Each forecast is evaluated against the realized outcome using a kind-appropriate variant of the Brier family of proper scoring rules. Proper rules incentivize honest probability reporting: the model minimizes its expected score by stating its true belief. All variants reduce to the standard Brier score on the binary case. Lower is better.
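
The properness claim can be checked numerically for the binary case. The sketch below assumes a forecaster whose true belief is 0.7 and scans candidate reports; expected_brier is a hypothetical helper name.

```python
import numpy as np


def expected_brier(report, belief):
    """Expected binary Brier score of `report` when the event occurs with prob `belief`."""
    return belief * (report - 1.0) ** 2 + (1.0 - belief) * report ** 2


belief = 0.7
reports = np.linspace(0.0, 1.0, 101)  # candidate reports in steps of 0.01
expected_scores = expected_brier(reports, belief)

# The report with the lowest expected score coincides with the true belief.
best_report = float(reports[np.argmin(expected_scores)])
```

Shading the report toward 0 or 1 raises the expected score, so there is no incentive to overstate or understate confidence.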

Binary

With predicted probability p ∈ [0, 1] and outcome y ∈ {0, 1}:

score = (p − y)²

Range [0, 1]. Brier (1950). Human superforecasters average approximately 0.085 on mixed binary questions (Tetlock and Gardner, 2015).
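
A minimal sketch of the binary score follows; the function name brier_binary is illustrative.

```python
def brier_binary(p, y):
    """Binary Brier score (p - y)^2 for probability p and outcome y in {0, 1}."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return (p - y) ** 2


confident_and_right = brier_binary(0.8, 1)  # small penalty
confident_and_wrong = brier_binary(0.8, 0)  # large penalty
uninformative = brier_binary(0.5, 1)        # 0.25 regardless of the outcome
```

The constant forecast of 0.5 always scores 0.25, which is a useful reference point when reading leaderboard values.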

Multiple choice

With probability vector p over N options and one-hot outcome y:

score = ½ · Σᵢ (pᵢ − yᵢ)²

Multinomial Brier, normalized to [0, 1].
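
The formula above can be sketched as follows. brier_multiclass is a hypothetical name, and the sketch covers only the case where one of the listed options is realized, since the handling of a "none of these" outcome is not specified here.

```python
import numpy as np


def brier_multiclass(p, outcome_index):
    """Half the squared distance between p and the one-hot outcome vector."""
    p = np.asarray(p, dtype=float)
    y = np.zeros_like(p)
    y[outcome_index] = 1.0
    return 0.5 * float(np.sum((p - y) ** 2))


three_way = brier_multiclass([0.6, 0.3, 0.1], outcome_index=0)

# With two options that sum to 1, the score equals the binary Brier score.
two_way = brier_multiclass([0.8, 0.2], outcome_index=0)
```

The factor of one half is what makes the two-option case agree with (p − y)² on a single probability.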

Ordered discrete bins

With probability vector p over K ordered bins, cumulative CDF Fₖ = Σⱼ≤ₖ pⱼ, and realized bin index y:

score = (1 / (K − 1)) · Σₖ (Fₖ − 1{y ≤ k})²

Ranked Probability Score (Epstein, 1969). Rewards proximity on ordinal outcomes: a near-miss on an adjacent bin scores substantially better than a miss across the range.
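
A short sketch of the score, using zero-based bin indices; ranked_probability_score is an illustrative name. The two example forecasts show the proximity effect: both miss the realized bin, but the adjacent miss scores much better.

```python
import numpy as np


def ranked_probability_score(p, outcome_index):
    """RPS over K ordered bins, normalized by K - 1 (zero-based outcome index)."""
    p = np.asarray(p, dtype=float)
    k = p.size
    forecast_cdf = np.cumsum(p)
    # Outcome CDF: 0 before the realized bin, 1 from the realized bin onward.
    outcome_cdf = (np.arange(k) >= outcome_index).astype(float)
    return float(np.sum((forecast_cdf - outcome_cdf) ** 2)) / (k - 1)


# Realized bin is the first of four.
adjacent_miss = ranked_probability_score([0.0, 1.0, 0.0, 0.0], outcome_index=0)
far_miss = ranked_probability_score([0.0, 0.0, 0.0, 1.0], outcome_index=0)

# With K = 2 the score equals the binary Brier score.
two_bins = ranked_probability_score([0.8, 0.2], outcome_index=0)
```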

Numeric and date

Forecasts are expressed as 21 quantile values aligned to the fixed probability grid. The implied CDF F is piecewise linear between knots, with F = 0 below the lowest quantile and F = 1 above the highest. The score is the Continuous Ranked Probability Score:

score = ∫ (F(x) − 1{x ≥ y})² dx

Computed in closed form from the piecewise-linear CDF, in the natural units of the predicted quantity. CRPS reduces to Brier on the binary case (Gneiting and Raftery, 2007).
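
One way to evaluate this integral exactly is to integrate the squared linear function on each segment, splitting the segment that contains the outcome. The sketch below is not the benchmark's implementation. It assumes the CDF is 0 below the lowest quantile value, linear between knots, and 1 above the highest, as described above, and it assumes 0.05 interior grid spacing; crps_piecewise_linear is a hypothetical name.

```python
import numpy as np

# ASSUMPTION: 21-point grid with 0.05 interior spacing.
PROB_GRID = np.concatenate(([0.02], np.round(np.arange(1, 20) * 0.05, 2), [0.98]))


def _segment(u, v, f_u, f_v, after_outcome):
    """Exact integral over [u, v] of F^2 (before y) or (1 - F)^2 (after y), F linear."""
    g0, g1 = (1.0 - f_u, 1.0 - f_v) if after_outcome else (f_u, f_v)
    return (v - u) * (g0 * g0 + g0 * g1 + g1 * g1) / 3.0


def crps_piecewise_linear(quantiles, probs, y):
    """CRPS of a piecewise-linear CDF with knots (quantiles[i], probs[i]) at outcome y."""
    q = np.asarray(quantiles, dtype=float)
    t = np.asarray(probs, dtype=float)
    # Outside the knots the integrand is exactly 1 between y and the nearest knot.
    total = max(q[0] - y, 0.0) + max(y - q[-1], 0.0)
    for a, b, f_a, f_b in zip(q[:-1], q[1:], t[:-1], t[1:]):
        if b <= a:
            continue  # zero-width segment contributes nothing
        if y <= a:
            total += _segment(a, b, f_a, f_b, after_outcome=True)
        elif y >= b:
            total += _segment(a, b, f_a, f_b, after_outcome=False)
        else:
            f_y = f_a + (f_b - f_a) * (y - a) / (b - a)  # CDF value at the outcome
            total += _segment(a, y, f_a, f_y, after_outcome=False)
            total += _segment(y, b, f_y, f_b, after_outcome=True)
    return float(total)


# Hypothetical forecast whose quantile values equal the probability levels.
score_inside = crps_piecewise_linear(PROB_GRID, PROB_GRID, y=0.52)
score_outside = crps_piecewise_linear(PROB_GRID, PROB_GRID, y=-1.0)
```

Because the result carries the units of the predicted quantity, scores on different questions are not comparable until they are normalized.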

Aggregation

To rank models across questions of heterogeneous kinds, each per-question score is normalized to [0, 1]. Binary, choice, and discrete scores are already bounded in that range. CRPS for numeric and date is normalized by the pooled q98 − q02 across all models’ forecasts on the question. The leaderboard ranks models by the mean normalized score across resolved questions, with ties broken by sample count.
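
The aggregation step might look like the sketch below. Two readings are assumptions rather than documented behavior: "pooled q98 − q02" is taken as the largest q98 minus the smallest q02 across models, and the tie-break is taken to favor the model with more resolved questions. The names normalize_crps and rank_models are hypothetical. The sketch does not clip the ratio, which can exceed 1 when the outcome falls far outside every model's range.

```python
import numpy as np


def normalize_crps(crps, all_model_quantiles):
    """Divide a CRPS by the pooled forecast range for one question.

    all_model_quantiles: array of shape (n_models, 21), lowest quantile first.
    ASSUMPTION: pooled range = max of the top quantiles - min of the bottom ones.
    """
    f = np.asarray(all_model_quantiles, dtype=float)
    pooled_range = f[:, -1].max() - f[:, 0].min()
    if pooled_range <= 0.0:
        raise ValueError("pooled range must be positive")
    return crps / pooled_range


def rank_models(scores_by_model):
    """Rank by mean normalized score (lower is better).

    ASSUMPTION: ties go to the model with more resolved questions.
    Returns a list of (model, mean_score, sample_count) tuples.
    """
    rows = [
        (model, float(np.mean(scores)), len(scores))
        for model, scores in scores_by_model.items()
        if len(scores) > 0
    ]
    return sorted(rows, key=lambda row: (row[1], -row[2]))


# Hypothetical data: two models' quantile forecasts on one numeric question.
forecasts = np.vstack([np.linspace(0.0, 10.0, 21), np.linspace(2.0, 14.0, 21)])
normalized = normalize_crps(7.0, forecasts)

leaderboard = rank_models({
    "model_a": [0.10, 0.20],
    "model_b": [0.25],
    "model_c": [0.25, 0.25],
})
```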

Choice of scoring family

Aleatoric uses the Brier family rather than log-loss for two reasons.

Boundedness. Log-loss −log p(y) is unbounded: a single near-zero probability on the realized outcome yields an arbitrarily large penalty (infinite at exactly zero) and can dominate the leaderboard regardless of performance on other questions. The Brier family bounds per-question scores in [0, 1].
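
The contrast is easy to see numerically. The sketch below compares the two penalties as the probability assigned to the realized outcome shrinks; the helper names are illustrative.

```python
import math


def log_loss(p_realized):
    """Negative log of the probability assigned to the realized outcome."""
    return -math.log(p_realized)


def brier_realized(p_realized):
    """Binary Brier score when the realized outcome was given probability p_realized."""
    return (1.0 - p_realized) ** 2


# Penalties for increasingly overconfident misses.
comparison = [
    (p, log_loss(p), brier_realized(p))
    for p in (0.1, 0.01, 1e-6)
]
for p, ll, bs in comparison:
    print(f"p={p:g}  log-loss={ll:.3f}  Brier={bs:.6f}")
```

The log-loss column keeps growing as p shrinks, while the Brier column approaches but never exceeds 1.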

Partial credit on ordered and continuous outcomes. CRPS and RPS reward proximity, which corresponds to the intuition that calibrated near-miss forecasts are more informative than far-miss ones.

The tradeoff is reduced sensitivity to overconfidence relative to log-loss. Aleatoric accepts this in exchange for stable cross-question aggregation when individual models occasionally assign near-zero probability to realized outcomes.

Limitations

Resolution at scale. Question resolution is currently performed manually against primary sources. Automated resolution that does not introduce search-induced contamination remains an open research problem.

Sample size and calendar time. Questions become scoreable only after they resolve in the world, so sample size grows with calendar time, not with compute. Calibration metrics (reliability diagrams, sharpness, log skill scores) are sensitive to small samples and will not be reported until the resolved-question set crosses an adequate threshold.

Curation bias. Questions are selected for decision relevance and genuine uncertainty, not as a uniform sample of forecasting problems. This editorial selection is part of the methodology and is not corrected for.

Model coverage. The participant set is limited to models exposed through OpenRouter and direct integrations. Release-date metadata for the capability-over-time plot is best-effort and updated as authoritative dates become available.

Maintainer

Aleatoric is developed and maintained by Agape Keleta. Methodology, code, and corpus are open. Contact: X / Twitter.

References

Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1).
Epstein, E. S. (1969). A Scoring System for Probability Forecasts of Ranked Categories. Journal of Applied Meteorology, 8(6).
Gneiting, T., and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477).
Keelin, T. W. (2016). The Metalog Distributions. Decision Analysis, 13(4).
Tetlock, P. E., and Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
Halawi, D., et al. (2024). Approaching Human-Level Forecasting with Language Models. arXiv:2402.18563.
Karger, E., et al. (2024). ForecastBench. arXiv:2409.19839.