About

Aleatoric is a research benchmark that measures how well language models forecast real-world events. Every question tests genuine prediction under uncertainty, a task that cannot be solved by retrieval or memorization.

What we measure

Each forecast is scored on three axes.

  • Accuracy. Brier score on resolved YES/NO outcomes. Lower is better. Human superforecasters average roughly 0.085 on mixed binary questions.
  • Calibration. Does the model's stated 90% credible interval actually contain the truth 90% of the time? Overconfidence is also punished quadratically by the Brier score: a confident wrong answer costs far more than a hedged one.
  • Disagreement. When models reach different probabilities on the same question, the spread reveals both genuine uncertainty and uneven reasoning quality across systems.
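
As a rough illustration of the three axes, here is a minimal Python sketch. The function names are hypothetical and not part of the benchmark. It assumes outcomes are coded 1 for YES and 0 for NO, that interval coverage is checked against realized numeric values, and that disagreement is summarized as the standard deviation of model probabilities, which is only one of several reasonable choices.

```python
from statistics import mean, pstdev


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))


def interval_coverage(intervals, truths):
    """Fraction of (low, high) intervals that contain the realized value.

    For well-calibrated 90% intervals this should approach 0.9
    over many resolved questions.
    """
    if not intervals or len(intervals) != len(truths):
        raise ValueError("need equally sized, non-empty inputs")
    hits = [low <= t <= high for (low, high), t in zip(intervals, truths)]
    return sum(hits) / len(hits)


def disagreement(model_probs):
    """Spread of model probabilities on one question (assumed metric)."""
    return pstdev(model_probs)


# A confident miss versus a hedged miss on a question that resolved NO:
confident = brier_score([0.95], [0])  # approximately 0.9025
hedged = brier_score([0.60], [0])     # approximately 0.36
```

Because the penalty is squared, the confident miss costs about two and a half times as much as the hedged one.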

Why this matters

Most high-stakes decisions are probabilistic forecasts in disguise. Central banks set policy on inflation forecasts. Drug developers allocate billions against trial-success odds. Governments prepare for pandemics, conflicts, and climate shocks based on tail probabilities. Investors, founders, and policymakers all operate on implicit probabilities they rarely make explicit.

Today the best forecasters are a few hundred humans. Tetlock's superforecasters outperform intelligence analysts on geopolitical questions, but they are a scarce resource and their attention does not scale. Automated forecasting at superforecaster quality would be genuinely useful. It would democratize calibrated foresight for the questions humans never reach, give small organizations and individuals the reasoning tools currently available only to well-resourced institutions, and make the chain of reasoning auditable and repeatable in ways human forecasts cannot be.

If LLMs can reach this bar, they become something more than chatbots or retrieval tools. They become decision-grade reasoners over the real world.

Why this is a real reasoning test

Most LLM benchmarks (MMLU, GPQA, HumanEval) largely reward recall and retrieval. Their answers exist somewhere in training data or on the web. Forecasting is different. The answer to "will the Fed cut rates at the June FOMC?" does not exist yet. It cannot be memorized, and it cannot be looked up.

Producing a calibrated probability requires the full stack of general reasoning: decomposing a complex question into tractable sub-questions, pulling base rates from historical reference classes, weighing conflicting evidence across long horizons, updating priors in degrees rather than jumps, and distinguishing between what the model knows and what it does not. Forecasting forces models into open-ended, long-horizon, multi-source reasoning whose quality is tightly coupled to how well they model the world.

Strong performance here is a meaningful signal. It resists the usual benchmark failure modes: training-data contamination, memorization, and optimization against a fixed answer key.

How it works

Each run sends the question to a superforecaster-style agent. It operationalizes the resolution criteria, establishes a base rate from historical reference classes, gathers structural evidence across multiple causal branches, actively seeks disconfirming data, then commits to a point estimate and a 90% credible interval with cited reasoning.
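
To make the shape of a committed forecast concrete, here is a minimal Python sketch of a record an agent like this might emit. The class and field names are hypothetical and not the benchmark's actual schema. The sketch also assumes the 90% credible interval is expressed on the probability scale for a YES/NO question, which the description above does not specify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Forecast:
    """Hypothetical record for one committed forecast."""

    question: str
    base_rate: float               # prior from a historical reference class
    point_estimate: float          # committed probability of YES
    ci_low: float                  # lower bound of the 90% credible interval
    ci_high: float                 # upper bound of the 90% credible interval
    citations: Tuple[str, ...] = ()  # sources backing the reasoning

    def __post_init__(self):
        # Every probability-like field must be a valid probability.
        for name in ("base_rate", "point_estimate", "ci_low", "ci_high"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        # The point estimate has to sit inside its own interval.
        if not self.ci_low <= self.point_estimate <= self.ci_high:
            raise ValueError("point estimate must lie inside the interval")
```

Validating at construction time means a malformed forecast fails before it reaches scoring, rather than silently distorting the calibration statistics.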

The agent has tools for web search, full-page extraction, Fermi estimation, reference-class enumeration, and structured dataset compilation. Prediction-market, tournament, and forecast-publishing sources are filtered out at the search layer, so the model produces an independent estimate rather than anchoring on an existing market.
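
A search-layer filter of this kind could be sketched as follows. This is an illustration only: the blocklist entries are placeholder domains, not the sources the benchmark actually filters, and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Placeholder blocklist (assumption): stand-ins for prediction-market,
# tournament, and forecast-publishing sites.
BLOCKED_DOMAINS = {"markets.example", "tournament.example"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_results(urls, blocked=BLOCKED_DOMAINS):
    """Drop blocked search results, preserving the original order."""
    return [u for u in urls if not is_blocked(u, blocked)]
```

Matching on the parsed hostname rather than on a raw substring avoids dropping unrelated sites whose names merely contain a blocked string.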

Further reading