1 of 32

  • Recent AI forecasting results imply we’ll soon have abundant, rapid, high-quality forecasting.

  • Superforecasting ‘too cheap to meter’ can unlock major use cases in science and government
    • Government by Active Inference
    • FDA
    • Human Subject Research
    • Potential Scary Stuff

A Predictive Revolution?

Josh Morrison

Manifest

6.9.2024

2 of 32

Recent AI Forecasting Results

Halawi, 2024
  Title: Approaching Human-Level Forecasting with Language Models
  Basic result: An automated LLM forecasting process comes close (Brier score .179) to matching human crowds (.149) on a database of binary questions.

Tetlock, 2024
  Title: Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
  Basic result: Averaging different models' answers to forecasting questions achieves accuracy similar to averaging human forecasters in a Metaculus tournament.

Luo, 2024
  Title: Large language models surpass human experts in predicting neuroscience results
  Basic result: An LLM tuned on neuroscience literature outperforms neuroscientists in picking between real and fake neuroscience abstract results.

Kim, 2024
  Title: Financial Statement Analysis with Large Language Models
  Basic result: When provided with anonymized public company financial statements, GPT-4 outperforms contemporary financial analysts at predicting whether earnings will increase or decrease in the following year.
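
For readers unfamiliar with the Brier score quoted in the Halawi result, here is a minimal sketch of how it is computed for binary questions. The function name and the example numbers are illustrative assumptions, not data from any of the papers.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need equally many forecasts and outcomes, at least one")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Made-up forecasts for three resolved binary questions (not data from any paper).
example_forecasts = [0.9, 0.2, 0.6]
example_outcomes = [1, 0, 0]
print(f"Brier score: {brier_score(example_forecasts, example_outcomes):.3f}")
```

Lower is better: a perfect forecaster scores 0, and always answering 50% scores 0.25, so both .179 and .149 beat that uninformed baseline.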

3 of 32

  • Fine-tunes an LLM on forecasting dataset
  • Creates workflow for LLM to generate forecast
  • Runs LLM on forecasting questions that post-date training window but have resolved
  • Finds performance slightly worse than the human crowd overall
  • Matches the human crowd when it forecasts selectively, playing to its strengths: low-certainty predictions, forecasts made soon after a question opens, and questions with ample sources.

4 of 32

5 of 32

6 of 32

7 of 32

  • One-shot forecast with detailed prompt
  • Averages forecasts from twelve models
  • Compares average model forecast to median human forecast in a tournament with 31 binary questions.
  • LLM crowd (Brier score 0.20) and human crowd (0.19) are statistically similar.
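
The comparison described above can be sketched roughly as follows: average the model forecasts per question, take the median human forecast per question, and score both with the Brier score. All numbers and names here are made-up assumptions for illustration, not the tournament data.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def crowd_brier(
    forecasts_per_question: Sequence[Sequence[float]],
    outcomes: Sequence[int],
    aggregate: Callable[[Sequence[float]], float],
) -> float:
    """Aggregate individual forecasts per question, then average the squared errors."""
    scores: List[float] = []
    for forecasts, outcome in zip(forecasts_per_question, outcomes):
        crowd_forecast = aggregate(forecasts)
        scores.append((crowd_forecast - outcome) ** 2)
    return mean(scores)


# Hypothetical data: two binary questions, three forecasters per crowd.
model_forecasts = [[0.7, 0.8, 0.6], [0.3, 0.2, 0.4]]
human_forecasts = [[0.9, 0.6, 0.75], [0.1, 0.35, 0.2]]
outcomes = [1, 0]

llm_crowd_score = crowd_brier(model_forecasts, outcomes, mean)      # mean of models
human_crowd_score = crowd_brier(human_forecasts, outcomes, median)  # median of humans
print(f"LLM crowd: {llm_crowd_score:.4f}, human crowd: {human_crowd_score:.4f}")
```

Whether a gap between two such scores is meaningful depends on the number of questions; with only 31 questions, small differences are hard to distinguish from noise.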

8 of 32

9 of 32

10 of 32

  • A neuroscience prediction benchmark (BrainBench), which pairs a real neuroscience abstract with a faked version and asks test-takers to pick the real one

  • Both a set of four generalist LLMs (Llama2, Mistral, Falcon, Galactica; 81.4% average accuracy) and an LLM fine-tuned on neuroscience texts (Llama2; 84%) beat individual human experts (63.4% average accuracy, 171 neuroscientists)
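
One way a language model can take a two-alternative test like this is to pick the version of the abstract it finds less surprising, that is, the one with lower perplexity. The sketch below assumes that selection rule and assumes per-token log-probabilities are already available; the slide does not specify the scoring method, and all names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_real(logprobs_a: Sequence[float], logprobs_b: Sequence[float]) -> str:
    """Choose the abstract version the model finds less surprising."""
    return "A" if perplexity(logprobs_a) < perplexity(logprobs_b) else "B"


def accuracy(items: List[Tuple[Sequence[float], Sequence[float], str]]) -> float:
    """Fraction of items where the chosen version matches the true label."""
    correct = sum(pick_real(a, b) == label for a, b, label in items)
    return correct / len(items)


# Hypothetical per-token log-probabilities for two test items.
test_items = [
    ([-0.1, -0.2, -0.1], [-1.0, -2.0, -1.5], "A"),  # model prefers A, A is real
    ([-0.3, -0.2, -0.4], [-0.9, -0.8, -0.7], "B"),  # model prefers A, B is real
]
print(f"accuracy: {accuracy(test_items):.2f}")
```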

11 of 32

12 of 32

13 of 32

14 of 32

15 of 32

  • Uses anonymized dataset of public company financial statements (n=150,678) and analyst predictions
  • Provides chain-of-thought prompt asking GPT-4 to predict whether earnings will go up or down
  • Compares LLM prediction against median contemporary human analyst prediction.
  • LLM (60.31% accurate) beats the median human forecast (52.94%) and is similar to a specialized artificial neural network (60.45%)

16 of 32

17 of 32

18 of 32

19 of 32

  • Perception can be modeled as a predictive model of the world that is continually updated by sensory evidence
  • This “Bayesian brain” seeks to minimize prediction error.
  • Prediction also drives action. Organisms will take actions when they predict the actions will achieve world-states that are positive for them.

A Detour into Neuroscience:

Predictive Processing and Active Inference
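
The idea of minimizing prediction error can be illustrated with a toy update rule: nudge the current belief toward each new observation in proportion to the error. This is a deliberately simplified stand-in (a delta rule with an assumed learning rate), not a full predictive-processing or free-energy model.

```python
from typing import List, Tuple


def update_belief(
    prediction: float, observation: float, learning_rate: float = 0.2
) -> Tuple[float, float]:
    """Move the prediction toward the observation; return (new prediction, error)."""
    error = observation - prediction
    return prediction + learning_rate * error, error


def track(observations: List[float], initial_belief: float = 0.0) -> Tuple[float, List[float]]:
    """Run the update over a stream of observations and record each error."""
    belief = initial_belief
    errors: List[float] = []
    for observation in observations:
        belief, error = update_belief(belief, observation)
        errors.append(error)
    return belief, errors


# A constant sensory signal of 10.0: the prediction error shrinks every step.
final_belief, errors = track([10.0] * 5)
print(f"final belief: {final_belief:.4f}")
print("errors:", [round(e, 3) for e in errors])
```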

20 of 32

[Diagram: a cycle linking Predict, Act, and Sense]

A Detour into Neuroscience:

Predictive Processing and Active Inference

21 of 32

  • Governments mostly lack dynamic systems linking perception and action*
  • Electoral mechanisms are slow and noisy
  • Political consequences make accurate information harder to collect and internalize
  • New information is not self-executing

* Monetary policy by the Federal Reserve is a useful counterexample

Government by Active Inference

22 of 32

  1. Take a policy goal where success would save the government money (e.g. increasing kidney transplants or reducing criminal recidivism)
  2. Create a fund of money to spend on programs predicted to achieve the goal
  3. Monitor results and update the predictive model
  4. Repopulate the fund with money saved by the funded programs
  5. Repeat

Government by Active Inference:

What Might This Look Like?
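
The five steps above can be sketched as a small simulation. Everything here is a hypothetical assumption for illustration: the program names, the grant size, the "true" returns (which a real funder would not know in advance), and the simple error-driven update rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Program:
    name: str
    predicted_return: float  # predicted dollars saved per dollar spent
    true_return: float       # hypothetical ground truth, unknown to the funder


def run_fund(
    programs: List[Program],
    fund: float,
    rounds: int,
    grant: float = 100.0,
    learning_rate: float = 0.5,
) -> Tuple[float, List[str]]:
    """Repeat: fund the best-predicted program, observe savings, update, replenish."""
    funded: List[str] = []
    for _ in range(rounds):
        if fund < grant:
            break  # the fund cannot cover another grant
        # Step 2: spend on the program predicted to achieve the most.
        chosen = max(programs, key=lambda p: p.predicted_return)
        fund -= grant
        # Step 3: monitor the result and update the predictive model.
        realized = chosen.true_return  # deterministic here for simplicity
        chosen.predicted_return += learning_rate * (realized - chosen.predicted_return)
        # Step 4: repopulate the fund with the money the program saved.
        fund += grant * realized
        funded.append(chosen.name)
    return fund, funded


programs = [
    Program("A", predicted_return=1.5, true_return=0.8),  # over-predicted
    Program("B", predicted_return=1.2, true_return=1.3),
]
final_fund, funded = run_fund(programs, fund=300.0, rounds=4)
print(funded, round(final_fund, 2))
```

In this toy run the over-predicted program is funded once, its prediction is corrected downward, and later grants shift to the program that actually saves money.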

23 of 32

[Diagram: a cycle linking Predict, Act, and Sense]

Government by Active Inference

24 of 32

  • Automated forecasting depoliticizes decision-making by taking it out of human hands
  • Requires a systematized, repeatable process with an observable and reliable track record
  • Requires an accurate sensory apparatus to observe success or failure and update the predictive model.

Government by Active Inference:

Where Does Automated Forecasting Fit?

25 of 32

  1. Fine-tune an LLM on the FDA’s massive internal record of clinical trial and drug approval applications
  2. Predict safety and efficacy results of new trials and drug applications
  3. Approve applications when predicted results exceed certain thresholds
  4. Gather real-world evidence to validate prediction and iteratively improve predictive model

Case Study: FDA Policy Reform
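
Step 3 above amounts to a threshold rule on predicted outcomes. The sketch below is purely illustrative: the field names and threshold values are assumptions, not FDA criteria, and the predictions would come from the fine-tuned model in step 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    efficacy: float      # predicted probability the drug meets its efficacy endpoint
    serious_harm: float  # predicted probability of a serious safety problem


def meets_thresholds(
    prediction: Prediction,
    min_efficacy: float = 0.8,  # hypothetical threshold
    max_harm: float = 0.05,     # hypothetical threshold
) -> bool:
    """True when predicted efficacy is high enough and predicted harm is low enough."""
    return prediction.efficacy >= min_efficacy and prediction.serious_harm <= max_harm


# Hypothetical applications with made-up model predictions.
applications = {
    "drug-1": Prediction(efficacy=0.91, serious_harm=0.02),
    "drug-2": Prediction(efficacy=0.85, serious_harm=0.10),
    "drug-3": Prediction(efficacy=0.60, serious_harm=0.01),
}
for name, prediction in applications.items():
    print(name, "approve" if meets_thresholds(prediction) else "do not auto-approve")
```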

26 of 32

  1. Assemble institutions that commit to:
     (a) install indoor air quality interventions,
     (b) conduct surveillance on results, and
     (c) update installations when predicted to be cost-beneficial.

  • This creates a virtuous cycle between collecting more data, identifying more efficient air cleaning strategies, and improving confidence in the predicted results.
  • This cycle incentivizes ever-wider participation in the buyers club.

Case Study: Indoor Air Quality Buyers Club

“Safe Air Research Alliance”

27 of 32

  • Predictive models can guide which experiments to try.
  • Experiments can be conducted that maximally reduce uncertainty and expected model error.

Active Inference and Science
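
A simple proxy for "which experiment reduces uncertainty most" is to pick the one whose predicted outcome is least certain. The sketch below assumes each candidate experiment has a binary outcome and a predicted success probability, and ranks candidates by the entropy of that prediction. The experiment names and probabilities are hypothetical, and a fuller treatment would use expected information gain about the model's parameters.

```python
import math
from typing import Dict


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def most_informative(predicted_success: Dict[str, float]) -> str:
    """Name of the experiment whose predicted outcome is most uncertain."""
    return max(predicted_success, key=lambda name: binary_entropy(predicted_success[name]))


# Hypothetical predicted success probabilities for three candidate experiments.
candidates = {"experiment-1": 0.95, "experiment-2": 0.55, "experiment-3": 0.10}
print(most_informative(candidates))
```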

What Could Prediction of Clinical Trials Unlock?

28 of 32

  • Generating human data is enormously expensive
  • Randomized controlled trials may not predict real-world performance
  • Randomized controlled trials may not motivate action.
  • Predictive methods may sometimes provide a better alternative.

Active Inference and Science

29 of 32

Clinical Trials: Sketch of a Predictive Alternative

  1. Predict the results of an intervention
  2. Conduct the intervention
  3. Observe the mismatch between prediction and results and update the prediction
  4. Repeat
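
The loop above can be sketched with a Beta-Bernoulli model, in which the prediction is the expected success rate and each round of observed results updates it. The prior and the per-round counts are made-up assumptions for illustration only.

```python
from typing import List, Tuple


def predict(alpha: float, beta: float) -> float:
    """Predicted success rate: the mean of a Beta(alpha, beta) belief."""
    return alpha / (alpha + beta)


def update(alpha: float, beta: float, successes: int, failures: int) -> Tuple[float, float]:
    """Standard Beta-Bernoulli update: add observed counts to the belief."""
    return alpha + successes, beta + failures


def run_loop(
    alpha: float, beta: float, rounds: List[Tuple[int, int]]
) -> Tuple[float, float, List[float]]:
    """Predict, observe, record the mismatch, update, and repeat."""
    mismatches: List[float] = []
    for successes, failures in rounds:
        predicted = predict(alpha, beta)                # 1. predict the result
        observed = successes / (successes + failures)   # 2. conduct and observe
        mismatches.append(observed - predicted)         # 3. measure the mismatch
        alpha, beta = update(alpha, beta, successes, failures)  # 3. update
    return alpha, beta, mismatches


# Hypothetical weak prior (50% expected success) and two rounds of results.
alpha, beta, mismatches = run_loop(2.0, 2.0, [(7, 3), (8, 2)])
print(f"updated prediction: {predict(alpha, beta):.3f}")
print("mismatches:", [round(m, 3) for m in mismatches])
```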

30 of 32

Features of a Predictive Alternative

  • Requires predicting both the trial outcome and the real-world effect
  • Requires eventual real-world evidence to validate the model
  • Is poorly suited to ruling out placebo effects
  • Makes the most sense where the intervention is likely to be valuable and costs of randomized testing are high
    • i.e., part of the appeal is that it harvests the benefit of the intervention to justify the costs of the experiment

31 of 32

Galaxy-Brained Dystopian Speculation

Prediction and Social Science

  • In the long term, massively improved prediction would be socially transformative and could reduce human freedom
  • Precise understanding of human behavior would have unpredictable effects on people’s self-understanding
  • Reliable prediction of what inputs produce specific behavioral outputs could enable social coercion
