1 of 13

How accurate are Open Phil's predictions?

Javier Prieto

GPI Workshop on Forecasting Existential Risks and the Long-term Future

(June 2022)

2 of 13

Outline

  1. Forecasting at Open Phil
    1. Our forecasting workflows
    2. Motivation
  2. Some of our stats (as of March 2022)
    • Volume
    • Calibration
    • Accuracy
  3. Caveats and sources of bias
  4. Closing thoughts

3 of 13

User interface (a table in a Google doc)

Predictions (first three columns) | Scoring (last two columns; you can leave this blank until you're able to score)

| With X% confidence, | I predict that… (yes/no or confidence interval prediction) | …by Y time (ideally a date, not e.g. "in 1 year") | Score (please stick to True / False / Not Assessed) | Comments or caveats about your score |
|---|---|---|---|---|
| 80% | GPI will hire at least 2 postdocs | 2022/7/1 | | |
| 60% | GPI will publish at least 3 papers about population ethics in peer-reviewed journals | 2022/7/1 | | |
| 40% | GPI will run a workshop on forecasting x-risk | 2022/7/1 | True | I’m here! |

4 of 13

Why make predictions?

  1. Improve the accuracy of our predictions over time, which in theory should improve the quality of our grantmaking decisions (on average, in the long run)

  • Clearer communication between grant investigators and decision-makers

  • Help assess grantee performance relative to initial expectations

  • Have “silver standard” feedback loops and accountability mechanisms, even when we can’t estimate our impact precisely (or at all)

5 of 13

Volume

6 of 13

Calibration

7 of 13

Accuracy

Brier Score = E[(p - Y)²]
            = E[(P[Y|p] - p)²] - E[(P[Y|p] - P[Y])²] + P[Y](1 - P[Y])
            = Miscalibration - Resolution + Entropy
            = 0.004 - 0.037 + 0.250
            = 0.217

where p is the stated forecast probability, Y ∈ {0, 1} is the resolved outcome, P[Y] is the overall base rate of true predictions, and P[Y|p] is the observed frequency of true predictions among those made with probability p.
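
Below is a minimal sketch of how this decomposition can be computed from a log of scored predictions. It assumes each prediction is a (stated probability, binary outcome) pair and estimates P[Y|p] by grouping predictions that share the same stated probability; the function and variable names are illustrative, not part of Open Phil's actual tooling.

```python
from collections import defaultdict

def murphy_decomposition(predictions):
    """Brier score and its Miscalibration - Resolution + Entropy decomposition.

    `predictions` is a list of (p, y) pairs: p is the stated probability,
    y is the resolved outcome (1 = True, 0 = False).
    """
    n = len(predictions)
    base_rate = sum(y for _, y in predictions) / n          # P[Y]
    brier = sum((p - y) ** 2 for p, y in predictions) / n   # E[(p - Y)²]

    # Estimate P[Y|p] by grouping predictions with the same stated probability.
    groups = defaultdict(list)
    for p, y in predictions:
        groups[p].append(y)

    miscalibration = 0.0
    resolution = 0.0
    for p, outcomes in groups.items():
        freq = sum(outcomes) / len(outcomes)   # empirical P[Y|p]
        weight = len(outcomes) / n
        miscalibration += weight * (freq - p) ** 2
        resolution += weight * (freq - base_rate) ** 2

    entropy = base_rate * (1 - base_rate)      # P[Y](1 - P[Y])
    return brier, miscalibration, resolution, entropy

# brier == miscalibration - resolution + entropy (up to floating-point error)
brier, miscal, res, ent = murphy_decomposition([(0.8, 1), (0.6, 0), (0.4, 1)])
```

Grouping by the exact stated probability makes the identity exact; binning forecasts (e.g. into deciles) would trade some exactness for smoother estimates of P[Y|p].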

8 of 13

Caveats and sources of bias

  1. Predictions are typically written and later scored by the same person. There's no easy way around this because some predictions are very in-the-weeds or can only be scored by directly asking the grantee, and the grant investigator who made the prediction is typically their point of contact at Open Phil.

  • There may be selection effects on which predictions have been scored. Predictions with overdue scores are typically those associated with active grants (we usually wait until grant closure / renewal to ask grant investigators for scores), so the predictions that have been scored may not be a representative sample of all predictions.

  • The analyses presented here are completely exploratory. All hypotheses were put forward after looking at the data, so this whole exercise is better thought of as "narrative speculation" than "scientific hypothesis testing."

9 of 13

Closing thoughts

  1. We may want to implement changes to improve the accuracy and relevance of our forecasts, but we're bottlenecked on grant investigator time, which in practice means that:
    a. We can't force them to switch apps, so there needs to be a process in place to extract predictions from grant writeups and log them in a database
    b. They can't spend too much time coming up with clairvoyant statements or precise probability estimates, so questions will sometimes be vague or ambiguous and forecasts will sometimes be very rough or close to base rates
    c. Some predictions have to be left unscored because figuring out how they resolved would take too much work
    d. (a)-(c) could be ameliorated by bringing in external parties to help write questions, make forecasts, and resolve them, but this is problematic for several reasons:
      1. Confidentiality concerns
      2. Predictions are very in-the-weeds, so we don't expect external parties to have great insight without a lot of background knowledge, which presumably can only be acquired through conversations with our grant investigators
      3. The risk of harming relationships with our grantees if external parties are too demanding when trying to collect scores
    e. Maybe grant investigators should spend less time on grant-level predictions and more time on broad, strategy-level questions
  2. Not everyone at Open Phil finds writing predictions valuable, given (i) their disconnect from decisions and (ii) the rigor-relevance trade-off
    • re: (i), I'm personally very scared of Goodhart's Law, especially given that grant investigators choose and score predictions themselves, so there's plenty of room for gaming (even if subconsciously)
    • re: (ii), I have no interesting thoughts; maybe bringing in experienced question writers would help, but see above for why incorporating outsiders would be tricky
  3. Good foresight ≠ good decisions because (among other things) the probability of an event is unrelated to its importance
    • This is a problem with forecasting polls in general: one can't unilaterally decide to stake more credibility on more important questions
    • (a) Rules awarding more points on more popular questions (à la Metaculus) get around this if you buy that importance ~ popularity
    • (b) (Prediction) markets also get around this in a similar way (importance ~ liquidity)
    • Both (a) and (b) require the existence of external parties betting on the same questions, be it a crowd of forecasters (Metaculus) or a market maker plus a crowd of traders (markets); this is very different from our current setup, where every question gets only one forecast (see the toy weighted-scoring sketch below)
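
To make the importance-weighting idea in (a) and (b) concrete, here is a toy sketch of an importance-weighted aggregate Brier score, where each question's weight stands in for popularity (Metaculus-style points) or liquidity (markets). This is purely illustrative: it is not Metaculus's actual scoring rule or a market mechanism, just a way of seeing how weighting lets important questions count for more.

```python
def weighted_brier(predictions):
    """Aggregate Brier score with per-question importance weights.

    `predictions` is a list of (p, y, weight) triples: stated probability,
    binary outcome, and a weight proxying the question's importance
    (e.g. popularity or liquidity). With equal weights this reduces to the
    ordinary mean Brier score, where every question counts the same
    regardless of how much it matters.
    """
    total_weight = sum(w for _, _, w in predictions)
    return sum(w * (p - y) ** 2 for p, y, w in predictions) / total_weight
```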

10 of 13

Thank you!

Questions?

☝️EA Forum post☝️

11 of 13

Appendix: Prediction-level Brier scores

¼ of our predictions are worse than chance

Half of our predictions have an edge of ≥10%

¼ of our predictions have an edge of ≥25%

Q1 = .25² = .0625

Q2 = .4² = .16

Q3 = .5² = .25

We don’t seem to be getting more accurate over time

Predictions cluster around “round numbers” + 25% & 75%
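
A per-prediction Brier score b = (p - Y)² means the forecast was √b away from the realized outcome, so its edge over a flat 50% forecast is 0.5 - √b; that is how the quartiles above map to the edge figures. A minimal sketch (function name illustrative):

```python
import math

def edge_over_chance(per_prediction_brier):
    """Edge over a 50% forecast implied by a per-prediction Brier score (p - y)²."""
    return 0.5 - math.sqrt(per_prediction_brier)

# Quartiles reported above:
for label, q in [("Q1", 0.0625), ("Q2", 0.16), ("Q3", 0.25)]:
    print(label, edge_over_chance(q))   # -> 0.25, 0.10, 0.00 (no better than chance)
```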

12 of 13

Appendix: Range distribution (all predictions)

13 of 13

Appendix: Accuracy vs Range (scored predictions only)