1 of 13

How accurate are Open Phil's predictions?

Javier Prieto

GPI Workshop on Forecasting Existential Risks and the Long-term Future

(June 2022)

2 of 13

Outline

  1. Forecasting at Open Phil
    1. Our forecasting workflows
    2. Motivation
  2. Some of our stats (as of March 2022)
    • Volume
    • Calibration
    • Accuracy
  3. Caveats and sources of bias
  4. Closing thoughts

3 of 13

User interface (a table in a Google doc)

Predictions (first three columns) | Scoring (last two columns; you can leave this blank until you're able to score)

| With X% confidence, | I predict that… (yes/no or confidence interval prediction) | …by Y time (ideally a date, not e.g. "in 1 year") | Score (please stick to True / False / Not Assessed) | Comments or caveats about your score |
|---|---|---|---|---|
| 80% | GPI will hire at least 2 postdocs | 2022/7/1 | | |
| 60% | GPI will publish at least 3 papers about population ethics in peer-reviewed journals | 2022/7/1 | | |
| 40% | GPI will run a workshop on forecasting x-risk | 2022/7/1 | True | I’m here! |

4 of 13

Why make predictions?

  1. Improve the accuracy of our predictions over time, which in theory should improve the quality of our grantmaking decisions (on average, in the long run)

  • Clearer communication between grant investigators and decision-makers

  • Help assess grantee performance relative to initial expectations

  • Have “silver standard” feedback loops and accountability mechanisms, even when we can’t estimate our impact precisely (or at all)

5 of 13

Volume

6 of 13

Calibration

7 of 13

Accuracy

Brier Score = E[(p - Y)²]
            = E[(P[Y|p] - p)²] - E[(P[Y|p] - P[Y])²] + P[Y](1 - P[Y])
            = Miscalibration - Resolution + Entropy
            = 0.004 - 0.037 + 0.250
            = 0.217

where p is the stated forecast probability, Y ∈ {0, 1} is the resolved outcome, P[Y] is the overall base rate of true predictions, and P[Y|p] is the observed frequency of true predictions among those made with probability p.
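
Below is a minimal sketch of how this decomposition can be computed from a log of scored predictions. It assumes each prediction is a (stated probability, binary outcome) pair and estimates P[Y|p] by grouping predictions that share the same stated probability; the function and variable names are illustrative, not part of Open Phil's actual tooling.

```python
from collections import defaultdict

def murphy_decomposition(predictions):
    """Brier score and its Miscalibration - Resolution + Entropy decomposition.

    `predictions` is a list of (p, y) pairs: p is the stated probability,
    y is the resolved outcome (1 = True, 0 = False).
    """
    n = len(predictions)
    base_rate = sum(y for _, y in predictions) / n          # P[Y]
    brier = sum((p - y) ** 2 for p, y in predictions) / n   # E[(p - Y)²]

    # Estimate P[Y|p] by grouping predictions with the same stated probability.
    groups = defaultdict(list)
    for p, y in predictions:
        groups[p].append(y)

    miscalibration = 0.0
    resolution = 0.0
    for p, outcomes in groups.items():
        freq = sum(outcomes) / len(outcomes)   # empirical P[Y|p]
        weight = len(outcomes) / n
        miscalibration += weight * (freq - p) ** 2
        resolution += weight * (freq - base_rate) ** 2

    entropy = base_rate * (1 - base_rate)      # P[Y](1 - P[Y])
    return brier, miscalibration, resolution, entropy

# brier == miscalibration - resolution + entropy (up to floating-point error)
brier, miscal, res, ent = murphy_decomposition([(0.8, 1), (0.6, 0), (0.4, 1)])
```

Grouping by the exact stated probability makes the identity exact; binning forecasts (e.g. into deciles) would trade some exactness for smoother estimates of P[Y|p].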

8 of 13

Caveats and sources of bias

  1. Predictions are typically written and later scored by the same person. There's no easy way around this because some predictions are very in-the-weeds or can only be scored by directly asking the grantee, and the grant investigator who made the prediction is typically their point of contact at Open Phil.

  • There may be selection effects on which predictions have been scored. Predictions with overdue scores are typically those associated with active grants (we usually wait until grant closure / renewal to ask grant investigators for scores), so the predictions that have been scored may not be a representative sample of all predictions.

  • The analyses presented here are completely exploratory. All hypotheses were put forward after looking at the data, so this whole exercise is better thought of as "narrative speculation" than "scientific hypothesis testing."

9 of 13

Closing thoughts

  1. We may want to implement changes to improve the accuracy and relevance of our forecasts, but we're bottlenecked on grant investigator time, which in practice means that:
    a. We can't force them to switch apps, so there needs to be a process in place to extract predictions from grant writeups and log them in a database
    b. They can't spend too much time coming up with clairvoyant statements or precise probability estimates, so questions will sometimes be vague or ambiguous and forecasts will sometimes be very rough or close to base rates
    c. Some predictions have to be left unscored because figuring out how they resolved would take too much work
    d. (a)-(c) could be ameliorated by bringing in external parties to help write questions, make forecasts, and resolve them, but this is problematic for several reasons:
      1. Confidentiality concerns
      2. Predictions are very in-the-weeds, so we don't expect external parties to have great insight without a lot of background knowledge, which presumably can only be acquired through conversations with our grant investigators
      3. The risk of harming relationships with our grantees if external parties are too demanding when trying to collect scores
    e. Maybe grant investigators should spend less time on grant-level predictions and more time on broad, strategy-level questions
  2. Not everyone at Open Phil finds writing predictions valuable, given (i) their disconnect from decisions and (ii) the rigor-relevance trade-off
    • re: (i), I'm personally very scared of Goodhart's Law, especially given that grant investigators choose and score predictions themselves, so there's plenty of room for gaming (even if subconsciously)
    • re: (ii), I have no interesting thoughts; maybe bringing in experienced question writers would help, but see above for why incorporating outsiders would be tricky
  3. Good foresight ≠ good decisions because (among other things) the probability of an event is unrelated to its importance
    • This is a problem with forecasting polls in general: one can't unilaterally decide to stake more credibility on more important questions
    • (a) Rules awarding more points on more popular questions (à la Metaculus) get around this if you buy that importance ~ popularity
    • (b) (Prediction) markets also get around this in a similar way (importance ~ liquidity)
    • Both (a) and (b) require the existence of external parties betting on the same questions, be it a crowd of forecasters (Metaculus) or a market maker plus a crowd of traders (markets); this is very different from our current setup, where every question gets only one forecast (see the toy weighted-scoring sketch below)
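
To make the importance-weighting idea in (a) and (b) concrete, here is a toy sketch of an importance-weighted aggregate Brier score, where each question's weight stands in for popularity (Metaculus-style points) or liquidity (markets). This is purely illustrative: it is not Metaculus's actual scoring rule or a market mechanism, just a way of seeing how weighting lets important questions count for more.

```python
def weighted_brier(predictions):
    """Aggregate Brier score with per-question importance weights.

    `predictions` is a list of (p, y, weight) triples: stated probability,
    binary outcome, and a weight proxying the question's importance
    (e.g. popularity or liquidity). With equal weights this reduces to the
    ordinary mean Brier score, where every question counts the same
    regardless of how much it matters.
    """
    total_weight = sum(w for _, _, w in predictions)
    return sum(w * (p - y) ** 2 for p, y, w in predictions) / total_weight
```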

10 of 13

Thank you!

Questions?

☝️EA Forum post☝️

11 of 13

Appendix: Prediction-level Brier scores

¼ of our predictions are worse than chance

Half of our predictions have an edge of ≥10%

¼ of our predictions have an edge of ≥25%

Q1 = .25² = .0625

Q2 = .4² = .16

Q3 = .5² = .25

We don’t seem to be getting more accurate over time

Predictions cluster around “round numbers” + 25% & 75%
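
A per-prediction Brier score b = (p - Y)² means the forecast was √b away from the realized outcome, so its edge over a flat 50% forecast is 0.5 - √b; that is how the quartiles above map to the edge figures. A minimal sketch (function name illustrative):

```python
import math

def edge_over_chance(per_prediction_brier):
    """Edge over a 50% forecast implied by a per-prediction Brier score (p - y)²."""
    return 0.5 - math.sqrt(per_prediction_brier)

# Quartiles reported above:
for label, q in [("Q1", 0.0625), ("Q2", 0.16), ("Q3", 0.25)]:
    print(label, edge_over_chance(q))   # -> 0.25, 0.10, 0.00 (no better than chance)
```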

12 of 13

Appendix: Range distribution (all predictions)

13 of 13

Appendix: Accuracy vs Range (scored predictions only)