1 of 20

Lectures 7-8: Combining Forecasts

Stat 165, Spring 2025

Slides credit: Jacob Steinhardt

2 of 20

Warm-up Question

“How many 8.5 x 11 sheets of paper does the average tree produce?”

  • 5k
  • 50k
  • 10k
  • 20k
  • 4.5k
  • 13k
  • 200k
  • 10k
  • 100k
  • 2.5k
  • 10k

3 of 20

Histogram of Answers

Now that you’ve seen the class’s distribution, what would you guess?

4 of 20

Wisdom of Crowds: Ox Judging Contest (Galton, 1907)

5 of 20

Wisdom of Crowds

Typical error: ± 30-40 lbs

Median answer: 1207 lbs

Correct answer: 1198 lbs

0.7% error!

6 of 20

Ways of Combining Forecasts

What are different techniques for combining forecasts?

  • Mean
  • Median
  • Trimmed mean
  • Weighted mean

General name for this: “ensembling” (also used in machine learning)

7 of 20

Mean

  • For probabilities: take average of the distributions
  • For numerical answers: take average of answers
  • One reason this is good: convexity
    • Brier score: (1-(p+q)/2)² ≤ ½ * ((1-p)² + (1-q)²), so the averaged forecast scores no worse than the average of the individual scores (see the sketch below)
    • Log score also convex
    • Follows from Jensen’s inequality
  • Any possible issues with this?
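A minimal numeric check of the convexity claim (the probabilities and outcome below are illustrative, not from the lecture): the Brier score of an averaged forecast is never worse than the average of the individual Brier scores.

# Sketch: Brier score of an averaged forecast vs. average of the individual scores.
# The probabilities and outcome are made up for illustration.

def brier(p, outcome):
    """Brier score for forecast probability p of an event with outcome 0 or 1."""
    return (outcome - p) ** 2

p, q = 0.9, 0.6              # two forecasters' probabilities for the same event
outcome = 1                  # suppose the event occurred

avg_of_scores = 0.5 * (brier(p, outcome) + brier(q, outcome))
score_of_avg = brier((p + q) / 2, outcome)

print(avg_of_scores)         # 0.085
print(score_of_avg)          # 0.0625 <= 0.085, as Jensen's inequality predicts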

8 of 20

Median

Unlike mean, robust to outliers

Also independent of scale (same on linear or log scale)

Disadvantage: uses data less efficiently (only cares about middle values)

9 of 20

Trimmed Mean

  • Remove top and bottom x% of data, then take mean of the remainder
    • Or remove the x% of data that is furthest from the mean
  • Like median, robust to outliers
  • Like mean, makes use of most of data
  • For probabilities, can implicitly “extremize”:
    • Suppose 95% of class give p = 0.99, and 5% give p = 0.5
    • For x = 5%, the trimmed mean is 0.99, while the plain mean is 0.9655 (see the sketch below)
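A quick sketch reproducing those numbers (the 95%/5% split is from the slide; the class size of 100 is an assumption just to make the counts concrete):

# Sketch: the trimmed mean implicitly "extremizes" probability forecasts.
# Assumes 100 forecasters purely for illustration.

forecasts = [0.99] * 95 + [0.5] * 5      # 95% answer 0.99, 5% answer 0.5

def trimmed_mean(xs, trim_frac):
    """Drop the top and bottom trim_frac of the sorted data, then average the rest."""
    xs = sorted(xs)
    k = int(len(xs) * trim_frac)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

print(sum(forecasts) / len(forecasts))   # 0.9655 (plain mean)
print(trimmed_mean(forecasts, 0.05))     # 0.99   (trimmed mean with x = 5%)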

10 of 20

Weighted Mean

Exercise:

  • What is the atomic number of cadmium?
  • How far off do you think you are? (give as the width of an 80% interval)

The trimmed mean is a special case of the weighted mean, where we assign zero weight to answers that are far from the rest.

  • Implicit reasoning: “answers far from the rest are probably wrong”

11 of 20

Implications for Your Own Forecasting

  • Think of a number a few different times and take the average
    • E.g. I often waffle back and forth on what number to go with; sometimes it’s best to just take the average of the numbers you’ve considered and call it a day
  • When deciding what to believe, weigh the various sources by how much you trust them and take a weighted average
  • Work in teams and take average across team
    • Related idea: the “Delphi method” (later in lecture)

12 of 20

Weighing Experts

“We are changing our call for the February FOMC meeting from a 50 [basis point] hike to a 25bp hike, although we think markets should continue to place some probability on a larger-sized hike.” (source, Jan 18)

Shared by an economist at Citigroup, the 3rd largest banking institution in the US.

“Pricing Wednesday morning pointed to a 94.3% probability of a 0.25 percentage point hike at the central bank’s two-day meeting that concludes Feb. 1, according to CME Group data.” (source, Jan 18)

The CME Group is the world’s largest financial derivatives exchange. Its CME FedWatch Tool uses futures pricing data (the 30-Day Fed Funds futures) to estimate the probabilities of changes to the Fed rate.

“Markets expect the Fed to raise rates again on February 1, 2023, probably by 0.25 percentage points…. However, there’s a reasonable chance the Fed opts for a larger 0.5 percentage point hike.” (source, Jan 2)

Simon Moore is a writer at Forbes. He provides an outsourced Chief Investment Officer service to institutional investors. He previously served as Chief Investment Officer at Moola and FutureAdvisor, both consumer investment startups that were subsequently acquired by S&P 500 firms. He has published two books, is a CFA Charterholder, and was educated at Oxford and Northwestern.

13 of 20

How do we choose the weights?

  • For experts: look at past track record
    • Improvement: track what types of questions they are good at
  • Mathematically: if estimates are unbiased and independent, and estimate i has standard deviation σi, then weight by 1/σi² (see the sketch below)
    • If not independent, downweight estimates that are more correlated with the others
    • Hard to use literally in practice, but good conceptual motivation
    • Special case: finite-sample error [roughly 1/sqrt(k) for k samples]
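A minimal sketch of that weighting rule, assuming we already have each forecaster’s estimate and a standard deviation for it (the numbers are placeholders, not real data):

# Sketch: inverse-variance weighting of unbiased, independent estimates.
# Estimates and sigmas below are illustrative placeholders.

def inverse_variance_mean(estimates, sigmas):
    """Combine estimates x_i with weights proportional to 1/sigma_i^2."""
    weights = [1.0 / s ** 2 for s in sigmas]
    return sum(w * x for w, x in zip(weights, estimates)) / sum(weights)

print(inverse_variance_mean([10.0, 14.0, 12.0], [1.0, 2.0, 4.0]))   # ~10.86, dominated by the most precise estimate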

14 of 20

Cadmium weighted averages

Guess   CI width   Equal weight   Precision (1/σ²)   1/σ
  40       10           1             0.01            0.1
  80       30           1             0.00111         0.0333
  70       40           1             0.000625        0.025
  60       20           1             0.0025          0.05
  60       30           1             0.00111         0.0333
  80       20           1             0.0025          0.05
  50       30           1             0.00111         0.0333
  70       20           1             0.0025          0.05
  55       20           1             0.0025          0.05
  60       30           1             0.00111         0.0333

(Here each guess’s 80% CI width is used as a stand-in for its σ.)

Weighted average (truth = 48):

  • Equal weights: 62.5
  • Precision (1/σ²): 55.21
  • 1/σ: 59.64
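A short sketch reproducing those weighted averages from the guesses and interval widths in the table (the variable names are mine; the data is the class’s):

# Sketch: three ways to weight the class's cadmium guesses (true atomic number: 48).
# Guesses and 80% CI widths are taken from the table above.

guesses = [40, 80, 70, 60, 60, 80, 50, 70, 55, 60]
widths  = [10, 30, 40, 20, 30, 20, 30, 20, 20, 30]

def weighted_mean(xs, ws):
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

equal     = weighted_mean(guesses, [1] * len(guesses))            # 62.5
precision = weighted_mean(guesses, [1 / w ** 2 for w in widths])  # ~55.21
inv_sigma = weighted_mean(guesses, [1 / w for w in widths])       # ~59.64

print(equal, precision, inv_sigma)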

15 of 20

Working in Teams: The Delphi Method

Delphi method:

  • Forecasters individually come up with predictions and reasoning
  • Then provide predictions + reasoning to group
  • Individuals update based on group forecast [potentially multiple rounds]
  • At end, take average of all of the final individual forecasts

Variants:

  • Predictions + reasoning provided anonymously
  • Only reasoning given (not numerical predictions)

Question. Why come up with numbers individually (rather than working collaboratively the whole time)?

16 of 20

Asch experiments (Wikipedia)

17 of 20

Ensembling with Yourself

What was the total annual budget of the US government in FY2022?

Come up with at least 3 distinct approaches to Fermi estimate this. Then decide how to combine the estimates.

18 of 20

Combining Confidence Intervals

What if instead of point estimates, we have 80% confidence intervals?

  • [a1, b1], [a2, b2], … (ai = lower end, bi = upper end)
  • Simplest approach: take trimmed mean of upper/lower ends
  • Alternatives:
    • Mixture of distributions (see the sketch below)
      • Variance of the mixture = average of the variances + variance of the means
      • Implies wider intervals when the means disagree
    • Treat as independent “measurements”
      • Implies narrowing width of intervals; need to be careful to avoid overconfidence
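A minimal sketch of the mixture approach, assuming each forecaster’s 80% interval is read as a normal distribution centered at its midpoint (the ±1.28σ conversion and the example intervals are assumptions, not from the slides):

# Sketch: combining 80% CIs as a mixture of distributions (law of total variance).
# Assumption: an 80% interval [a, b] is read as Normal(mean=(a+b)/2, sigma=(b-a)/(2*1.28)),
# since an 80% normal interval spans roughly +/- 1.28 standard deviations.
# The intervals below are made up for illustration.

import math

intervals = [(40, 60), (55, 95), (45, 75)]   # hypothetical 80% CIs from three forecasters

means  = [(a + b) / 2 for a, b in intervals]
sigmas = [(b - a) / (2 * 1.28) for a, b in intervals]

mix_mean  = sum(means) / len(means)
avg_var   = sum(s ** 2 for s in sigmas) / len(sigmas)               # average of the variances
var_means = sum((m - mix_mean) ** 2 for m in means) / len(means)    # variance of the means
mix_sigma = math.sqrt(avg_var + var_means)                          # wider when the means disagree

# Approximate 80% interval for the combined forecast (normal approximation to the mixture)
print(mix_mean - 1.28 * mix_sigma, mix_mean + 1.28 * mix_sigma)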

19 of 20

Combining Sums

What if we are predicting X + Y, and have confidence intervals for X and Y?

  • If expect errors to be independent, then std(X+Y) = sqrt(std(X)^2 + std(Y)^2)
  • If errors are perfectly correlated, then std(X+Y) = std(X) + std(Y)

For 70%/80% CIs, combining the interval widths with these standard-deviation rules is usually a decent approximation (see the sketch below).

For extreme tails (99% CIs), it can be more complicated:

  • If X and Y are heavy-tailed, the tail event typically comes from just one of X or Y individually
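A small sketch combining two 80% intervals for X and Y into one for X + Y under the two assumptions above (the intervals and the ±1.28σ normal conversion are illustrative assumptions):

# Sketch: 80% CI for X + Y from 80% CIs for X and Y.
# Assumption: an 80% normal interval spans roughly +/- 1.28 standard deviations.
# The intervals below are made up for illustration.

import math

x_lo, x_hi = 100, 140    # hypothetical 80% CI for X
y_lo, y_hi = 30, 70      # hypothetical 80% CI for Y

x_mean, x_sigma = (x_lo + x_hi) / 2, (x_hi - x_lo) / (2 * 1.28)
y_mean, y_sigma = (y_lo + y_hi) / 2, (y_hi - y_lo) / (2 * 1.28)

s_mean = x_mean + y_mean
s_sigma_indep = math.sqrt(x_sigma ** 2 + y_sigma ** 2)   # independent errors: add in quadrature
s_sigma_corr  = x_sigma + y_sigma                        # perfectly correlated errors: add directly

print("independent:", s_mean - 1.28 * s_sigma_indep, s_mean + 1.28 * s_sigma_indep)
print("correlated: ", s_mean - 1.28 * s_sigma_corr,  s_mean + 1.28 * s_sigma_corr)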

20 of 20

Summary

  • Averaging multiple approaches or experts often improves forecasts
  • Assess track record and accuracy of sources to determine weights
  • Consider working in teams and generating independent numbers
  • Combining confidence intervals: several ideas, no silver bullet (yet)