1 of 20

Lectures 7-8: Combining Forecasts

Stat 165, Spring 2025

Slides credit: Jacob Steinhardt

2 of 20

Warm-up Question

“How many 8.5 x 11 sheets of paper does the average tree produce?”

  • 5k
  • 50k
  • 10k
  • 20k
  • 4.5k
  • 13k
  • 200k
  • 10k
  • 100k
  • 2.5k
  • 10k

3 of 20

Histogram of Answers

Now that you’ve seen the class’s distribution, what would you guess?

4 of 20

Wisdom of Crowds: Ox Judging Contest (Galton, 1907)

5 of 20

Wisdom of Crowds

Typical error: ± 30-40 lbs

Median answer: 1207 lbs

Correct answer: 1198 lbs

0.7% error!

6 of 20

Ways of Combining Forecasts

What are different techniques for combining forecasts?

  • Mean
  • Median
  • Trimmed mean
  • Weighted mean

General name for this: “ensembling” (also used in machine learning)

7 of 20

Mean

  • For probabilities: take average of the distributions
  • For numerical answers: take average of answers
  • One reason this is good: convexity
    • Brier score: (1-(p+q)/2)² ≤ ½ * ((1-p)² + (1-q)²), so the averaged forecast scores no worse than the average of the individual scores (see the sketch below)
    • Log score also convex
    • Follows from Jensen’s inequality
  • Any possible issues with this?
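A minimal numeric check of the convexity claim (the probabilities and outcome below are illustrative, not from the lecture): the Brier score of an averaged forecast is never worse than the average of the individual Brier scores.

# Sketch: Brier score of an averaged forecast vs. average of the individual scores.
# The probabilities and outcome are made up for illustration.

def brier(p, outcome):
    """Brier score for forecast probability p of an event with outcome 0 or 1."""
    return (outcome - p) ** 2

p, q = 0.9, 0.6              # two forecasters' probabilities for the same event
outcome = 1                  # suppose the event occurred

avg_of_scores = 0.5 * (brier(p, outcome) + brier(q, outcome))
score_of_avg = brier((p + q) / 2, outcome)

print(avg_of_scores)         # 0.085
print(score_of_avg)          # 0.0625 <= 0.085, as Jensen's inequality predicts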

8 of 20

Median

Unlike mean, robust to outliers

Also independent of scale (same on linear or log scale)

Disadvantage: uses data less efficiently (only cares about middle values)

9 of 20

Trimmed Mean

  • Remove top and bottom x% of data, then take mean of the remainder
    • Or remove the x% of data that is furthest from the mean
  • Like median, robust to outliers
  • Like mean, makes use of most of data
  • For probabilities, can implicitly “extremize”:
    • Suppose 95% of class give p = 0.99, and 5% give p = 0.5
    • For x = 5%, the trimmed mean is 0.99, while the plain mean is 0.9655 (see the sketch below)
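A quick sketch reproducing those numbers (the 95%/5% split is from the slide; the class size of 100 is an assumption just to make the counts concrete):

# Sketch: the trimmed mean implicitly "extremizes" probability forecasts.
# Assumes 100 forecasters purely for illustration.

forecasts = [0.99] * 95 + [0.5] * 5      # 95% answer 0.99, 5% answer 0.5

def trimmed_mean(xs, trim_frac):
    """Drop the top and bottom trim_frac of the sorted data, then average the rest."""
    xs = sorted(xs)
    k = int(len(xs) * trim_frac)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

print(sum(forecasts) / len(forecasts))   # 0.9655 (plain mean)
print(trimmed_mean(forecasts, 0.05))     # 0.99   (trimmed mean with x = 5%)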

10 of 20

Weighted Mean

Exercise:

  • What is the atomic number of cadmium?
  • How far off do you think you are? (give as the width of an 80% interval)

The trimmed mean is a special case of the weighted mean, where we assign zero weight to answers that are far from the rest.

  • Implicit reasoning: “answers far from the rest are probably wrong”

11 of 20

Implications for Your Own Forecasting

  • Think of a number a few different times and take the average
    • E.g. I often waffle back and forth on what number to go with; sometimes it’s best to just take the average of the numbers you’ve considered and call it a day
  • When deciding what to believe, weigh the various sources by how much you trust them and take a weighted average
  • Work in teams and take average across team
    • Related idea: the “Delphi method” (later in lecture)

12 of 20

Weighing Experts

“We are changing our call for the February FOMC meeting from a 50 [basis point] hike to a 25bp hike, although we think markets should continue to place some probability on a larger-sized hike.” (source, Jan 18)

Shared by an economist at Citigroup, the 3rd largest banking institution in the US.

“Pricing Wednesday morning pointed to a 94.3% probability of a 0.25 percentage point hike at the central bank’s two-day meeting that concludes Feb. 1, according to CME Group data.” (source, Jan 18)

The CME Group is the world’s largest financial derivatives exchange. Its CME FedWatch Tool uses futures pricing data (the 30-Day Fed Funds futures) to estimate the probabilities of changes to the Fed rate.

“Markets expect the Fed to raise rates again on February 1, 2023, probably by 0.25 percentage points…. However, there’s a reasonable chance the Fed opts for a larger 0.5 percentage point hike.” (source, Jan 2)

Simon Moore is a writer at Forbes. He provides an outsourced Chief Investment Officer service to institutional investors. He previously served as Chief Investment Officer at Moola and FutureAdvisor, both consumer investment startups that were subsequently acquired by S&P 500 firms. He has published two books, is a CFA Charterholder, and was educated at Oxford and Northwestern.

13 of 20

How do we choose the weights?

  • For experts: look at past track record
    • Improvement: track what types of questions they are good at
  • Mathematically: if estimates are unbiased and independent, and estimate i has standard deviation σi, then weight by 1/σi² (see the sketch below)
    • If not independent, downweight estimates that are more correlated with the others
    • Hard to use literally in practice, but good conceptual motivation
    • Special case: finite-sample error [roughly 1/sqrt(k) for k samples]
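A minimal sketch of that weighting rule, assuming we already have each forecaster’s estimate and a standard deviation for it (the numbers are placeholders, not real data):

# Sketch: inverse-variance weighting of unbiased, independent estimates.
# Estimates and sigmas below are illustrative placeholders.

def inverse_variance_mean(estimates, sigmas):
    """Combine estimates x_i with weights proportional to 1/sigma_i^2."""
    weights = [1.0 / s ** 2 for s in sigmas]
    return sum(w * x for w, x in zip(weights, estimates)) / sum(weights)

print(inverse_variance_mean([10.0, 14.0, 12.0], [1.0, 2.0, 4.0]))   # ~10.86, dominated by the most precise estimate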

14 of 20

Cadmium weighted averages

Guess   CI width   Equal weight   Precision (1/σ²)   1/σ
  40       10           1             0.01            0.1
  80       30           1             0.00111         0.0333
  70       40           1             0.000625        0.025
  60       20           1             0.0025          0.05
  60       30           1             0.00111         0.0333
  80       20           1             0.0025          0.05
  50       30           1             0.00111         0.0333
  70       20           1             0.0025          0.05
  55       20           1             0.0025          0.05
  60       30           1             0.00111         0.0333

(Here each guess’s 80% CI width is used as a stand-in for its σ.)

Weighted average (truth = 48):

  • Equal weights: 62.5
  • Precision (1/σ²): 55.21
  • 1/σ: 59.64
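A short sketch reproducing those weighted averages from the guesses and interval widths in the table (the variable names are mine; the data is the class’s):

# Sketch: three ways to weight the class's cadmium guesses (true atomic number: 48).
# Guesses and 80% CI widths are taken from the table above.

guesses = [40, 80, 70, 60, 60, 80, 50, 70, 55, 60]
widths  = [10, 30, 40, 20, 30, 20, 30, 20, 20, 30]

def weighted_mean(xs, ws):
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

equal     = weighted_mean(guesses, [1] * len(guesses))            # 62.5
precision = weighted_mean(guesses, [1 / w ** 2 for w in widths])  # ~55.21
inv_sigma = weighted_mean(guesses, [1 / w for w in widths])       # ~59.64

print(equal, precision, inv_sigma)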

15 of 20

Working in Teams: The Delphi Method

Delphi method:

  • Forecasters individually come up with predictions and reasoning
  • Then provide predictions + reasoning to group
  • Individuals update based on group forecast [potentially multiple rounds]
  • At end, take average of all of the final individual forecasts

Variants:

  • Predictions + reasoning provided anonymously
  • Only reasoning given (not numerical predictions)

Question. Why come up with numbers individually (rather than working collaboratively the whole time)?

16 of 20

Asch experiments (Wikipedia)

17 of 20

Ensembling with Yourself

What was the total annual budget of the US government in FY2022?

Come up with at least 3 distinct approaches to Fermi estimate this. Then decide how to combine the estimates.

18 of 20

Combining Confidence Intervals

What if instead of point estimates, we have 80% confidence intervals?

  • [a1, b1], [a2, b2], … (ai = lower end, bi = upper end)
  • Simplest approach: take trimmed mean of upper/lower ends
  • Alternatives:
    • Mixture of distributions (see the sketch below)
      • Variance of the mixture = average of the variances + variance of the means
      • Implies wider intervals when the means disagree
    • Treat as independent “measurements”
      • Implies narrowing width of intervals; need to be careful to avoid overconfidence
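A minimal sketch of the mixture approach, assuming each forecaster’s 80% interval is read as a normal distribution centered at its midpoint (the ±1.28σ conversion and the example intervals are assumptions, not from the slides):

# Sketch: combining 80% CIs as a mixture of distributions (law of total variance).
# Assumption: an 80% interval [a, b] is read as Normal(mean=(a+b)/2, sigma=(b-a)/(2*1.28)),
# since an 80% normal interval spans roughly +/- 1.28 standard deviations.
# The intervals below are made up for illustration.

import math

intervals = [(40, 60), (55, 95), (45, 75)]   # hypothetical 80% CIs from three forecasters

means  = [(a + b) / 2 for a, b in intervals]
sigmas = [(b - a) / (2 * 1.28) for a, b in intervals]

mix_mean  = sum(means) / len(means)
avg_var   = sum(s ** 2 for s in sigmas) / len(sigmas)               # average of the variances
var_means = sum((m - mix_mean) ** 2 for m in means) / len(means)    # variance of the means
mix_sigma = math.sqrt(avg_var + var_means)                          # wider when the means disagree

# Approximate 80% interval for the combined forecast (normal approximation to the mixture)
print(mix_mean - 1.28 * mix_sigma, mix_mean + 1.28 * mix_sigma)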

19 of 20

Combining Sums

What if we are predicting X + Y, and have confidence intervals for X and Y?

  • If expect errors to be independent, then std(X+Y) = sqrt(std(X)^2 + std(Y)^2)
  • If errors are perfectly correlated, then std(X+Y) = std(X) + std(Y)

For 70%/80% CIs, combining the interval widths with these standard-deviation rules is usually a decent approximation (see the sketch below).

For extreme tails (99% CIs), it can be more complicated:

  • If X and Y are heavy-tailed, the tail event typically comes from just one of X or Y individually
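A small sketch combining two 80% intervals for X and Y into one for X + Y under the two assumptions above (the intervals and the ±1.28σ normal conversion are illustrative assumptions):

# Sketch: 80% CI for X + Y from 80% CIs for X and Y.
# Assumption: an 80% normal interval spans roughly +/- 1.28 standard deviations.
# The intervals below are made up for illustration.

import math

x_lo, x_hi = 100, 140    # hypothetical 80% CI for X
y_lo, y_hi = 30, 70      # hypothetical 80% CI for Y

x_mean, x_sigma = (x_lo + x_hi) / 2, (x_hi - x_lo) / (2 * 1.28)
y_mean, y_sigma = (y_lo + y_hi) / 2, (y_hi - y_lo) / (2 * 1.28)

s_mean = x_mean + y_mean
s_sigma_indep = math.sqrt(x_sigma ** 2 + y_sigma ** 2)   # independent errors: add in quadrature
s_sigma_corr  = x_sigma + y_sigma                        # perfectly correlated errors: add directly

print("independent:", s_mean - 1.28 * s_sigma_indep, s_mean + 1.28 * s_sigma_indep)
print("correlated: ", s_mean - 1.28 * s_sigma_corr,  s_mean + 1.28 * s_sigma_corr)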

20 of 20

Summary

  • Averaging multiple approaches or experts often improves forecasts
  • Assess track record and accuracy of sources to determine weights
  • Consider working in teams and generating independent numbers
  • Combining confidence intervals: several ideas, no silver bullet (yet)