1 of 45

Summarizing Probabilistic Forecast Performance with Verification Diagrams

Montgomery Flora

2 of 45

Topics of Discussion

  • How do we verify probabilities (i.e., values between 0-1) against binary outcomes (i.e., only 0 or 1)?
  • What statistics are commonly used to verify probabilistic prediction of binary outcomes?
  • How can we summarize multiple statistics on a single diagram?
    • ROC Diagram
    • Performance / Precision-Recall Diagram
    • Reliability / Attributes Diagram

3 of 45

How do we verify probabilities against binary outcomes?

For binary outcomes (e.g., yes/no, event vs. non-event), we can issue binary forecasts (0 or 1) or frequencies between 0 and 1 that we can interpret as probabilities.

For the former, we can build a contingency table:

                    Outcome: Yes                             Outcome: No
Predicted: Yes      Hit / True Positive                      False Alarm / False Positive (Type I Error)
Predicted: No       Miss / False Negative (Type II Error)    Correct Negative / True Negative

Unfortunately, the nomenclature is inconsistent across different domains

4 of 45

How do we verify probabilities against binary outcomes?

For binary outcomes (e.g., yes/no, event vs. non-event), we can issue binary forecasts (0 or 1) or frequencies between 0 and 1 that we can interpret as probabilities.

For the former, we can build a contingency table:

                    Outcome: Yes                             Outcome: No
Predicted: Yes      Hit / True Positive                      False Alarm / False Positive (Type I Error)
Predicted: No       Miss / False Negative (Type II Error)    Correct Negative / True Negative

A perfect forecast only produces hits and correct negatives with no misses or false alarms.

In reality, forecasts will have misses and false alarms. The main goal in model development is to limit them or, in most cases, strike a fair balance between them, a decision that ought to be made in conjunction with the end user.

5 of 45

Contingency Table Statistics

From the contingency table values, we can compute several statistics to describe forecast performance:
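
As a minimal sketch (not from the slides), assuming the four counts from the table above, the most common statistics can be computed as follows; the function and variable names are purely illustrative:

```python
def contingency_stats(hits, misses, false_alarms, correct_negatives):
    """Common verification statistics from the 2x2 contingency table counts."""
    pod  = hits / (hits + misses)                              # probability of detection (hit rate / recall)
    pofd = false_alarms / (false_alarms + correct_negatives)   # probability of false detection
    sr   = hits / (hits + false_alarms)                        # success ratio (precision); FAR = 1 - SR
    csi  = hits / (hits + misses + false_alarms)               # critical success index (threat score)
    freq_bias = (hits + false_alarms) / (hits + misses)        # frequency bias
    return {"POD": pod, "POFD": pofd, "SR": sr, "CSI": csi, "FreqBias": freq_bias}

# Example with made-up counts
print(contingency_stats(hits=80, misses=20, false_alarms=30, correct_negatives=870))
```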

6 of 45

Nomenclature Issues

Unfortunately, the same statistics are referred to by different names in different disciplines (e.g., POD is known as recall or sensitivity in machine learning, and the success ratio is known as precision).

7 of 45

Converting Probabilities to Binary Predictions

 

By applying a probability threshold, we convert the probabilities into binary predictions, from which we can compute the contingency table metrics (the regions denoted in the figure).
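
A minimal sketch of that conversion, assuming NumPy arrays of forecast probabilities and 0/1 observations; the 0.5 threshold and the names are purely illustrative:

```python
import numpy as np

def contingency_counts(probs, obs, threshold=0.5):
    """Threshold probabilities into binary predictions and tally the 2x2 contingency table."""
    pred = probs >= threshold                      # probabilistic forecast -> binary forecast
    hits              = int(np.sum(pred  & (obs == 1)))
    false_alarms      = int(np.sum(pred  & (obs == 0)))
    misses            = int(np.sum(~pred & (obs == 1)))
    correct_negatives = int(np.sum(~pred & (obs == 0)))
    return hits, misses, false_alarms, correct_negatives
```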

 

 

 

 

8 of 45

Verification Diagrams

Attributes Diagram

How reliable are the probabilities?

Performance Diagram (PD)

How accurately do the probabilities predict events?

ROC Diagram

How well do the probabilities discriminate between events and non-events?

9 of 45

ROC (Discrimination) Diagram

How well do the probabilities discriminate between events and non-events (i.e., separate the red and green regions in the figure)?

10 of 45

ROC Diagram

How well do the probabilities discriminate between events and non-events (i.e., separate the red and green regions)?

What proportion of the observed yes region is a hit?


11 of 45

ROC Diagram

How well do the probabilities discriminate between events and non-events (i.e., separate the red and green regions)?

What proportion of the observed yes region is a hit?

What proportion of the observed no region is a false alarm?
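
In the usual contingency-table notation (hits H, misses M, false alarms F, correct negatives C), these two questions correspond to the probability of detection and the probability of false detection:

```latex
\mathrm{POD} = \frac{H}{H + M}, \qquad \mathrm{POFD} = \frac{F}{F + C}
```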

12 of 45

ROC Diagram

 

If the predicted probabilities discriminate well, POD ought to increase much faster than POFD as the probability threshold is lowered.

If POD = POFD for all thresholds, then the predicted probabilities have no discrimination ability

13 of 45

ROC Diagram

We can summarize the ROC diagram by the area under the resulting curve (AUC).

AUC = 0.5 : No discrimination ability

AUC closer to 1.0 : Increasing discrimination ability

AUC: the probability that, given a randomly chosen pair of one event and one non-event example, the prediction assigns the higher probability to the event
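
A minimal sketch of building the ROC curve and its AUC by sweeping probability thresholds, assuming NumPy arrays of probabilities and 0/1 observations that contain at least one event and one non-event (scikit-learn's `roc_auc_score(obs, probs)` computes the exact value directly):

```python
import numpy as np

def roc_curve_and_auc(probs, obs, thresholds=np.linspace(0, 1, 101)):
    """(POFD, POD) pairs at each threshold, plus the trapezoidal area under the curve."""
    pod, pofd = [], []
    for t in thresholds:
        pred   = probs >= t
        hits   = np.sum(pred  & (obs == 1))
        misses = np.sum(~pred & (obs == 1))
        fas    = np.sum(pred  & (obs == 0))
        cns    = np.sum(~pred & (obs == 0))
        pod.append(hits / (hits + misses))
        pofd.append(fas / (fas + cns))
    pod, pofd = np.asarray(pod), np.asarray(pofd)
    order = np.argsort(pofd)                 # integrate POD with respect to POFD
    auc = np.trapz(pod[order], pofd[order])
    return pofd, pod, auc
```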

14 of 45

ROC Diagram

AUC has two important properties:

  1. Scale-Invariant
  2. Skew-Invariant

15 of 45

ROC Diagram

AUC has two important properties:

  1. Scale-Invariant
  2. Skew-Invariant

AUC is insensitive to the absolute values of the forecast probabilities, and thus two prediction systems can produce the same AUC despite producing drastically different probabilities.

It is therefore crucial to couple AUC with a calibration/reliability-based diagram and associated metrics (e.g., Brier score, log-loss).

16 of 45

ROC Diagram

AUC has two important properties:

  1. Scale-Invariant
  2. Class Imbalance-Invariant

AUC is insensitive to the ratio of events to non-events; e.g., these two figures produce a similar AUC.

ISSUE: AUC is observation-centric, so it does not consider the relationship between false alarms and hits.

17 of 45

How do false alarms compare to hits?

How accurately do the probabilities predict events (i.e., hits vs. false alarms within the green region)?

 

What proportion of the forecasted yes region is a hit?

 

What proportion of the observed yes region is a hit?

18 of 45

Performance/Precision-Recall Diagram

 

 

Similar to the ROC Diagram, we plot POD and SR at a series of probability thresholds.

Unlike the ROC Diagram, other statistics are also drawn on the plot:

How well did the forecasted yes correspond to the observed yes?

Ratio of hits to false alarms
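
A minimal sketch of the quantities behind a performance diagram, again sweeping probability thresholds; the CSI and frequency-bias values computed here are the statistics commonly overlaid as contours and reference lines on such diagrams (an assumption about the plotting, not shown in the code):

```python
import numpy as np

def performance_curve(probs, obs, thresholds=np.linspace(0.05, 0.95, 19)):
    """Success ratio, POD, CSI, and frequency bias at each probability threshold."""
    rows = []
    for t in thresholds:
        pred   = probs >= t
        hits   = np.sum(pred  & (obs == 1))
        misses = np.sum(~pred & (obs == 1))
        fas    = np.sum(pred  & (obs == 0))
        sr   = hits / max(hits + fas, 1)             # success ratio (precision); 1 - FAR
        pod  = hits / max(hits + misses, 1)          # probability of detection (recall)
        csi  = hits / max(hits + misses + fas, 1)    # critical success index
        bias = (hits + fas) / max(hits + misses, 1)  # frequency bias
        rows.append((t, sr, pod, csi, bias))
    return np.array(rows)   # columns: threshold, SR, POD, CSI, frequency bias
```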

19 of 45

Variation on the Performance Diagram

Precision-Recall Diagram

With F1-score

 

20 of 45

Performance/Precision-Recall Diagram

In meteorology, we keep POD on the y-axis to be consistent with the ROC Diagram.

One caveat is that the minimal SR is a function of the base rate/skew.

To account for the base rate:

  1. Adjust the SR to some target skew (Lampert and Gançarski 2014)
  2. Normalize the unachievable area (Boyd et al. 2012)

Similar to the ROC Diagram, we can summarize the curve by a single statistic

21 of 45

Why is accounting for base rate crucial?

Let’s imagine we have two heavy rainfall datasets. The first one has a base rate of 50% and

the other a base rate of 10%

If we do not account for this base rate discrepancy, we are led to the false conclusion that our model performs better on the first dataset (50% base rate; blue curve) than on the second dataset (10% base rate; orange curve).

However, when we correct for the base rate, we can see that their performance is nearly identical (green curve vs. blue curve).

22 of 45

Why is accounting for base rate crucial?

This is an extreme example, but we can be led to false conclusions even when the base rate differences are only 1-3%.

E.g., if you are testing your tornado ML model in the real world and tornadoes are less frequent one year, you want to avoid concluding that your model's performance has worsened when you haven't accounted for the base rate discrepancy.

The difficulty is determining the appropriate target base rate for your problem.

23 of 45

How reliable are the predicted probabilities?

Do the probabilities correspond to long-term event frequencies?

Bin forecast probabilities and binary observations (e.g., every 10%) and compute the mean forecast probability and conditional event frequency per bin

If a forecast system is reliable, then the mean forecast probability equals the conditional event frequency for all bins (the dashed diagonal).
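
A minimal sketch of that binning step, assuming NumPy arrays of probabilities and 0/1 observations and ten equal-width probability bins:

```python
import numpy as np

def reliability_curve(probs, obs, n_bins=10):
    """Mean forecast probability and conditional event frequency in each probability bin."""
    edges = np.linspace(0, 1, n_bins + 1)
    # Bin index per forecast; clip so that a probability of exactly 1.0 stays in the last bin
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    mean_fcst, event_freq, counts = [], [], []
    for k in range(n_bins):
        in_bin = idx == k
        if not in_bin.any():
            continue                                # skip empty bins
        mean_fcst.append(probs[in_bin].mean())      # mean forecast probability
        event_freq.append(obs[in_bin].mean())       # conditional event frequency
        counts.append(int(in_bin.sum()))
    return np.array(mean_fcst), np.array(event_freq), np.array(counts)
```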

24 of 45

How reliable are the predicted probabilities?

However, the conditional event frequency is sensitive to the binning and thus we can compute its uncertainty (vertical bars)

(Bröcker and Smith 2007)

If the conditional event frequency falls within those bars, the forecast is “probably reliable”; if it falls outside, we can be confident it is unreliable.

25 of 45

What are the other components of the reliability diagram?

A large component of the reliability diagram is the Brier (skill) score.

 

26 of 45

What are the other components of the reliability diagram?

A large component of the reliability diagram is the Brier (skill) score.

 

27 of 45

What are the other components of the reliability diagram?

A large component of the reliability diagram is the Brier (skill) score.

 

Reliability : How similar are the forecast probability and conditional event frequency per bin? Can they actually be interpreted as probabilities?

28 of 45

What are the other components of the reliability diagram?

A large component of the reliability diagram is the Brier (skill) score.

 

Reliability : How similar are the forecast probability and conditional event frequency per bin? Can they actually be interpreted as probabilities?

Resolution : Are the forecast probabilities distinct from the base rate per bin? How distinct can our probabilities be?

29 of 45

What are the other components of the reliability diagram?

A large component of the reliability diagram is the Brier score.

 

Uncertainty : Based on base rate; not a component of forecast skill. For higher base rates, it is easier to be successful!

Reliability : How similar are the forecast probability and conditional event frequency per bin? Can they actually be interpreted as probabilities?

Resolution : Are the forecast probabilities distinct from the base rate per bin? How distinct can our probabilities be?
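
For reference, the standard decomposition of the Brier score into these three terms (Murphy 1973), with N forecasts, K bins, n_k forecasts in bin k, mean forecast probability \bar{f}_k, conditional event frequency \bar{o}_k, and base rate \bar{o}, is:

```latex
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(f_i - o_i)^2
            = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{f}_k - \bar{o}_k)^2}_{\text{reliability}}
            - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k - \bar{o})^2}_{\text{resolution}}
            + \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
```

Lower BS is better, so skill comes from small reliability (good calibration) and large resolution.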

30 of 45

What are the other components of the reliability diagram?

Often, we want to compare the BS of one system against another (create a skill score). For example, how does our ML model compare against always forecasting the base rate?

 

For skill scores, values > 0 (RES > REL) indicate the prediction is more skillful than always predicting the base rate, and vice versa for values less than zero.
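
With climatology (always forecasting the base rate) as the reference, the reference Brier score equals the uncertainty term, and the Brier skill score reduces to:

```latex
\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}}
             = \frac{\mathrm{RES} - \mathrm{REL}}{\mathrm{UNC}}
```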

31 of 45

What are the other components of the reliability diagram?

For a BSS = 0, RES = REL. If we equate the two terms, then we can determine what conditional event frequencies delineate positive and negative BSS

Curves falling below the no-skill line are less skillful than always predicting the base rate
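
Equating a bin's reliability and resolution contributions, and taking the branch between the diagonal and the climatology line, gives the familiar no-skill line of the attributes diagram:

```latex
(\bar{f}_k - \bar{o}_k)^2 = (\bar{o}_k - \bar{o})^2
\;\;\Longrightarrow\;\;
\bar{o}_k = \frac{\bar{f}_k + \bar{o}}{2}
```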

 

 

32 of 45

Object-based Verification

Montgomery Flora

33 of 45

Grid-based Verification

Typically, with gridded data, we compare corresponding grid points (e.g., mean squared error).

This approach is justifiable for continuous data (e.g., temperature, pressure, geopotential height)

What about discrete data (e.g., storms) or in cases where the forecast is slightly displaced from the observations?

34 of 45

What if our forecast is more discrete? How do we characterize the skill?

When the forecast problem is more discrete, describing average performance over the whole domain is less useful and possibly misleading.

In this example, we can see the forecast and observations are quite similar but have been slightly displaced; a displacement that might be operationally tolerable.

35 of 45

Double-Penalty Problem

Despite the forecast and observations looking nearly identical, the slight displacement causes the forecast to be doubly penalized (with both false alarms and misses).

This is an unduly negative assessment of forecast performance since, operationally, small spatial displacements are inevitable and acceptable

Forecast Minus Observation

36 of 45

First Approach: Coarsen/smooth the data to perform grid-based verification

One approach is to upscale the data and apply smoothing. This effectively evaluates the spatial scale at which the forecast is most skillful.

The drawback of this approach is that it is still grid-based and ignores potentially useful forecast information at the original grid scale.

37 of 45

Second Approach: Use image segmentation methods to identify discrete regions

Match forecast and observed objects within a user-defined distance

(Figure: identified forecast and observed objects)

38 of 45

Second Approach: Use image segmentation methods to identify discrete regions

Match forecast and observed objects within a user-defined distance

(Figure: forecast and observed objects, with matched and unmatched objects indicated)
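
A minimal, hypothetical sketch of this approach using simple thresholding to identify objects and greedy centroid matching; operational object identification and matching (e.g., for reflectivity or ML probability fields) is considerably more involved:

```python
import numpy as np
from scipy import ndimage

def match_objects(forecast, observed, intensity_threshold, max_dist):
    """Label contiguous regions above a threshold in each field, then greedily match
    forecast objects to observed objects whose centroids lie within max_dist grid points."""
    f_mask, o_mask = forecast >= intensity_threshold, observed >= intensity_threshold
    f_labels, n_f = ndimage.label(f_mask)
    o_labels, n_o = ndimage.label(o_mask)
    f_cent = ndimage.center_of_mass(f_mask, f_labels, np.arange(1, n_f + 1))
    o_cent = ndimage.center_of_mass(o_mask, o_labels, np.arange(1, n_o + 1))

    matches, used = [], set()
    for i, fc in enumerate(f_cent):
        # Centroid distance (in grid points) to each not-yet-matched observed object
        dists = [np.hypot(fc[0] - oc[0], fc[1] - oc[1]) if j not in used else np.inf
                 for j, oc in enumerate(o_cent)]
        if dists and np.min(dists) <= max_dist:
            j = int(np.argmin(dists))
            matches.append((i + 1, j + 1))   # matched pair of object labels (a "hit")
            used.add(j)
    # Unmatched forecast objects ~ false alarms; unmatched observed objects ~ misses
    return matches, n_f, n_o
```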

39 of 45

Object- vs. Grid-Based Verification Summary

  • One is not better than the other!
    • No single verification method adequately describes the different attributes of forecast performance and it is crucial to develop complementary measures.
    • Distinction between an event-based vs. grid-based framework

  • Grid-based verification helps determine the spatial scales at which the forecast is most useful.

  • Object-based methods do not alter the forecast information, allow for small spatial displacements, and introduce additional aspects of forecast quality
    • Morphology variables (area, major axis length, etc)
    • Characteristic spatial biases (e.g., forecasts are typically slower than obs and thus lag to the southeast)

40 of 45

Object-based Verification Challenges

  • Effectively identifying regions, especially if multiple scales are involved, is an incredibly difficult task.
    • Identifying fronts, storms, tropical cyclones, etc. each comes with its own set of difficulties
    • Object ID may not work across different data types (e.g., model vs. observations of reflectivity)
    • Open area of research
  • Object matching has its own difficulties
    • What is an appropriate matching distance? Should we explore more than one?
    • Can we match based on properties other than distance?

  • Object-based verification is event-focused and does not consider correct negatives (correctly predicting a non-event).

41 of 45

Taylor Diagrams

  • Centered RMSE
  • STD
  • Correlation Coeff
  • Bias
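
These four quantities are linked by two identities that give the Taylor diagram its geometry: the centered (bias-corrected) RMSE E' obeys a law-of-cosines relation with the two standard deviations and the correlation coefficient R, and the bias accounts for the remainder of the total RMSE:

```latex
E'^2 = \sigma_f^2 + \sigma_o^2 - 2\,\sigma_f\,\sigma_o R,
\qquad
\mathrm{RMSE}^2 = \big(\overline{f} - \overline{o}\big)^2 + E'^2
```

Here \sigma_f and \sigma_o are the forecast and observed standard deviations, and \overline{f} - \overline{o} is the bias.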

42 of 45

Taylor Diagrams

Standard deviation

43 of 45

Taylor Diagrams

Correlation

Coefficient

44 of 45

Taylor Diagrams

Bias-corrected RMSE

45 of 45

Taylor Diagrams

Bias