1 of 259

Applied Machine Learning for Earth Scientists

Elizabeth Barnes, Marybeth Arcodia, Charlotte Connolly, Frances Davenport, Zaibeth Carlo Frontera, Emily Gordon, Daniel Hueholt, Antonios Mamalakis, Elina Valkonen


2 of 259

Please cite these slides with the DOI provided by Zenodo.

3 of 259

Topics Covered


Basics of Machine Learning Applications to Earth Science

  • Machine Learning for Science
  • Foundational Concepts of ML
  • Decision Trees and Random Forests
  • Artificial Neural Networks (ANNs)
  • ANN Coding Examples
  • Advanced ANN Techniques

Visualization and Explainability of Machine Learning in Earth Science

  • Ethical Use of AI in Earth Science
  • Implementation and Assessment of an ANN
  • Methods of Explainable AI (XAI) for ANNs
  • Simple Uncertainty Quantification

4 of 259

Resources

  • Github: https://github.com/eabarnes1010/ml_tutorial_csu
    • Open-access code for ML examples included here and some other common ML applications
  • Additional ML Resources
    • Google doc with online tutorials, videos, books, papers, etc.


5 of 259

Organizers

Elizabeth Barnes: Professor

Marybeth Arcodia: Postdoc

    • Research: Understanding sources of subseasonal-to-seasonal (S2S) predictability using artificial neural networks
    • Email: marcodia@rams.colostate.edu

Antonios Mamalakis: Postdoc

    • Research: Dr. Mamalakis' research focuses on the application of machine learning (ML) and explainable AI (XAI) methods to climate problems, and on climate predictability and teleconnections.
    • Website: https://amamalak.wixsite.com/antonios
    • Email: amamalak@colostate.edu

Elina Valkonen: Postdoc

    • Research: Utilizing a "segmentation framework" to detect weather patterns in global climate models to help improve model evaluations, with a specific focus on Arctic regions.
    • Email: elina.valkonen@colostate.edu

Frances Davenport: Postdoc

    • Research: Using "transfer learning" to train neural networks with both climate model data and observations to make better real-world S2S predictions
    • Website: https://fdavenport.github.io
    • Email: f.davenport@colostate.edu


6 of 259

Organizers


Materials also sourced from Wei-Ting Hsiao, Jamin Rader, Aaron Hill, Imme Ebert-Uphoff, Kirsten Mayer, Ben Toms, Yoonjin Lee @ CSU

Emily Gordon: PhD Student

Charlie Connolly: MS Student

    • Research: Internal variability and climate change
    • Website: https://sites.google.com/view/connolly-climate/home
    • Email: cconn@rams.colostate.edu

Daniel Hueholt: MS Student

    • Research: I study potential methods to intervene in the Earth system in order to reduce risks and impacts from climate change.
    • Website: hueholt.earth
    • Email: dhueholt@rams.colostate.edu

Zaibeth Carlo Frontera: MS Student

    • Research: Hurricane number prediction with 2-3 forecast lead in the East Pacific using Random Forests.
    • Email: zaibeth.carlo-frontera@colostate.edu

7 of 259

What you will learn here...

  • Neural networks are not magic!
  • Basic concepts and terminology
  • Simple code to get started
  • Examples of topics being explored in the field


8 of 259

What you will not learn...

  • A perfect formula for choosing the ML method right for your application
  • How to choose/optimize specific parameters a priori (no one really knows anyway)
  • Other unsupervised learning methods (e.g. clustering)
  • All of the pitfalls associated with every choice


9 of 259

Machine Learning for Science


10 of 259

The "black box"


data → [black box] → prediction

11 of 259

We have a lot of data...

We are creating data faster than we can process it.


12 of 259

Our field has a big toolbox...


  • EOF analysis
  • Spectral analysis
  • Linear trend detection
  • Correlations
  • Dynamical model simulations

13 of 259

Machine Learning offers an additional set of tools for the job.


14 of 259

Machine Learning*


One of our jobs as scientists is to sift through piles of data and try to extract useful relationships that apply elsewhere, i.e. that are applicable "out of sample".

This is what many machine learning methods are designed to do.

15 of 259

Commercial Applications

Machine learning has made huge inroads for commercial applications

For example, by the early 2000s convolutional neural networks processed 10-20% of all checks in the U.S.

The concept of machine learning has been around since the early 1950s. The explosion in the past decade can be attributed to advances in training deep networks and to the explosion of available data.

Self-driving vehicles

Facial recognition

Natural language processing

https://blog.cloudsight.ai/chihuahua-or-muffin-1bdf02ec1680

Chihuahua or Blueberry Muffin?

https://towardsdatascience.com/teaching-cars-to-see-vehicle-detection-using-machine-learning-and-computer-vision-54628888079a

Visage Technologies Ltd, Creative Commons License, Wikimedia.org

https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

16 of 259

Commercial Applications

Machine learning has made huge inroads for commercial applications

For example, by the early 2000s convolutional neural networks processed 10-20% of all checks in the U.S.

The concept of machine learning has been around since the early 1950s. The explosion in the past decade can be attributed to advances in training deep networks and to the explosion of available data.

Takes a winter image and turns it into summer

17 of 259

Commercial Applications

Machine learning has made huge inroads for commercial applications

For example, by the early 2000s convolutional neural networks processed 10-20% of all checks in the U.S.

The concept of machine learning has been around since the early 1950s. The explosion in the past decade can be attributed to advances in training deep networks and to the explosion of available data.

It can fail spectacularly too!

Heaven, D., 2019: Why deep-learning AIs are so easy to fool. Nature, 574, 163-166.

18 of 259

Commercial Applications

Machine learning has made huge inroads for commercial applications

For example, by the early 2000s convolutional neural networks processed 10-20% of all checks in the U.S.

The concept of machine learning has been around since the early 1950s. The explosion in the past decade can be attributed to advances in training deep networks and to the explosion of available data.

It can fail spectacularly too!

Heaven, D., 2019: Why deep-learning AIs are so easy to fool. Nature, 574, 163-166.

19 of 259

Science!

Even with its "black box" persona, ML has already aided significant scientific advances

e.g., the area of bioinformatics has exploded in recent years due to machine learning and data mining

Distinguishing high-energy particles in the Large Hadron Collider

Gene prediction and sequencing

Predicting properties of solar flares

Identifying drug-drug interactions

By Lucas Taylor / CERN - http://cdsweb.cern.ch/record/628469, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1433671

20 of 259

Science!

Even with its "black box" persona, ML has already aided significant scientific advances

e.g., the area of bioinformatics has exploded in recent years due to machine learning and data mining

The number of science articles using supervised learning has grown rapidly in recent years.

However, AMS papers with supervised machine learning have lagged behind those across all of geoscience.

Maskey, M., H. Alemohammad, K. J. Murphy, and R. Ramachandran (2020), Advancing AI for Earth science: A data systems perspective, Eos, 101, https://doi.org/10.1029/2020EO151245

21 of 259

ML for Climate Science

The field's interest, and research, has exploded in the past ~3 years!

Applications of ML to atmospheric science date back as far as the 1960s!


(timeline of example studies: 1964, 1998, 2004)

22 of 259

ML for Climate Science

The field's interest, and research, has exploded in the past ~3 years!

Applications of ML to atmospheric science date back as far as the 1960s!

Range of applications:

  • global weather prediction
  • convective & radiative parameterizations
  • downscaling
  • extreme event detection
  • data reconstruction
  • weather prediction
  • processing of remote sensing data


Weather prediction: e.g. Gagne et al. (2019); Gagne et al. (2017); Chattopadhyay et al. (2019); Lagerquist et al. (2020); Weyn et al. (2020)

Convective parameterizations: e.g. Rasp et al. (2018, PNAS); Schneider et al. (2017, GRL); O'Gorman and Dwyer (2018); Beucler et al. (2020, PRL); Brenowitz and Bretherton (2018)

Extreme event detection: e.g. Reichstein et al. (2019)

Equation discovery: e.g. Zanna and Bolton (2020)

23 of 259

ML for Climate Science

The field's interest, and research, has exploded in the past ~3 years!

Applications of ML to atmospheric science date back as far as the 1960s!

Range of applications:

  • global weather prediction
  • convective & radiative parameterizations
  • downscaling
  • extreme event detection
  • data reconstruction
  • weather prediction
  • processing of remote sensing data


Commercial application: Infilling an image

NVIDIA Research: https://www.youtube.com/watch?v=gg0F5JjKmhA

24 of 259

ML for Climate Science

The field's interest, and research, has exploded in the past ~3 years!

Applications of ML to atmospheric science date back as far as the 1960s!

Range of applications:

  • global weather prediction
  • convective & radiative parameterizations
  • downscaling
  • extreme event detection
  • data reconstruction
  • weather prediction
  • processing of remote sensing data


Commercial application: Infilling an image

Kadow et al. (2020; Nature Geoscience)

Also see for other reconstruction applications DelSole and Nedza (2020)

NVIDIA Research: https://www.youtube.com/watch?v=gg0F5JjKmhA

Evidence of the reported 1877 El Niño

Climate Application: Temperature reconstruction (e.g. 1877)

25 of 259

ML for Climate Science

Communicating climate change is another area with great promise for ML

Groups are working on using deep learning to produce accurate and vivid renderings of the future outcomes of climate change


https://mila.quebec/en/ai-society/visualizing-climate-change/

26 of 259

Reasons to use ML for Science

  • Do it better
    • e.g. Convective parameterizations in models are not perfect, use ML to make them more accurate
  • Do it faster
    • e.g. Radiation code in models is very slow (but we know the right answer) - use ML methods to speed things up
  • Do something new
    • e.g. Go looking for non-linear relationships you didn’t know were there


Very relevant for research: may be slower and worse, but can still learn something

27 of 259

Example uses in Climate Science

  • Making climate models better
    • Parameterizations
    • Speed-up computations
  • Better predictions
    • Post-processing dynamical (physics-based) forecasts
    • Purely data-driven forecasts
  • Sources of predictability
    • process-based understanding
    • identifying dynamical model errors/biases
  • Feature identification
    • Counting clouds, labeling cloud types in images
    • Counting/identifying extreme events
  • Summarizing a lot of data
    • Dimension reduction
    • Cluster/group behaviour


28 of 259

Foundational Concepts of ML


29 of 259

What is Machine Learning?

Practically speaking:

Techniques for fitting data.

It is not restricted to a single expression or a set of simple mathematical expressions.

It can be made as complicated as needed by adding procedures, judgment statements, even randomness...


"Field of study that gives computers the ability to learn without being explicitly programmed."

As defined by Arthur Samuel (1959)

30 of 259

Types of machine learning

Unsupervised learning: looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision, e.g.

  • principal component analysis
  • clustering methods (k-means, self-organizing maps)
  • autoencoders


By I, Weston.pace, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2463085
https://www.vectorstock.com/royalty-free-vector/neural-network-vector-18470587

31 of 259

Types of machine learning

Unsupervised learning: looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision, e.g.

  • principal component analysis
  • clustering methods (k-means, self-organizing maps)
  • autoencoders

Supervised learning: maps an input to an output based on example input-output pairs, e.g.

  • regression
  • decision trees and random forests
  • support vector machines
  • artificial neural networks


There are other types, but these are the two main flavors we will discuss

By I, Weston.pace, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2463085
https://www.vectorstock.com/royalty-free-vector/neural-network-vector-18470587

32 of 259

How do you predict y from x?


x: sea surface temperature → y: convection occurrence

33 of 259

Linear regression model

Features: x.  Labels: y.

With the assumption (or, architecture) that the model is a 1-degree polynomial, the architecture is y = ax + b, and the model is the pair (a, b).

Loss Function: root-mean-square error (RMSE). We always need a metric to define how "good" a model is.

Optimization: find (a, b) that minimizes the RMSE; here, (a, b) = (0.5, 2).
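As a concrete illustration, a minimal sketch (assuming NumPy; synthetic data) of fitting this architecture by least squares, which for a fixed architecture is equivalent to minimizing the RMSE:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)                      # features
    y = 0.5 * x + 2 + rng.normal(0, 0.5, 100)        # labels with noise
    a, b = np.polyfit(x, y, deg=1)                   # best-fit (a, b) for y = ax + b
    rmse = np.sqrt(np.mean((y - (a * x + b)) ** 2))  # the loss we just minimized
    print(a, b, rmse)                                # roughly (0.5, 2)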

34 of 259

Nonlinear regression model


Features: x.  Labels: y.  (scatter plot of the data)

35 of 259

Nonlinear regression model


With the architecture that the model is a 2-degree polynomial: y = ax² + bx + c

Minimize the loss function (RMSE); we could find the model (a, b, c) = (-0.2, 2, 0).

Features: x.  Labels: y.

36 of 259

Classification-like model


Features: x.  Labels: A, B, C, D.

With this architecture, the model is:

    0.0 < x < 3.0 → A
    3.0 < x < 4.1 → B
    4.1 < x < 5.6 → C
    5.6 < x < 9.7 → D

37 of 259

Selection of architectures


FACT: Perfect learning is impossible (since not "all" the data is available).

Which model is right, model #1 or model #2? Neither.

Which architecture should be used? It depends.

To increase the possibility of finding good models (based on your purpose):

1) have a good-quality data set
2) wisely choose architectures according to data properties
3) wisely use your data

38 of 259

Machine Learning Approach


Selecting Predictors

Question: Will Marybeth catch the bus to campus?

All possible factors

Output: Catch the bus?

It’s complicated!

39 of 259

Machine Learning Approach

39

Selecting Predictors

Question: Will Marybeth catch the bus to campus?

Inputs/Predictors:

  • Time she woke up
  • Trash day
  • Weather conditions
  • Current record for the Yankees
  • The color of the sunrise
  • ...

Truth/Predictand:

  • Yes or No

All possible factors

Output: Catch the bus?

It’s complicated!

40 of 259

Machine Learning Approach


Selecting Predictors

Past Data: Will Marybeth catch the bus to campus?

Marybeth caught the bus to campus 108 of 150 times that it was sunny.

Marybeth never caught the bus when she woke up after 9:00am (sample size of 780 days).

On the one day the sunrise was neon green, Marybeth caught the bus!

Wild, right? Wouldn't anyone wake up early to watch a neon green sunrise?

All possible factors

Output: Catch the bus?

It’s complicated!

41 of 259

Machine Learning Approach


Selecting Predictors

Past Data: Will Marybeth catch the bus to campus?

Marybeth caught the bus to campus 108 of 150 times that it was sunny.

Marybeth never caught the bus when she woke up after 9:00am (sample size of 780 days).

On the one day the sunrise was neon green, Marybeth caught the bus!

All possible factors

Output: Catch the bus?

It’s complicated!

Should we use all possible factors in machine learning?

42 of 259

Machine Learning Approach


Selecting Predictors

Question: Will Marybeth catch the bus to campus?

Inputs/Predictors:

  • Time she woke up
  • Trash day
  • Weather conditions
  • Current record for the Yankees
  • The color of the sunrise
  • ...

Truth/Predictand:

  • Yes or No

ML Method

Predictors

Output: Catch the bus?

43 of 259

Machine Learning Approach


Data Splitting is a very important aspect of ML design

44 of 259

Machine Learning Approach


Data Splitting is a very important aspect of ML design

Data split into 3 parts: training, validation, testing

Why?

  • Reduce overfitting
  • Optimize hyperparameters

*Often need to standardize your data!

45 of 259

Machine Learning Approach


Training Data

Data subset used to fit the ML model; data the model uses to learn and optimize (i.e. minimize loss)

Validation Data

Data subset used to tune model hyperparameters via unbiased evaluation of model fit

Testing Data

Data subset used to evaluate the final ML model on data not seen before

46 of 259

How to Split Data


Random Splitting of Data

47 of 259

How to Split Data


Random Splitting of Data

Split full dataset into 80% training and 20% testing

Split training subset into 75% training and 25% validation

48 of 259

How to Split Data


Climate data is often autocorrelated, so we split data chronologically (see the sketch below)

Example: Full dataset from 1900-2000

  • Training: 1900-1980
  • Validation: 1981-1990
  • Testing: 1991-2000

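A minimal sketch of both splitting strategies (assuming NumPy and scikit-learn; the arrays are synthetic stand-ins):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(1000).reshape(-1, 1)   # stand-in predictors
    y = np.arange(1000)                  # stand-in predictands

    # Random splitting: 80/20, then 75/25 of the training subset (60/20/20 overall)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

    # Chronological splitting for autocorrelated climate data: slice, do not shuffle
    i1, i2 = int(0.8 * len(X)), int(0.9 * len(X))
    X_train, X_val, X_test = X[:i1], X[i1:i2], X[i2:]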

49 of 259

Machine Learning Approach


Groups of Data

ALL THE DATA → Training Data + Validation Data + Testing Data

(figure: training, validation, and testing sets with a fitted curve; the solution represents the training data and testing data well)

50 of 259

Machine Learning Approach


Groups of Data

ALL THE DATA → Training Data + Validation Data + Testing Data

Overfitting: fitting the training data so closely that the model will fail on unseen data of the same type.

(figure: a curve through every training point; this is a perfect model for the training data, but a very poor model for our testing data)

51 of 259

Machine Learning Approach

Training is an iterative process.

Learning happens by training on past data.

Each iteration (epoch) of training, the machine learning model should better fit the training data.

At the beginning of training, the machine learning model has no skill.


Training the Model

52 of 259

Machine Learning Approach

Training is an iterative process.

Learning happens by training on past data.

Each iteration (epoch) of training, the machine learning model should better fit the training data.

At the beginning of training, the machine learning model has no skill.


Training the Model

Interviewer: What's your biggest strength?

Me: Machine Learning.

Interviewer: What's 9 + 6?

Me: 0.

Interviewer: Incorrect. It's 15.

Me: It's 15.

Interviewer: What's 20 + 4?

Me: It's 15.

*** continues for 1000 epochs ***

53 of 259

Machine Learning Approach

Training is an iterative process.

Learning happens by training on past data.

Each iteration (epoch) of training, the machine learning model should better fit the training data.

At the beginning of training, the machine learning model has no skill.


Training the Model

Interviewer: What's your biggest strength?

Me: Machine Learning.

Interviewer: What's 9 + 6?

Me: 0.

Interviewer: Incorrect. It's 15.

Me: It's 15.

Interviewer: What's 20 + 4?

Me: It's 15.

*** continues for 1000 epochs ***

Initially no skill

Model updates

Based on past data

54 of 259

Machine Learning Approach

Training data is put into the machine learning model's learned function.

The model outputs its predictions. These predictions are compared with their truth values.

The function is updated to decrease the error between predicted and truth outputs.

This cycle continues.


Training the Model

x_i: training data (predictors) → g: learned function → ŷ_i: predicted outputs (predictands)

ŷ_i (predicted outputs) vs. y_i (truth outputs) → 1 epoch

55 of 259

Some terminology

WHAT WE CALL IT → WHAT ML CALLS IT

  • Dependent Variable/Right Answer → Label
  • Variable/Predictor → Feature
  • Slopes/Regression Coefficients → Weights
  • Y-intercepts/Constant factor → Bias


56 of 259

Summary


What is ML?

  • Machine learning is a technique to fit a function to data
  • It is not constrained by simple mathematical expressions

The Practical Procedure of ML

  • Select predictors intelligently
  • Split data into training, validation, and testing sets
  • The model learns by training iteratively to optimize a loss function

Vocab

  • Architecture: y = ax + b
  • Loss function: RMSE
  • Features/Labels: x / y
  • Overfitting: Fitting a model too closely to training data such that it does poorly with unseen data

57 of 259

Decision Trees and Random Forests

A brief overview


Image generated by Wombo AI

58 of 259

Meet Atlas!


59 of 259

Will Atlas want to play outside?


Playful!

or

Sleepy

60 of 259

Goal: design questions to classify events so predictions with new data are accurate


Observed data/events:

Will Atlas want to play outside?

Predictors:

Temperature, tiredness, boredom, hunger, fur length, snow, etc.

Output

Yes! Atlas is playful

No! Atlas is sleepy

Forest icon made by Freepik from flaticon.com

A decision tree is a series of questions: Is the temperature above 80? Is it snowy? Did he play yesterday?

61 of 259

Will Atlas want to play outside?


Is the temperature below 80°F?

    No → Sleepy
    Yes! → …

62 of 259

Will Atlas want to play outside?


Is the temperature below 80°F?

    No → Sleepy
    Yes! → Is it snowing?
        Yes! → Playful!
        No → …

63 of 259

Will Atlas want to play outside?


Is the temperature below 80°F?

    No → Sleepy
    Yes! → Is it snowing?
        Yes! → Playful!
        No → Did he play outside yesterday?
            No → Playful!
            Yes! → …

64 of 259

What is a decision tree?

An intuitive supervised learning method for classification (discrete) and regression (continuous variable) tasks

Ask a series of questions to discriminate labels (i.e., classifications)


Courtesy: Python Data Science Handbook

Increasing depth of tree →

65 of 259

Decision tree structure


(structure: root node, branches, branch nodes, leaves; tree depth)

Nodes are built based on features (predictors) that best describe the classes/labels

shapeofdata.wordpress.com/2013/07/09/random-forests/

66 of 259

Splitting and stopping nodes

Splitting a node: Child nodes are constructed from additional features that best separate the classes/labels through the depth of the tree

  • Use a "gain" metric to assess which feature is best for splitting
  • Maximize purity of node

Stopping at a leaf node: Once a stopping criterion is reached (e.g., number of samples at the node, purity), the branch is complete


67 of 259

There are three common gain metrics

Maximizing information gain = minimizing impurity (see the formulas below):

  • Gini
  • Entropy
  • Variance reduction (regression only)

Greedy algorithm: Make the locally optimal decision at each node
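For reference, the standard impurity measures behind these gain metrics, with $p_k$ the fraction of samples of class $k$ at a node:

$$\mathrm{Gini} = 1 - \sum_k p_k^2, \qquad \mathrm{Entropy} = -\sum_k p_k \log_2 p_k$$

A pure node (one class only) has Gini = 0 and entropy = 0; a split's information gain is the parent's impurity minus the weighted impurity of its children.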



68 of 259

Advantages of Decision Trees

  • Easy to understand (interpretable)
  • Categorical and continuous variables
  • Implicit feature selection
  • Fewer tunable parameters than many other ML methods


codeproject.com/Articles/4047359/Step-by-Step-Guide-To-Implement-Machine-Learning-2

69 of 259

Disadvantages of Decision Trees

  • Easy to overfit with complex trees
  • Greedy algorithm is not always globally optimal
  • Deterministic: will always produce the same tree for a given set of training data


70 of 259

Reduce overfitting and improve predictions

  • Tree depth
  • Minimum number of samples to split
  • Minimum number of samples for leaf
  • Training dataset length and selection
  • K-fold cross-validation
  • Ensemble methods: random forest!


(figure: k-fold cross-validation with contiguous chunks of training and validation data)

71 of 259

Random forests are decision tree ensembles!


Forest icon made by Freepik from flaticon.com

Observed data/events:

Other meteorological data

Predictors:

Temperature, humidity, insolation, wind speed, wind direction, precipitation, date of year, etc.

Output

Ozone class (good, fair, poor)

Each tree in the forest is a series of questions: Is the temperature above 80? Is the wind from the west? Is it raining?

Relative frequency of a label across all trees describes probabilistic occurrence (i.e. forecast)

72 of 259

What is a random forest?

  • An ensemble of unique decision trees that become uncorrelated by selecting random subsets of features at each node
  • Bagging instead of boosting
  • Random sampling of training data
  • Aggregated decision trees can significantly decrease the prediction variance with small increases to model bias (Herman and Schumacher 2016)


medium.com/@shvbajpai/how-to-master-python-for-machine-learning-from-scratch-a-step-by-step-tutorial-8c6569895cb0

73 of 259

Random Forest Development

Overfitting can still occur, particularly if a small subset of features is highly predictive

Bagging (bootstrap aggregating):

  1. Sample with replacement n training examples for each tree, B times
  2. Use a random subset of m features at each decision node

This yields unique decision trees that are then uncorrelated.


(figure: data → subsamples B1, B2, ..., BN → one decision tree per subsample)

74 of 259

Random Forest Tunable Parameters

  • Number of trees: smaller forests may lead to poor performance, but performance asymptotes to a limit as trees are added
  • Tree depth: deeper trees can fit more complex behavior, but can lead to overfitting
  • Leaf samples: a minimum number of samples per leaf can help prevent overfitting

This is not an exhaustive list but some of the most notable in our experience!
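A minimal scikit-learn sketch showing where these parameters enter (the data here are synthetic stand-ins, not the ozone example):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=200,     # number of trees
        max_depth=8,          # tree depth
        min_samples_leaf=5,   # leaf samples
        random_state=0,
    ).fit(X_train, y_train)

    print(rf.score(X_test, y_test))       # accuracy on held-out data
    print(rf.predict_proba(X_test[:3]))   # label frequencies across trees = probabilistic forecast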


75 of 259

Why are RFs good for the job?

  • Probabilistic prediction versus deterministic
  • Maintain the same intuitive structure for users as decision trees
  • Relatively simple to implement and tweak (few tunable parameters compared to other ML techniques)
  • Computationally cheap


76 of 259

Where are they used?


77 of 259

Where are they used?


78 of 259

Where are they used?


DOI: 10.1175/MWR-D-17-0250.1

79 of 259

Now we will play with sample code!

Estimate ozone air quality class at Joshua Tree National Park using meteorological data

Try to beat the baseline! Validation weighted categorical accuracy = 0.537

Link: rf_ozone_joshuatree


80 of 259

Artificial Neural Networks (ANN)

"It's all connected"


81 of 259

ANNs allow us to utilize a lot of information easily


data/input → ANN (a nonlinear function?) → prediction/output

82 of 259

So how does an ANN work?

An ANN consists of multiple nodes (the circles) and layers that are all connected; using basic math, it gives out a result.

These are called feed-forward networks!


https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6

83 of 259

So what happens in the hidden layers?

In each individual node, the values coming in are weighted and summed together, a bias term is added, and an activation function is applied.

This is linear regression with a nonlinear mapping by an "activation function" (see the sketch below).


https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
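A minimal NumPy sketch of one forward pass through a single hidden layer (the weights here are arbitrary placeholders, not trained values):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.2, -1.0, 0.5])                # three input features
    W1, b1 = 0.1 * np.ones((4, 3)), np.zeros(4)   # hidden layer: 4 nodes
    W2, b2 = 0.1 * np.ones((1, 4)), np.zeros(1)   # output layer: 1 node

    h = sigmoid(W1 @ x + b1)   # weight, sum, add bias, then activate
    y_hat = W2 @ h + b2        # linear output (e.g., for regression)
    print(y_hat)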

84 of 259

Activation function?

The activation function determines whether information moves forward from that specific node.

This is the step that allows for nonlinearity in these algorithms, without activation all we would be doing is linear algebra!


85 of 259

So how do we train a network?

Training the network is merely determining the weights "w" and biases/offsets "b", with the addition of the nonlinear activation function.


Goal is to determine the best function so that the output is as correct as possible; typically involves choosing "weights"

86 of 259

How do we define the "best" function?

This is where the domain scientist can be very helpful!

You know the data and the goal you're working towards, so you know best which loss function to use.

Basic MSE or MAE works well for regression tasks!


87 of 259

Cost/Loss Function

Find a quantity you want to minimize (the β€œloss”) to help determine the weights


loss/cost = mismatch between the ANN prediction and the truth/actual values

88 of 259

How do we find the minima in real life?

Let's start with an easy linear example


An example loss function (MSE)

89 of 259

Gradient Descent


When hiking in Colorado, if the path up isn't clear, you choose the steepest incline and you will find the top. Gradient descent is the same principle, but reversed.

An example loss function (MSE)

Gradient descent is a technique to find the weights that minimize the loss function.

This is done by starting at a random point and calculating the gradient (the black lines) at that point. The negative of that gradient is then followed to the next point, and so on. This is repeated until the minimum is reached.

https://www.ibm.com/cloud/learn/gradient-descent

90 of 259

Gradient descent mathematically


So, let's consider our example loss function J. The gradient-descent formula tells us that the next location depends on the negative gradient of J multiplied by the learning rate λ: w_(k+1) = w_k - λ∇J(w_k).

As the RMSE (our loss function) depends on the linear function and its weights a0 and a1, the gradient is calculated as partial derivatives with respect to the weights.

https://www.ibm.com/cloud/learn/gradient-descent

91 of 259

Partial derivatives for RMSE (based on linear regression)
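For the linear model y = a1·x + a0, using the mean-square error J (which has the same minimizer as the RMSE), the partial derivatives are:

$$J(a_0, a_1) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - (a_1 x_i + a_0)\big)^2$$

$$\frac{\partial J}{\partial a_0} = -\frac{2}{N}\sum_i \big(y_i - (a_1 x_i + a_0)\big), \qquad \frac{\partial J}{\partial a_1} = -\frac{2}{N}\sum_i x_i \big(y_i - (a_1 x_i + a_0)\big)$$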


https://www.ibm.com/cloud/learn/gradient-descent

92 of 259

Gradient descent mathematically

After taking the derivatives, the rest is easy!


The only other thing one must pay attention to is the learning rate λ (how big of a step to take). Too small, and finding the right weights takes forever; too big, and you might miss the minimum.

https://www.ibm.com/cloud/learn/gradient-descent

93 of 259

Code it up!
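A minimal sketch of this gradient-descent loop in NumPy (synthetic data; the learning rate and epoch count are illustrative, not the notebook's exact settings):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, 200)
    y = 0.5 * x + 2 + rng.normal(0, 0.1, 200)

    a0, a1, lam = 0.0, 0.0, 0.1     # intercept, slope, learning rate
    for epoch in range(400):        # iterations through the data
        resid = y - (a1 * x + a0)
        a0 -= lam * (-2 * resid.mean())        # a0 - lam * dJ/da0
        a1 -= lam * (-2 * (x * resid).mean())  # a1 - lam * dJ/da1
    print(a0, a1)   # approaches (2, 0.5); the fit gets better with the number of epochs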


94 of 259

Code it up!


# of iterations through the data

fit gets better with number of epochs

95 of 259

Code it up!


# of iterations through the data

You can do this for polynomials too!

fit gets better with number of epochs

96 of 259

So far so good - this all looks super easy! What's the catch?

We calculated the gradients by hand because we knew the functional form we wanted to fit (a polynomial).


97 of 259

So far so good - this all looks super easy! What's the catch?

We calculated the gradients by hand because we knew the functional form we wanted to fit (a polynomial). But the whole point of an ANN is that you do not need to assume the optimal functional form of the equation ahead of time - it's probably really complex!

So what do we do now...?


98 of 259

We train the model!

  • Step 1 - Forward Pass: with the weights/biases frozen, the ANN ingests the input and makes a prediction
  • Step 2 - Calculate Error/Loss: compute the error in the ANN's prediction
  • Step 3 - Backward Pass: update the weights via gradient descent and backpropagation to move downgradient
  • Rinse and repeat


99 of 259

Backpropagation

  • Short for "back propagation of errors"
  • A method for determining the gradient of the loss function when you don't know the functional form
  • The method calculates the gradient of the loss function with respect to the neural network's weights.
  • This is why only certain activation functions are allowed: they must be differentiable!


What we need to know!

The partial derivative of the PREDICTED y with respect to the weights w.

100 of 259

Backpropagation


computational graph

Relies on chain rule
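For a single neuron $\hat{y} = g(z)$ with $z = wx + b$ and loss $J(\hat{y}, y)$, the chain rule gives the gradient that backpropagation needs:

$$\frac{\partial J}{\partial w} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = \frac{\partial J}{\partial \hat{y}}\, g'(z)\, x, \qquad \frac{\partial J}{\partial b} = \frac{\partial J}{\partial \hat{y}}\, g'(z)$$

Deeper networks just chain more of these factors together, one per layer, which is why the activation g must be differentiable.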

101 of 259

Backpropagation

"by hand"


We encourage practicing backpropagation with pen and paper to help with visualizing the "black box" of ML calculations


102 of 259

You hopefully never have to do that again!

Hurray for computers!


103 of 259

Great! Now you know how an ANN works!

Well...

Next we will talk about ANN architecture and decisions related to how to design an ANN


104 of 259

Some terminology

WHAT ML CALLS IT

Number of samples: number of individual realizations you have

Batch size: number of samples in a "chunk" of training data that is iterated before updating the weights and biases

Epoch: number of times you train on all batches (number of times you go through your entire training set)

Deep Learning: training an ANN with two or more layers


105 of 259

Choices you need to make...

Feed-forward ANNs require choices by the user:

  • Architecture (number of nodes and layers)
  • Gradient descent algorithm
  • Learning Rate
  • Activation Function
  • Batch size
  • Number of epochs
  • Choice of regularization and associated parameters
  • ...


Lions, and Tigers and Parameters... Oh My!

While this may seem like a lot of parameters, we deal with parameter choices every day...

  • How to define the seasonal cycle
  • What convective scheme to use
  • Width of a moving average

106 of 259

ANN considerations


Train on the data the right number of times (epochs).

If we iterate too few times, our model will not have time to find the optimal fit. If we iterate too many times, we will overfit the training data.

Choosing the right number of layers and nodes for the given job is crucial.

Fewer layers/nodes allow for easier interpretability. More layers/nodes allow for more nonlinearity.

Choose an activation function most appropriate for your solution.

ReLU is a popular option.

107 of 259

Network Architectures

The way we put the ingredients, or pieces, together is called our network "architecture".

  • Input units/nodes/neurons
  • Output units/nodes/neurons
  • Weights
  • Bias
  • Activation function
  • Layers


Regression Problem

108 of 259

Network Architectures

The way we put the ingredients, or pieces, together is called our network "architecture".

  • Input units/nodes/neurons
  • Output units/nodes/neurons
  • Weights
  • Bias
  • Activation function
  • Layers


Classification Problem

https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax

109 of 259

Softmax: Let's make the output a probability!


https://vitalflux.com/what-softmax-function-why-needed-machine-learning/

  • Converts "raw" output to probabilities for classification tasks

  • After the softmax, output values are between 0 and 1

  • The class with the highest probability gives your answer (see the sketch below)
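A minimal, numerically stable softmax in NumPy:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())   # subtract the max for numerical stability
        return e / e.sum()

    raw = np.array([2.0, 1.0, 0.1])   # "raw" network outputs (logits)
    probs = softmax(raw)
    print(probs, probs.sum())          # values between 0 and 1 that sum to 1
    print(probs.argmax())              # the predicted class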

110 of 259

Gradient Descent in ML, Learning rate

Identify the right learning rate.

A learning rate too large may cause the solution to skip between local minima.

A learning rate too small may get stuck in one local minimum, and may also take longer to learn.


https://www.jeremyjordan.me/nn-learning-rate/

111 of 259

Gradient Descent in ML, Learning rate

Identify the right learning rate.

A learning rate too large may cause the solution to skip between local minima.

A learning rate too small may get stuck in one local minimum, and may also take longer to learn.


https://cs231n.github.io/neural-networks-3/

112 of 259

Gradient Descent in ML, choices

  • Stochastic Gradient Descent (SGD): determine the gradient estimate from a small mini-batch of the data to speed up computation
  • Many choices of exactly how to do this exist
  • This is something the user must choose (a "parameter" of the model)


Figure by Alec Radford

113 of 259

Gradient Descent in ML, choices

  • Utilizing Stochastic Gradient Descent helps get close to the global minimum instead of getting stuck in a local one
  • It is also faster than Batch Gradient Descent


https://www.ibm.com/cloud/learn/gradient-descent

114 of 259

Let's Talk About Overfitting...

An ANN often has thousands (or millions) of weights to adjust


(figure: inputs → hidden layer 1 → hidden layer 2 → two outputs; each connection = a weight that must be "learned")

115 of 259

Let's Talk About Overfitting...

An ANN often has thousands (or millions) of weights to adjust

This often leads to egregious overfitting

Two common remedies:

  • L1 & L2 Regularization
  • Drop-out


The black line fits the data well, whereas the green one is overfitting.

https://elitedatascience.com/overfitting-in-machine-learning

116 of 259

Regularization:

L2 aka Ridge Regression

A regression tool to help prevent overfitting.

Adds a term to the loss/error function that penalizes the weights if they are too large.

Keeps the weights small.


Ridge Regression ("L2 Regularization")

Shrinkage term: forces the individual coefficients to be small >> sharing of weight across coefficients

Ridge parameter: how much to penalize large weights; will force coefficients to share importance
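Written out, the ridge loss is the ordinary error plus the shrinkage term, with the ridge parameter $\alpha$ setting the penalty strength:

$$J_{\mathrm{ridge}} = \sum_i \big(y_i - \hat{y}_i\big)^2 + \alpha \sum_j w_j^2$$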

117 of 259

Regularization:

L1 aka LASSO Regression

A regression tool to help prevent overfitting.

Adds a term to the loss/error function that penalizes the weights if they are too large.

Keeps the weights small.


LASSO Regression ("L1 Regularization"; least absolute shrinkage and selection operator)

Shrinkage term: punishes high values but actually sets them to exactly zero if not important >> helpful for determining which features are most important

LASSO parameter: how much to penalize large weights; will force some coefficients to zero
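The LASSO loss swaps the squared penalty for an absolute-value penalty, which is what drives unimportant coefficients to exactly zero:

$$J_{\mathrm{LASSO}} = \sum_i \big(y_i - \hat{y}_i\big)^2 + \alpha \sum_j |w_j|$$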


118 of 259

Dropout

During training (and training only!), randomly remove or "drop out" nodes, requiring that the ANN learn to still make accurate predictions in spite of losing these nodes. Dropout has the effect of forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs (see the sketch below).


  • requires the user to choose the percent to drop out (often ~50-80%)
  • requires more epochs during training
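A minimal Keras sketch of where dropout layers sit (layer sizes and the 0.5 rate are illustrative; in Keras the rate is the fraction of nodes dropped, and the layers are only active during training):

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(10,)),
        layers.Dropout(0.5),      # active during training only
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")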

119 of 259

At last, testing

Training an ANN is all about updating the weights and biases.

This is done iteratively by comparing to the "truth".

Once this is done and you are happy with the performance of your model, it's time for testing!

Testing then involves freezing the weights, and using the ANN as a predictor function.


120 of 259

ANN Coding Examples

Two examples


121 of 259

Predicting a Sine Wave with an ANN


We are going to train an ANN to predict a sine wave

  • Input x between -1 and 1
  • Output sin(x)

The function we are predicting is the solid line, and the points we use are indicated in purple

122 of 259

Predicting a Sine Wave with an ANN


Now open the code in Google Colab, and find the cell titled Set some neural network parameters

123 of 259

Predicting a Sine Wave with an ANN


Now open the code in Google Colab, and find the cell titled Set some neural network parameters

This is where we set our architecture and hyper-parameters. To start with, enter the following values

lr = 0.01

batch_size = 32

n_epochs = 400

activation = 'sigmoid'

hiddens = [10,100]

loss = 'mae'
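A minimal sketch of how these settings might translate into a Keras model (the tutorial notebook's actual code and optimizer choice may differ; x_train, y_train, etc. come from the notebook):

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Dense(10, activation="sigmoid", input_shape=(1,)),   # hiddens = [10, 100]
        layers.Dense(100, activation="sigmoid"),
        layers.Dense(1),                                            # predict sin(x)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mae")
    # history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
    #                     batch_size=32, epochs=400)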

124 of 259

Predicting a Sine Wave with an ANN


Run the code!

125 of 259

Predicting a Sine Wave with an ANN


How does the model do?

  • On the left we plot the model loss during training, i.e. the loss at the end of each epoch

126 of 259

Predicting a Sine Wave with an ANN


How does the model do?

  • On the left we plot the model loss during training, i.e. the loss at the end of each epoch
  • Model finds a local minimum
  • Model finds a better local minimum

127 of 259

Predicting a Sine Wave with an ANN


How does the model do?

  • On the right we plot the network's prediction of the testing data
  • Network is able to fit this unseen data

128 of 259

Predicting a Sine Wave with an ANN


Play with the network parameters

  • Navigate to the cell titled Set some neural network parameters
  • You can adjust anything in this cell and then run the code to see how it affects the network
  • Suggestions:
    • Change activation to 'relu'
    • Increase batch size up to 256
    • Decrease learning rate to 0.001 (or even smaller)

129 of 259

Predicting a Sine Wave with an ANN


A fun example of the effect of auto-correlation between the training, validation, and testing sets

  • In the previous example, we randomly grabbed the training, validation and testing sets from the data

130 of 259

Predicting a Sine Wave with an ANN


A fun example of the effect of auto-correlation between the training, validation, and testing sets

  • In the previous example, we randomly grabbed the training, validation and testing sets from the data

  • What happens if we change this so the data is picked in chunks?

131 of 259

Predicting a Sine Wave with an ANN


A fun example of the effect of auto-correlation between the training, validation, and testing sets

Change this so that

randomselect = 0

132 of 259

Predicting a Sine Wave with an ANN


A fun example of the effect of auto-correlation between the training, validation, and testing sets

  • Now run the notebook again with the original parameters

lr = 0.01

batch_size = 32

n_epochs = 400

activation = 'sigmoid'

hiddens = [10,100]

loss = 'mae'

133 of 259

Predicting a Sine Wave with an ANN


A fun example of the effect of auto-correlation between the training, validation, and testing sets

  • Now run the notebook again with the original parameters

lr = 0.01

batch_size = 32

n_epochs = 400

activation = 'sigmoid'

hiddens = [10,100]

loss = 'mae'

134 of 259

Predicting a Sine Wave with an ANN


A fun example of the effect of auto-correlation between the training, validation, and testing sets

  • Now run the notebook again with the original parameters
  • Change activation to 'relu'

lr = 0.01

batch_size = 32

n_epochs = 400

activation = 'relu'

hiddens = [10,100]

loss = 'mae'

135 of 259

Predicting a Sine Wave with an ANN


A fun example of the effect of auto-correlation between the training, validation, and testing sets

  • Now run the notebook again with the original parameters
  • Change activation to 'relu'

lr = 0.01

batch_size = 32

n_epochs = 400

activation = 'relu'

hiddens = [10,100]

loss = 'mae'

136 of 259

Predicting ENSO with an ANN

Another example we know the answer to.


137 of 259

Defining ENSO...


Toms et al., JAMES (2020)

Figure courtesy of NCAR Climate Data Guide

Niño 3.4 index

ENSO is commonly defined according to average sea-surface temperatures within the central tropical Pacific.

138 of 259

ENSO + Neural Networks


Toms et al., JAMES (2020)

139 of 259

Predicting ENSO with an ANN


Include only samples where a Niño/Niña event is occurring (i.e. Niño3.4 > 0.5 or Niño3.4 < -0.5)

Split into train, validation, and test sets by date:

140 of 259

Predicting ENSO with an ANN


Now open the code in Google Colab, and find the cell titled Set some neural network parameters

Once you have set the parameters, run the code!

To start with, use the following values:

hiddens = [12, 12]

ridgepen = 1

lr = 1e-3

n_epochs = 20

batch_size = 32

activation = 'relu'

loss = 'categorical_crossentropy'
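A minimal sketch of how these settings might map to Keras (assumptions: the notebook's real input size and class count may differ; n_features is a hypothetical flattened-map size, and ridgepen is taken to enter as an L2 kernel regularizer):

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    n_features = 1440   # hypothetical: a flattened SST map
    model = tf.keras.Sequential([
        layers.Dense(12, activation="relu", input_shape=(n_features,),
                     kernel_regularizer=regularizers.l2(1.0)),   # ridgepen = 1
        layers.Dense(12, activation="relu"),
        layers.Dense(2, activation="softmax"),                   # e.g., El Niño vs. La Niña
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])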

141 of 259

Predicting ENSO with an ANN


How does the model do?

  • Left: loss during training
    • decreases over time - good!

  • Right: accuracy during training
    • similar accuracy for training and validation data - good!

142 of 259

Predicting ENSO with an ANN


How does the model do on test data?

143 of 259

Predicting ENSO with an ANN


How does the model do on test data?

Confusion matrix:

144 of 259

Predicting ENSO with an ANN


How does the model do on test data?

145 of 259

Predicting ENSO with an ANN


Now go back to the neural network parameters.

Re-train the network using different parameters.

How do the predictions change?

146 of 259

Advanced ANN Techniques


147 of 259

Convolutional NN

Cat or Dog?


Convolutional Layer

Benefits:

  1. preserve spatial info
  2. fewer weights & biases (train faster and lower chance of overfitting)

148 of 259

Convolutional NN

Cat or Dog?


It has cat ears - it's a cat

Convolutional Layer

149 of 259

Convolutional NN

Used for image detection

    • Neurons/Nodes are now 2D matrices i.e. FILTERS
    • These FILTERS detect patterns in the input image

Initial training:

  • Edges, corners, shapes (circles, squares)

More training:

  • Objects (ears, eyes)

Even more training:

  • Complex objects (cats, dogs)


150 of 259

Filters


151 of 259

Filters


152 of 259

Convolving


153 of 259

Convolving


154 of 259

CNN layer for a single multi-channel kernel

Perform convolution across channels

Sum across channels

Add bias term
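A minimal NumPy sketch of exactly those three steps for one multi-channel kernel (no padding or stride, and using cross-correlation as CNN layers do):

    import numpy as np

    def conv_single_kernel(x, k, b):
        """x: (H, W, C) input; k: (kh, kw, C) kernel; b: scalar bias."""
        H, W, C = x.shape
        kh, kw, _ = k.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # convolve across channels, sum across channels, add bias
                out[i, j] = np.sum(x[i:i + kh, j:j + kw, :] * k) + b
        return out

    x = np.random.rand(5, 5, 3)   # e.g., a 5x5 image with 3 channels
    k = np.random.rand(3, 3, 3)   # one 3x3 kernel spanning all 3 channels
    print(conv_single_kernel(x, k, 0.1).shape)   # (3, 3) feature map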

155 of 259

Zero padding in action

156 of 259

Connecting CNNs to Fully Connected Networks


Example courtesy of Imme Ebert-Uphoff

157 of 259

Pooling Layer

  • Reduce spatial size
  • Noise suppressant
  • Max Pooling/Average/Sum Pooling


158 of 259

Seasonal ENSO prediction


Task: Predict the Niño3.4 index __ months into the future using maps of the ocean state

Figure from Ham et al. (2019)

159 of 259

Advantages of CNNs

  • Easily retains the spatial relationships between adjacent pixels
    • CNNs are very common for machine learning tasks that use images as inputs
  • Fewer weights and biases, which speeds up training and reduces overfitting
    • The network only has to learn a small set of kernels and then apply them to every part of the image (rather than different weights/kernels for each part)


160 of 259

Satellite Image -> Synthetic Radar Image

https://journals.ametsoc.org/view/journals/apme/60/1/jamc-d-20-0084.1.xml

Kyle Hilburn

[CSU/CIRA]

161 of 259

U-Net Architecture

https://journals.ametsoc.org/view/journals/apme/60/1/jamc-d-20-0084.1.xml

Kyle Hilburn

[CSU/CIRA]

162 of 259

Inputs + Labels

https://journals.ametsoc.org/view/journals/apme/60/1/jamc-d-20-0084.1.xml

Kyle Hilburn

[CSU/CIRA]

163 of 259

Inputs + Labels

https://journals.ametsoc.org/view/journals/apme/60/1/jamc-d-20-0084.1.xml

Kyle Hilburn

[CSU/CIRA]

164 of 259

Results!

https://journals.ametsoc.org/view/journals/apme/60/1/jamc-d-20-0084.1.xml

Kyle Hilburn

[CSU/CIRA]

  • MRMS = radar observations (truth)
  • GREMLIN = CNN output (prediction)

165 of 259

Image Segmentation with a CNN

Before training...

After training...

  • Class 1: Pixel belonging to the pet.
  • Class 2: Pixel bordering the pet.
  • Class 3: None of the above/a surrounding pixel.

166 of 259

ClimateNet - using experts to label

167 of 259

ClimateNet - using experts to label

https://gmd.copernicus.org/articles/14/107/2021/

168 of 259

Image Segmentation w/ ClimateNet

https://gmd.copernicus.org/articles/14/107/2021/

169 of 259

Image Segmentation w/ ClimateNet

https://gmd.copernicus.org/articles/14/107/2021/

Truth (Labeled)

Predicted by a Trained CNN (trained on ClimateNet)

170 of 259

Autoencoders


https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798

Even more advanced techniques

171 of 259

Autoencoders


https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798

Even more advanced techniques

If your activation function is linear, this becomes the PCs of the leading EOFs
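A minimal Keras sketch (sizes are illustrative): with all-linear activations, the 2-node bottleneck learns the same subspace spanned by the two leading EOFs/PCs.

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = tf.keras.Input(shape=(100,))
    code = layers.Dense(2, activation="linear")(inputs)       # bottleneck
    outputs = layers.Dense(100, activation="linear")(code)    # reconstruction
    autoencoder = tf.keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(X, X, epochs=50)   # trained to reproduce its own input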

172 of 259

Generative Adversarial Networks


Figure by Atienza, Rowel. Advanced Deep Learning with Keras: Apply deep learning techniques, autoencoders, GANs, variational autoencoders, deep reinforcement learning, policy gradients, and more. Packt Publishing Ltd, 2018.

Even more advanced techniques

  • First introduced by Goodfellow et al. (2014)
  • Competition between generator and discriminator

173 of 259

Visualizing Climate Change & Associated Hazards

174 of 259

StyleGAN generated these faces. They are not real.

175 of 259

Cycle GAN

176 of 259

Cycle GAN

177 of 259

GANs for Nowcasting

Fig. 1: Model overview and case study of performance on a challenging precipitation event starting 24 June 2019 at 16:15 UK time, showing convective cells over eastern Scotland.

178 of 259

Instructions for installing TensorFlow on an Apple M1 Chip


https://makeoptim.com/en/deep-learning/tensorflow-metal

https://developer.apple.com/metal/tensorflow-plugin/

https://github.com/apple/tensorflow_macos

https://github.com/conda-forge/miniforge

In a nutshell:

  • Install Xcode and Command Line Tools
  • Install miniforge (Anaconda does not run on the M1)
  • Then, open a terminal and do the following:
    • conda create --name env-name python=3.9
    • conda activate env-name
    • conda install -c apple tensorflow-deps==2.7
    • python -m pip install tensorflow-macos==2.7

----------

    • pip install tensorflow-probability==0.15
    • pip install --upgrade numpy scipy pandas statsmodels matplotlib seaborn palettable progressbar2 tabulate icecream flake8 keras-tuner jupyterlab black isort jupyterlab_code_formatter
    • pip install -U scikit-learn
    • pip install silence-tensorflow tqdm
    • conda install -c conda-forge cmocean cartopy xarray dask netCDF4

179 of 259

Ethical Use of AI in Earth Science


180 of 259

A fun motivational story


181 of 259

A fun motivational story


Lehman et al. (2020) https://doi.org/10.1162/artl_a_00319

182 of 259

Using AI to hire new workers

Goal: Feed an AI all resumes and have the AI return the top candidates


183 of 259

Using AI to hire new workers

Goal: Feed an AI all resumes and have the AI return the top candidates

Problem: AI had a gender bias no matter how it was trained


184 of 259

AI Ethics: Using AI methods in an ethical manner so as not to cause harm to others. Accomplished through careful thought in data processing, AI development, and deployment.


185 of 259


186 of 259


187 of 259

Example of biased data in Earth science

Radar coverage is unequal


188 of 259

Bias can be introduced unintentionally!

  • The fill value used for missing data can alter the model’s prediction


Chase and McGovern (2022), "Deep Learning Parameter Considerations When Using Radar and Satellite Measurements"

189 of 259

What can you do?

  • Get to know your data
    • How was it collected?
    • Unequal samples?
    • Fully representative?
  • Consider how you create and train your model
    • What assumptions are your model based on?
  • After training, spend time figuring out what your model has learned and why it is making its decisions
    • Explainability methods can help
  • Be intentional!!!


190 of 259

Implementation and Assessment of an ANN


191 of 259

Different regularization techniques

  • L2 (Ridge) Regression
    • Forces weights to share importance

  • L1 (LASSO) Regression
    • Penalizes high weights; sets low weights to zero

  • Dropout


192 of 259

L2 Regularization: an example


193 of 259

Performance Measures

Confusion Matrix: summary of prediction results


*Note: these are only used for classification problems!

Assessment Scores:

Precision = TP / (TP + FP); (between 0 and 1)

Accuracy of positive predictions

Recall = TP/ (TP + FN); (between 0 and 1)

Ratio of positive instances correctly detected by classifier

F1 score: harmonic mean of precision and recall; (between 0 and 1)
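For reference, F1 = 2·(precision·recall)/(precision + recall). All of these are one-liners in scikit-learn; a toy example:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]
    print(confusion_matrix(y_true, y_pred))   # rows: truth, columns: prediction
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))           # harmonic mean of the two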

194 of 259

Performance Measures

Receiver Operating Characteristic (ROC)


  • Plots the true positive rate against the false positive rate for all possible thresholds
  • The further from the red dotted (no-skill) line, the better the model

  • True positive rate = ratio of positive instances correctly detected by classifier (i.e. recall)
  • False positive rate = ratio of negative instances that are incorrectly classified as positive

195 of 259

Performance Measures

Receiver Operating Characteristic (ROC)


Assessment Scores: measure the Area Under the Curve (AUC)

  • Ranges from 0 to 1
  • Higher AUC → better* model

*remember, you (the scientist) decide what better means!

196 of 259

Performance Measures

Loss & Accuracy Assessment: User-Defined Loss Function

Overfit Model: training outperforms validation


197 of 259

Performance Measures

Loss & Accuracy Assessment: User-Defined Loss Function

Overfit Model: training outperforms validation

Generalized Model: training matches validation


198 of 259

Permutation Importance


A way to detect which features are most important for accurate predictions (Breiman 2001).

For each feature, the feature's values are shuffled and new predictions are made (with the shuffled feature values included).

If the model error increases after the shuffling, the feature is considered important. If there is no change in the model error, the feature in question adds no value to the model prediction.

https://christophm.github.io/interpretable-ml-book/feature-importance.html
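A minimal sketch of this single-pass procedure with scikit-learn (synthetic data; the multi-pass variant on the next slide is not implemented here):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X[:200], y[:200])

    # shuffle each feature on held-out data and measure the drop in skill
    result = permutation_importance(model, X[200:], y[200:], n_repeats=10, random_state=0)
    for i in result.importances_mean.argsort()[::-1]:
        print(f"feature {i}: {result.importances_mean[i]:.3f}")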

199 of 259

Permutation Importance


What if x is highly correlated with another predictor? → Multi-pass permutation

Freeze the initially permuted predictor that degrades model performance the most, then re-permute the other predictors to select the second frozen predictor; continue until all features are permuted and frozen.

By comparing to the baseline model skill, we can determine which predictors impact skill the most.

Rank predictors based on the error increase after permutation.

Lakshmanan et al. (2015)

McGovern et al. (2019)

200 of 259

Time to code!

https://github.com/eabarnes1010/ml_tutorial_csu

From Github, open ann_ozone_joshuatree_metrics.ipynb


201 of 259

Methods of Explainable AI (XAI) for ANNs


202 of 259

Overview

1) Introduction to XAI:

i) Motivation for XAI

ii) The general idea of how XAI works

iii) Opportunities that XAI brings

iv) Representative methods and categories of XAI

2) Popular XAI methods:

i) Gradient

ii) Input*Gradient

iii) Layer-wise Relevance Propagation

iv) SHAP - SHapley Additive exPlanations

3) Benchmarking XAI:

i) Motivation - General idea

ii) Examples

4) Summary


203 of 259

Introduction to XAI

204 of 259

Why is XAI necessary?


Scientists need to understand what the AI model is doing; what the decision-making process is.

Linear model: inherently interpretable

Neural Network: not inherently interpretable

205 of 259

Why is XAI necessary?


From Adebayo et al. (2020)

Methods of eXplainable Artificial Intelligence (XAI) aim to explain how a Neural Network makes predictions, i.e., what the decision strategy is.

XAI methods highlight which features in the input space are important for the prediction: They produce the so-called explanation/relevance heatmaps.

Explanation (or relevance) Heatmap

206 of 259

Why is XAI necessary?


Methods of eXplainable Artificial Intelligence (XAI) aim to explain how a Neural Network makes predictions, i.e., what the decision strategy is.

XAI methods highlight which features in the input space are important for the prediction: They produce the so-called explanation/relevance heatmaps.

(figure: network inputs and their explanation/relevance heatmaps for example classes: tabby cat, white wolf, ram, black widow)

Any questions?

207 of 259


XAI: A potential game changer for prediction in Earth Sciences

XAI may help accelerate establishing new science, like investigating new climate teleconnections and gaining new insights.

From Mamalakis et al. (2022)

XAI helps calibrate model trust and physically interpret the network, which is a necessity in many applications in Earth Sciences.

XAI may help fine-tune and optimize the strategy of a flawed model

208 of 259


XAI methods and categories

AI models

  • Interpretable models (e.g., linear models, decision trees)
  • Post-hoc explainable models
    • Global XAI methods (e.g., optimal input, permutation importance)
    • Local XAI methods
      • sensitivity (e.g., Gradient)
      • attribution (e.g., LRP, SHAP)

From Samek et al. (2021)

209 of 259

Introduction to XAI

Popular XAI Methods

210 of 259


Gradient (sensitivity)

  • Sensitivity refers to how sensitive the value of the output is to a specific input feature. It is essentially the gradient (i.e., the first derivative, if we think of the network as a function) of the output with respect to the input. [units: output/input]

For network output y, the relevance of feature i for prediction n is the partial derivative evaluated at that sample:

R_i^n = ∂y/∂x_i, evaluated at x = x^n (where x_i^n is the value of feature i in sample n)
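A minimal TensorFlow sketch of computing this gradient for a Keras model (the untrained toy model here is a stand-in for a trained network):

    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                                 tf.keras.layers.Dense(1)])
    x = tf.random.normal((1, 100))        # one input sample

    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    relevance = tape.gradient(y, x)       # dy/dx_i: the sensitivity map
    # input_times_gradient = x * relevance   # the attribution method on the next slide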

211 of 259


Input*Gradient (attribution)

  • Attribution refers to the relative contribution of a specific input feature to the output. [units: output]

R_i^n = x_i^n · (∂y/∂x_i at x = x^n), i.e., the input times the gradient

(Analogy in the simple pendulum setting: the ball's velocity and the ball's acceleration.)

212 of 259


LRP: Layerwise Relevance Propagation (attribution)

LRP-z rule: R_i^(l) = Σ_j [ z_ij / Σ_i' z_i'j ] · R_j^(l+1)

where R_i^(l) is the relevance of neuron i in layer l and z_ij is the preactivation from i to j. Other popular LRP rules exist as well.

From Bach et al. (2015)

Any questions?

213 of 259


SHAP: SHapley Additive exPlanations (attribution)

Consider the general class of explanation models (Eq. 1): g(z') = φ_0 + Σ_i φ_i z'_i, where g approximates the network, z' is a simplified input, and φ_i is the attribution to feature i.

Any XAI method that can be represented as in Eq. (1) we will call an additive feature attribution method. LRP and other popular XAI methods (e.g., LIME, DeepLIFT) are essentially different solutions to Eq. (1). The best solution to (1) is to use Shapley values; LRP is not the best solution to (1).

From Lundberg and Lee (2017)

214 of 259

214

SHAP (attribution)

The attribution φᵢ to feature i is its Shapley value: a weighted average of the feature's marginal contribution to the prediction, taken over all subsets S of the remaining features:

φᵢ = Σ_{S ⊆ F\{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] × [ f(x_{S ∪ {i}}) − f(x_S) ]

where F is the set of all features and f(x_S) denotes the model prediction with only the features in S present.
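For intuition, a brute-force sketch that computes exact Shapley values for a tiny model by enumerating every subset (feasible only for a handful of features; withheld features are set to a background value, as discussed on the following slides; all names are illustrative):

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values by enumerating all subsets.
    Withheld features are replaced by the background value."""
    M = len(x)
    phi = np.zeros(M)
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            for S in itertools.combinations(others, size):
                weight = (math.factorial(len(S))
                          * math.factorial(M - len(S) - 1)
                          / math.factorial(M))
                x_S = background.copy()
                x_S[list(S)] = x[list(S)]          # only features in S present
                x_Si = x_S.copy()
                x_Si[i] = x[i]                     # ... plus feature i
                phi[i] += weight * (f(x_Si) - f(x_S))
    return phi

# Toy model: linear, so the Shapley values are exactly w * (x - background)
w = np.array([1.0, -2.0, 0.5])
f = lambda x: float(w @ x)
x = np.array([0.2, 0.8, 1.0])
background = np.zeros(3)
print(shapley_values(f, x, background))   # -> [0.2, -1.6, 0.5]
```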

215 of 259

215


SHAP (attribution)

216 of 259

216


SHAP (attribution)


217 of 259

217


SHAP (attribution)

For problems with high dimensions (e.g., more than 10 features), the number of times one would need to retrain and evaluate the model to calculate the Shapley values is extremely high. This is computationally inefficient, so the SHAP method uses an approximate algorithm (Deep SHAP), specifically designed for deep neural networks. Deep SHAP is similar to LRP, except that instead of propagating the relevance, it propagates the Shapley values.

Also, there is no model retraining in Deep SHAP. When a specific feature (or neuron output) needs to be withheld (i.e., considered missing), it is replaced with a background value, usually the average value in the training dataset.
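A minimal usage sketch with the shap package (the small model and random data below are illustrative stand-ins for your own trained network and dataset):

```python
import numpy as np
import shap                    # pip install shap
import tensorflow as tf

# Illustrative stand-ins for a trained model and its data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
X_train = np.random.rand(500, 4).astype("float32")
X_test = np.random.rand(10, 4).astype("float32")

# Background set: withheld features are effectively replaced by these values
# (a random sample of training data; a mean profile is another common choice).
background = X_train[np.random.choice(len(X_train), 100, replace=False)]

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(X_test)   # one attribution per feature
```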

218 of 259

218

SHAP Example on MNIST

[Figure: a CNN trained on MNIST maps an input digit image to ten output classes (Class 0 through Class 9); SHAP produces one attribution heatmap per class for each input]

Any questions?

219 of 259

Introduction to XAI

Benchmarking XAI

220 of 259

220

The need for objectivity in assessing XAI

Issues: 1) No ground truth to assess the estimated explanations.

[Figure: an XAI method produces a heatmap for an image classification, but which input features were truly important for this classification?]

Debunking the phrase: "The explanation looks reasonable"

Remember: Human perception of the explanation alone is NOT a solid criterion for its trustworthiness.

From Adebayo et al. (2020)

221 of 259

221

Issues : 1) No ground truth to assess the estimated explanations.

2) Different methods provide different answers.

This is problematic: uncertainty about how the network decides leads to limited trust when using neural networks in environmental problems.

Many Different XAI methods

We need objective frameworks to rigorously assess XAI methods and gain insights about relative strengths and weaknesses.

From Adebayo et al. (2020)

The need for objectivity in assessing XAI

Which input features were important for this classification?

222 of 259

222

From Mamalakis et al. (2021)

Attribution benchmarks for XAI

223 of 259

223

From Mamalakis et al. (2021)

Regression Benchmark - Fully Connected Network

Any questions?

224 of 259

224

From Mamalakis et al. (2022)

Classification Benchmark - Convolutional Network

225 of 259

225

From Mamalakis et al. (2022)

Classification Benchmark - Convolutional Network

226 of 259

Best practices of XAI

226

From Mamalakis et al. (2022)

227 of 259

Introduction to XAI

Summary

228 of 259

Key take home messages

  • XAI methods show potential to be a game-changer in how we predict/detect patterns in Earth Sciences. We can use these tools to calibrate model trust, fine-tune models and learn new science.

  • We briefly went through some popular XAI methods in Earth Sciences: Gradient, Input*Gradient, Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP).

  • Given the plethora and diversity of methods out there, the lack of a ground truth to assess their fidelity risks allowing subjective assessment and cherry-picking of certain methods. It is important to introduce objectivity into XAI assessment and shed light on relative strengths and weaknesses.

  • Using attribution benchmarks may lead to a more cautious and successful implementation of XAI methods.

229 of 259

References

229

Adebayo, J., et al. (2020) "Sanity checks for saliency maps," arXiv preprint, https://arxiv.org/abs/1810.03292

Bach, S., et al. (2015) "On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation," PLOS ONE, https://doi.org/10.1371/journal.pone.0130140

Lundberg, S. M. and S. I. Lee (2017) "A unified approach to interpreting model predictions," Proc. Adv. Neural Inf. Process. Syst., pp. 4768-4777

Mamalakis, A., I. Ebert-Uphoff and E. A. Barnes (2021) "Neural Network Attribution Methods for Problems in Geoscience: A Novel Synthetic Benchmark Dataset," arXiv preprint, https://arxiv.org/abs/2103.10005, accepted in Environmental Data Science

Mamalakis, A., I. Ebert-Uphoff and E. A. Barnes (2022) "Explainable Artificial Intelligence in Meteorology and Climate Science: Model fine-tuning, calibrating trust and learning new science," in Beyond Explainable Artificial Intelligence, Holzinger et al. (Eds.), Springer Lecture Notes on Artificial Intelligence, open access: https://link.springer.com/chapter/10.1007/978-3-031-04083-2_16

Mamalakis, A., E. A. Barnes and I. Ebert-Uphoff (2022) "Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience," arXiv preprint, https://arxiv.org/abs/2202.03407, accepted in Artificial Intelligence for the Earth Systems

Samek, W., et al. (2021) "Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications," Proceedings of the IEEE, vol. 109, no. 3, pp. 247-278, doi: 10.1109/JPROC.2021.3060483

Other Resources

INNVESTIGATE

SHAP

My GitHub page

230 of 259

Simple Uncertainty Quantification

Can a network predict how right it is?

230

231 of 259

Uncertainty Quantification

231

For prediction problems in the Earth sciences, we usually want some sort of uncertainty or likelihood associated with the prediction.

E.g. in forecasting we generate ensembles

Image from https://www.metoffice.gov.uk/research/weather/ensemble-forecasting/decision-making

232 of 259

Uncertainty Quantification

232

For prediction problems in the Earth sciences, we usually want some sort of uncertainty or likelihood associated with the prediction.

E.g. in forecasting we generate ensembles

This allows us to quantify the likelihood of an event, as well as build trust in our models

Image from https://www.metoffice.gov.uk/research/weather/ensemble-forecasting/decision-making

233 of 259

Uncertainty Quantification in ML

233

We want to build uncertainty quantification into machine learning models for the same reasons:

  • Quantify the likelihood of an event
  • Build trust in our models

There will be 2 cm of snow this weekend!

There will be 2 ± 2 cm of snow this weekend!

234 of 259

Uncertainty Quantification in ML

234

We want to build uncertainty quantification into machine learning models for the same reasons:

  • Quantify the likelihood of an event
  • Build trust in our models

Here we will go through one way of adding uncertainty to NNs but there are all sorts of other methods you may encounter

There will be 2 cm of snow this weekend!

There will be 2 ± 2 cm of snow this weekend!

235 of 259

235

Input Layer

Hidden Layer(s)

Output Layer

Consider a simple regression task…

Predictors, xᵢ

Prediction, a specific number, yₚ

236 of 259

236

Input Layer

Hidden Layer(s)

Output Layer

Consider a simple regression task…

Based on ocean heat information…

…predict SST anomaly at a point in the ocean

237 of 259

237

Consider a simple regression task…

In a regression model, we want to train a network to minimize the error between its predictions and the truth,

Loss function could be mean absolute error, or mean squared error, e.g.

ℒ = |yₚ - y|

And we can make a scatter plot of the truth (y) vs. predictions (yₚ)

[Scatter plot: predicted yₚ against true y]
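As a minimal Keras sketch of such a regression setup (toy data and illustrative names; see the GitHub repo for the real coding examples):

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for predictors (e.g., ocean heat content) and a target (SST anomaly).
X = np.random.rand(1000, 10).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),                 # single deterministic output, y_p
])
model.compile(optimizer="adam", loss="mae")   # L = |y_p - y|
model.fit(X, y, epochs=5, verbose=0)

y_p = model.predict(X)   # scatter y_p against y to assess the fit
```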

238 of 259

238

Consider a simple regression task…

Obviously some predictions are better than others

[Scatter plot: predicted yₚ against true y]

Wouldn’t it be great if there were some way of knowing how good a prediction is?

i.e., get the network to estimate an uncertainty range

239 of 259

Adding Uncertainty to Regression Tasks

Rather than the network outputting a single number as its estimate….

We want the network to output an estimate and an uncertainty range

239

yₚ = prediction

𝜎 = associated uncertainty

240 of 259

Adding Uncertainty to Regression Tasks

Rather than the network outputting a single number as its estimate….

We want the network to output an estimate and an uncertainty range

240

We train the neural network to predict conditional distributions, i.e., each prediction consists of the parameters of a probability distribution

yₚ = predicted anomaly

𝜎 = uncertainty range

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

241 of 259

Adding Uncertainty to Regression Tasks

Rather than the network outputting a single number as its estimate….

We want the network to output an estimate and an uncertainty range

241

The simplest version of this is predicting a Gaussian – so predicting a mean (𝜇) and standard deviation (𝜎)

𝜇 = predicted anomaly

𝜎 = uncertainty range

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

242 of 259

Adding Uncertainty to Regression Tasks

242

πœ‡

𝜎

Predicted anomaly

e.g. 1.3

Uncertainty

e.g. 1.2

Predicted πœ‡ and 𝜎 are used to construct a normal distribution

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

243 of 259

Adding Uncertainty to Regression Tasks

243

𝜇 = predicted anomaly (e.g., 1.3)

𝜎 = uncertainty (e.g., 1.2)

Evaluate p, the probability density function, at the true value of the anomaly (y true)

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

244 of 259

Adding Uncertainty to Regression Tasks

244

𝜇 = predicted anomaly (e.g., 1.3)

𝜎 = uncertainty (e.g., 1.2)

Loss function is defined as

ℒ = -log(p)

where p is the predicted density evaluated at the true anomaly (y true)

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
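In code this loss is only a few lines. A minimal TensorFlow sketch, assuming the network's two outputs are interpreted as 𝜇 and log 𝜎 (the log keeps 𝜎 positive); this is an illustration, not the exact implementation of Barnes et al.:

```python
import numpy as np
import tensorflow as tf

def gaussian_nll(y_true, y_pred):
    """L = -log p(y_true | mu, sigma), for a network whose two outputs
    per sample are interpreted as [mu, log_sigma]."""
    mu = y_pred[:, 0:1]
    log_sigma = y_pred[:, 1:2]
    sigma = tf.exp(log_sigma)                     # exp keeps sigma positive
    return tf.reduce_mean(
        0.5 * np.log(2.0 * np.pi) + log_sigma
        + 0.5 * tf.square((y_true - mu) / sigma)
    )

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(2),                     # outputs: [mu, log_sigma]
])
model.compile(optimizer="adam", loss=gaussian_nll)
```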

245 of 259

Adding Uncertainty to Regression Tasks

245

Loss function is defined as

ℒ = -log(p)

To minimize the loss, the network attempts to maximize p, the predicted density at the true anomaly (y true)

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

246 of 259

Adding Uncertainty to Regression Tasks

246

Loss function is defined as

ℒ = -log(p)

To minimize the loss, the network attempts to maximize p

The network must therefore learn to make good anomaly predictions and reasonable uncertainty estimates

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

247 of 259

Adding Uncertainty to Regression Tasks

247

Now predictions have an uncertainty range

Here we plot the 1𝜎 range associated with each prediction


Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

248 of 259

A Fun and Important Point!

248


Loss function:

ℒ = -log(p)

This is simply a negative log-likelihood!

The predicted distribution does not have to be a Gaussian!!

Consider for example the sinh-arcsinh (SHASH) normal distribution…

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

249 of 259

A Fun and Important Point!

249


The SHASH normal distribution

  • A distribution described by four parameters:
    • location (𝜇)
    • scale (𝜎)
    • skewness (𝛾)
    • and tail weight (𝜏)

Examples of SHASH with tail weight set to 1 (𝜏 = 1)

250 of 259

A Fun and Important Point!

250


The SHASH normal distribution

  • A distribution described by four parameters:
    • location (𝜇)
    • scale (𝜎)
    • skewness (𝛾)
    • and tail weight (𝜏)

Note: if skewness = 0 (with tail weight 𝜏 = 1), we recover a normal distribution!
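The same -log(p) loss then works unchanged. A sketch assuming TensorFlow Probability, whose SinhArcsinh distribution exposes the same four parameters (its conventions may differ slightly from the paper's):

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# Example parameter values; in practice these are the network's four outputs.
dist = tfd.SinhArcsinh(loc=1.3, scale=1.2, skewness=0.5, tailweight=1.0)

loss = -dist.log_prob(0.8)   # same -log(p) loss, evaluated at the true anomaly

# Sanity check of the note above: skewness = 0 and tailweight = 1
# recovers a plain Gaussian.
shash0 = tfd.SinhArcsinh(loc=1.3, scale=1.2, skewness=0.0, tailweight=1.0)
normal = tfd.Normal(loc=1.3, scale=1.2)
print(float(shash0.log_prob(0.8)), float(normal.log_prob(0.8)))  # ~equal
```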

251 of 259

Using SHASH

251

Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250

252 of 259

Application:

Predicting SST on decadal timescales

252

253 of 259

Predicting SST on decadal timescales

253

πœ‡

OHC to 100 m

OHC to 300 m

OHC to 700 m

𝜎

NEURAL NETWORK

(60,4)

INPUT LAYER

OUTPUT LAYER

254 of 259

Predicting SST on decadal timescales

254

πœ‡

𝜎

OUTPUT LAYER

Prediction of SST anomaly with uncertainty at a point in the ocean 1-5 years later, e.g., the North Atlantic Subpolar Gyre (53N, 35W)

255 of 259

Identifying State-Dependent Predictability

255

The neural network identifies more predictability by assigning lower uncertainty values to input patterns that result in more predictable outcomes

For low uncertainty predictions, the ANN is more confident its prediction is closer to the truth

Prediction error (difference between truth and predicted anomaly) decreases as we sort by increasing confidence

More confident = lower uncertainty
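As a quick sanity check on your own model, sort test predictions by predicted 𝜎 and confirm the mean error shrinks as low-confidence samples are discarded (synthetic stand-in arrays below; `mu`, `sigma`, `y` would come from your trained network):

```python
import numpy as np

# Stand-ins: predicted mean/uncertainty and truth on a test set.
rng = np.random.default_rng(1)
sigma = rng.uniform(0.2, 1.5, size=1000)
y = rng.standard_normal(1000) * sigma      # here error really scales with sigma
mu = np.zeros(1000)

order = np.argsort(sigma)                  # most confident (smallest sigma) first
abs_err = np.abs(mu - y)[order]

# Mean error among the N% most confident predictions, for increasing N
for keep in (0.1, 0.25, 0.5, 1.0):
    n = int(keep * len(abs_err))
    print(f"most confident {int(keep * 100):3d}%: mean |error| = {abs_err[:n].mean():.2f}")
```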

256 of 259

What can we learn from this?

We train a neural network to predict SST in 1-5 years at every grid point in the ocean (i.e. one* neural network per grid point).

We can then compare the skill of our predictions across all samples in the testing set…

And we can leverage the uncertainty/𝜎 predictions to identify increased predictability

256

257 of 259

Resources if you want more UQ

257

Code! There is a coding example in the GitHub repo to go through in your own time.

Papers! We have a quick explainer on this method

  • Barnes, Elizabeth A., Randal J. Barnes and Nicolas Gordillo: Adding Uncertainty to Neural Network Regression Tasks in the Geosciences, https://arxiv.org/abs/2109.07250

And I (Emily Gordon) have a paper in review on adapting this method in my own research

  • Gordon, Emily M. and Elizabeth A. Barnes: Incorporating Uncertainty into a Regression Neural Network Enables Identification of Decadal State-Dependent Predictability, https://www.essoar.org/doi/abs/10.1002/essoar.10510836.1

258 of 259

Main Takeaways

  • ML has many applications to Earth Science and can help us learn NEW things about the Earth system and solve problems faster and cheaper
    • We need lots of data and to think like a scientist!
  • We can open the black box of ML through visualization and understanding of the ML process to learn what the model learned
    • Explainable Artificial Intelligence (XAI) techniques are a potential game changer in Earth Science predictions
      • They aim to explain how a neural network makes predictions, i.e., what the decision strategy is

259 of 259

END