Applied Machine Learning for Earth Scientists
Elizabeth Barnes, Marybeth Arcodia, Charlotte Connolly, Frances Davenport, Zaibeth Carlo Frontera, Emily Gordon, Daniel Hueholt, Antonios Mamalakis, Elina Valkonen
1
Please cite these slides with the DOI provided by Zenodo. Linked here.
Topics Covered
3
Basics of Machine Learning Applications to Earth Science
Visualization and Explainability of Machine Learning in Earth Science
Resources
4
Organizers
Elizabeth Barnes: Professor
Marybeth Arcodia: Postdoc
Antonios Mamalakis: Postdoc
Elina Valkonen: Postdoc
Frances Davenport: Postdoc
5
Organizers
6
Materials also sourced from Wei-Ting Hsiao, Jamin Rader, Aaron Hill, Imme Ebert-Uphoff, Kirsten Mayer, Ben Toms, Yoonjin Lee @ CSU
Emily Gordon: PhD Student
Charlie Connolly: MS Student
Website: https://sites.google.com/view/connolly-climate/home
Daniel Hueholt: MS Student
Zaibeth Carlo Frontera: MS Student
What you will learn here...
7
What you will not learn...
8
Machine Learning for Science
9
The "black box"
10
data → prediction
We have a lot of data...
We are creating data faster than we can process it.
11
Our field has a big toolbox...
12
EOF Analysis
Spectral Analysis
Linear trend detection
Correlations
Dynamical Model Simulations
Machine Learning offers an additional set of tools for the job.
13
Machine Learning*
Machine Learning*
14
One of our jobs as scientists is to sift through piles of data and try to extract useful relationships that apply elsewhere, i.e. that are applicable "out of sample".
This is what many machine learning methods are designed to do.
Commercial Applications
Machine learning has made huge inroads for commercial applications
For example, by the early 2000s convolutional neural networks processed 10-20% of all checks in the U.S.
The concept of machine learning has been around since the early 1950s. The explosion in the past decade can be attributed to advances in training deep networks and to the rapid growth of available data
Self-driving vehicles
Facial recognition
Natural language processing
https://blog.cloudsight.ai/chihuahua-or-muffin-1bdf02ec1680
Chihuahua or Blueberry Muffin?
https://towardsdatascience.com/teaching-cars-to-see-vehicle-detection-using-machine-learning-and-computer-vision-54628888079a
Visage Technologies Ltd, Creative Commons License, Wikimedia.org
https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72
Commercial Applications
Takes a winter image and turns it into summer
Commercial Applications
It can fail spectacularly too!
Heaven, D., 2019: Why deep-learning AIs are so easy to fool. Nature, 574, 163–166.
Science!
Even with its "black box" persona, ML has already aided significant scientific advances
e.g., the area of bioinformatics has exploded in recent years due to machine learning and data mining
Distinguishing high-energy particles in the Large Hadron Collider
Gene prediction and sequencing
Predicting properties of solar flares
Identifying drug-drug interactions
By Lucas Taylor / CERN - http://cdsweb.cern.ch/record/628469, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1433671
Science!
The number of science articles using supervised learning has grown rapidly in recent years.
However, AMS papers using supervised machine learning have lagged behind those across all of geoscience.
Maskey, M., H. Alemohammad, K. J. Murphy, and R. Ramachandran (2020), Advancing AI for Earth science: A data systems perspective, Eos, 101, https://doi.org/10.1029/2020EO151245
ML for Climate Science
The field's interest, and research, has exploded in the past ~3 years!
Applications of ML to atmospheric science date back as far as the 1960s!
21
[Example papers from 1964, 1998, and 2004]
ML for Climate Science
Range of applications:
22
Weather prediction: e.g., Gagne et al. (2019); Gagne et al. (2017); Chattopadhyay et al. (2019); Lagerquist et al. (2020); Weyn et al. (2020)
Convective parameterizations: e.g., Rasp et al. (2018; PNAS); Schneider et al. (2017; GRL); O'Gorman and Dwyer (2018); Beucler et al. (2020; PRL); Brenowitz and Bretherton (2018)
Extreme event detection: e.g., Reichstein et al. (2019)
Equation discovery: e.g., Zanna and Bolton (2020)
ML for Climate Science
23
Commercial application: Infilling an image
NVIDIA Research: https://www.youtube.com/watch?v=gg0F5JjKmhA
ML for Climate Science
24
Climate application: Temperature reconstruction (e.g., 1877)
Kadow et al. (2020; Nature Geoscience)
Also see DelSole and Nedza (2020) for other reconstruction applications
Evidence of the reported 1877 El Niño
ML for Climate Science
Communicating climate change is another area with great promise for ML
Groups are working on using deep learning to produce accurate and vivid renderings of the future outcomes of climate change
25
https://mila.quebec/en/ai-society/visualizing-climate-change/
Reasons to use ML for Science
26
Very relevant for research: ML may be slower and perform worse than existing tools, but we can still learn something from it
Example uses in Climate Science
27
Foundational Concepts of ML
28
What is Machine Learning?
Practically speaking:
Techniques for fitting data.
It is not restricted to a single simple mathematical expression, or even a set of them.
It can be made as complicated as needed by adding procedures, judgement statements, even randomness...
29
Field of study that gives computers the ability to learn without being explicitly programmed.
Defined by Arthur Samuel (1959)
Types of machine learning
Unsupervised learning: looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision, e.g.
30
By I, Weston.pace, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2463085; https://www.vectorstock.com/royalty-free-vector/neural-network-vector-18470587
Types of machine learning
Unsupervised learning: looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision, e.g.
Supervised learning: maps an input to an output based on example input-output pairs, e.g.
31
There are other types, but these are the two main flavors we will discuss
By I, Weston.pace, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2463085; https://www.vectorstock.com/royalty-free-vector/neural-network-vector-18470587
How do you predict y from x?
32
[Scatter plot: x = sea surface temperature, y = convection occurrence]
Linear regression model
33
Features: x; Labels: y
Model (the architecture): y = ax + b
With the assumption (or, architecture) that the model is a 1-degree polynomial, we describe this model by its parameters (a, b).
Loss function: root-mean-square error (RMSE). We always need a metric to define how "good" a model is.
Optimization: find (a, b) that minimizes the RMSE, e.g., (a, b) = (0.5, 2).
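As a concrete sketch of this fit (the toy data below is generated to roughly match the slide's (a, b) = (0.5, 2); it is not the slide's actual dataset):

```python
import numpy as np

# Toy data consistent with the slide's example model y = 0.5x + 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 0.5 * x + 2 + rng.normal(0, 0.3, 50)

a, b = np.polyfit(x, y, deg=1)                 # least-squares fit of y = ax + b
rmse = np.sqrt(np.mean((a * x + b - y) ** 2))  # the metric we report
print(f"a = {a:.2f}, b = {b:.2f}, RMSE = {rmse:.3f}")
```

Note that polyfit minimizes the sum of squared errors, which has the same minimizer as the RMSE.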
Nonlinear regression model
34
[Scatter plot of features x vs. labels y]
Nonlinear regression model
35
Features: x; Labels: y
Model: with the architecture that the model is a 2-degree polynomial, y = ax² + bx + c
Minimize the loss function: RMSE
We could find the model: (a, b, c) = (−0.2, 2, 0)
Classification-like model
36
Features: x; Labels: y
With this architecture, the model is a lookup table:
x | y
0.0 < x < 3.0 | A
3.0 < x < 4.1 | B
4.1 < x < 5.6 | C
5.6 < x < 9.7 | D
Selection of architectures
37
FACT: Perfect learning is impossible (since not "all" the data is available).
Which model is right (model #1 or model #2)? Neither.
Which architecture should be used? It depends.
To increase the possibility of finding good models (based on your purpose):
1) have a good-quality data set
2) wisely choose architectures according to data properties
3) wisely use your data
Machine Learning Approach
38
Selecting Predictors
Question: Will Marybeth catch the bus to campus?
All possible factors
Output: Catch the bus?
It's complicated!
Machine Learning Approach
39
Selecting Predictors
Question: Will Marybeth catch the bus to campus?
Inputs/Predictors: all possible factors
Truth/Predictand: output (catch the bus?)
It's complicated!
Machine Learning Approach
40
Selecting Predictors
Past Data: Will Marybeth catch the bus to campus?
Marybeth caught the bus to campus 108 of 150 times that it was sunny.
Marybeth never caught the bus when she woke up after 9:00am (sample size of 780 days).
On the one day the sunrise was neon green, Marybeth caught the bus!
Wild, right? Wouldn't anyone wake up early to watch a neon green sunrise?
Machine Learning Approach
41
Selecting Predictors
Should we use all possible factors in machine learning?
Machine Learning Approach
42
Selecting Predictors
Question: Will Marybeth catch the bus to campus?
Inputs/Predictors:
Truth/Predictand:
ML Method
Predictors
Output: Catch the bus?
Machine Learning Approach
43
Data Splitting is a very important aspect of ML design
Machine Learning Approach
44
Data Splitting is a very important aspect of ML design
Data split into 3 parts: training, validation, testing
Why?
*Often need to standardize your data!
Machine Learning Approach
45
Training Data
Data subset used to fit the ML model; data the model uses to learn and optimize (i.e. minimize loss)
Validation Data
Data subset used to tune model hyperparameters via unbiased evaluation of model fit
Testing Data
Data subset used to evaluate the final ML model on data not seen before
How to Split Data
46
Random Splitting of Data
How to Split Data
47
Random Splitting of Data
Split full dataset into 80% training and 20% testing
Split training subset into 75% training and 25% validation
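In code, this two-stage random split might look like the following (a minimal sketch with scikit-learn; X and y are stand-ins for your features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 5), np.random.rand(1000)   # stand-in data

# 80% training+validation, 20% testing
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
# 75% of the remainder for training, 25% for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```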
How to Split Data
48
Climate data is often autocorrelated, so we split data chronologically
Example: Full dataset from 1900-2000
ALL THE DATA
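A chronological split can be done with simple boolean masks on the time axis (the year boundaries below are an illustrative assumption, not from the slides):

```python
import numpy as np

years = np.arange(1900, 2001)        # one sample per year, 1900-2000
train_mask = years <= 1960           # earliest chunk for training
val_mask = (years > 1960) & (years <= 1980)
test_mask = years > 1980             # most recent chunk held out for testing
# e.g., X_train = X[train_mask]; X_val = X[val_mask]; X_test = X[test_mask]
```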
Machine Learning Approach
49
Groups of Data
ALL THE DATA
Training Data
Validation Data
Testing Data
[Scatter plot of x vs. y: the fitted solution represents the training data and testing data well]
Machine Learning Approach
50
Groups of Data
Overfitting: fitting the training data so closely that the model will fail on unseen data of the same type.
The plotted solution is a perfect model for the training data, but a very poor model for our testing data.
Machine Learning Approach
Training is an iterative process.
Learning happens by training on past data.
Each iteration (epoch) of training, the machine learning model should better fit the training data.
At the beginning of training, the machine learning model has no skill.
51
Training the Model
Machine Learning Approach
52
Training the Model
Interviewer: What's your biggest strength?
Me: Machine Learning.
Interviewer: What's 9 + 6?
Me: 0.
Interviewer: Incorrect. It's 15.
Me: It's 15.
Interviewer: What's 20 + 4?
Me: It's 15.
*** continues for 1000 epochs ***
Machine Learning Approach
53
Training the Model
Initially no skill
Model updates
Based on past data
Machine Learning Approach
Training data is put into the machine learning model's learned function.
The model outputs its predictions. These predictions are compared with their truth values.
The function is updated to decrease the error between predicted and truth outputs.
This cycle continues.
54
Training the Model
xᵢ: training data (predictors) → g: learned function → ŷᵢ: predicted outputs (predictands)
Compare ŷᵢ (predicted outputs) vs. yᵢ (truth outputs)
One full pass through the training data = 1 epoch
Some terminology
WHAT WE CALL IT → WHAT ML CALLS IT
Dependent variable / right answer → Label
Variable / predictor → Feature
Slopes / regression coefficients → Weights
Y-intercept / constant factor → Bias
55
Summary
56
What is ML?
The Practical Procedure of ML
Vocab
Decision Trees and Random Forests
A brief overview
57
Image generated by Wombo AI
Meet Atlas!
58
Will Atlas want to play outside?
59
Playful!
or
Sleepy 💤
Goal: design questions to classify events so predictions with new data are accurate
60
Observed data/events:
Will Atlas want to play outside?
Predictors:
Temperature, tiredness, boredom, hunger, fur length, snow, etc.
Output
Yes! Atlas is playful
No! Atlas is sleepy
Forest icon made by Freepik from flaticon.com
A decision tree is a series of questions: Is temperature above 80?
Is it snowy?
Did he play yesterday?
Will Atlas want to play outside?
61
Is the temperature below 80 °F?
Yes!
No
Sleepy 💤
Will Atlas want to play outside?
62
Yes!
Is the temperature below 80 °F?
Is it snowing?
Yes!
No
Sleepy 💤
Playful!
No
Will Atlas want to play outside?
63
Yes!
Is the temperature below 80 °F?
Is it snowing?
Did he play outside yesterday?
Yes!
No
Sleepy 💤
Playful!
No
No
Playful!
Yes!
β¦
What is a decision tree?
An intuitive supervised learning method for classification (discrete) and regression (continuous variable) tasks
Ask a series of questions to discriminate labels (i.e., classifications)
64
Courtesy: Python Data Science Handbook
Increasing depth of tree →
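A minimal scikit-learn sketch of the Atlas example (the data and the labeling rule are made up for illustration, loosely following the tree on the previous slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
temp = rng.uniform(20, 100, 300)      # temperature (F)
snowing = rng.integers(0, 2, 300)     # 0/1
played = rng.integers(0, 2, 300)      # played outside yesterday, 0/1
X = np.column_stack([temp, snowing, played])
# Toy rule: playful when cool, unless it is snowing and he already played yesterday
y = ((temp < 80) & ~((snowing == 1) & (played == 1))).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["temp", "snowing", "played_yesterday"]))
```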
Decision tree structure
65
Root node
Leaf
Branch node
Branch
Tree depth
Nodes are built based on features (predictors) that best describe the classes/labels
shapeofdata.wordpress.com/2013/07/09/random-forests/
Splitting and stopping nodes
Splitting a node: Child nodes are constructed from additional features that best separate the classes/labels through the depth of the tree
Stopping at a leaf node: once a stopping criterion is reached (e.g., number of samples at the node, purity), the branch is complete
66
There are three common gain metrics (commonly: Gini impurity, entropy, and misclassification error)
Maximize information gain = minimize impurity
Greedy algorithm: Make the locally optimal decision at each node
67
[Scale: more impurity ↔ more purity]
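Written out for reference (the slide does not spell out the formulas), two of these impurity metrics for class proportions $p_k$ at a node are:

```latex
G = 1 - \sum_{k} p_k^{2} \quad \text{(Gini impurity)}, \qquad
H = -\sum_{k} p_k \log_2 p_k \quad \text{(entropy)}
```

A pure node (a single class, $p_k = 1$) gives $G = 0$ and $H = 0$; impurity is maximized when the classes are evenly mixed.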
Advantages of Decision Trees
68
codeproject.com/Articles/4047359/Step-by-Step-Guide-To-Implement-Machine-Learning-2
Disadvantages of Decision Trees
69
Reduce overfitting and improve predictions
70
[Diagram: training/validation splits as contiguous chunks]
Random forests are decision tree ensembles!
71
Forest icon made by Freepik from flaticon.com
Observed data/events:
Other meteorological data
Predictors:
Temperature, humidity, insolation, wind speed, wind direction, precipitation, date of year, etc.
Output
Ozone class (good, fair, poor)
Each tree in the forest is a series of questions: Is temperature above 80?
Is the wind from the west?
Is it raining?
Relative frequency of a label across all trees describes probabilistic occurrence (i.e. forecast)
What is a random forest?
-An ensemble of unique decision trees that become uncorrelated by selecting random subsets of features at each node
-Bagging instead of boosting
-Random sampling of training data
-Aggregated decision trees can significantly decrease the prediction variance with small increases to model bias (Herman and Schumacher 2016)
72
medium.com/@shvbajpai/how-to-master-python-for-machine-learning-from-scratch-a-step-by-step-tutorial-8c6569895cb0
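A minimal scikit-learn sketch of such a forest (the synthetic data and hyperparameter values are illustrative assumptions, not the workshop's ozone dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))       # stand-in meteorological predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 2] > 1.0)  # 3 toy ozone classes
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the training data
    random_state=42,
).fit(X_train, y_train)

print("validation accuracy:", rf.score(X_val, y_val))
probs = rf.predict_proba(X_val)   # label frequency across trees = probabilistic forecast
```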
Random Forest Development
Overfitting can still occur, particularly if a small subset of features are strong predictors
Bagging (bootstrap aggregating): each tree is trained on a random bootstrap subsample of the data
This produces unique decision trees that are then uncorrelated
73
[Diagram: data → bootstrap subsamples B1, B2, ..., BN → one decision tree per subsample]
Random Forest Tunable Parameters
This is not an exhaustive list but some of the most notable in our experience!
74
Why are RFs good for the job?
75
Where are they used?
76
Where are they used?
77
Where are they used?
78
DOI: 10.1175/MWR-D-17-0250.1
Now we will play with sample code!
Estimate ozone air quality class at Joshua Tree National Park using meteorological data
Try to beat the baseline! Validation weighted categorical accuracy = 0.537
Link: rf_ozone_joshuatree
79
Artificial Neural Networks (ANN)
βItβs all connectedβ
80
ANNs allow us to utilize a lot of information easily
81
data/input
prediction/output
Nonlinear function
ANN
?
So how does an ANN work?
An ANN consists of multiple nodes (the circles) and layers that are all connected; using basic math, it produces a result.
These are called feed forward networks!
82
https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
So what happens in the hidden layers?
In each individual node the values coming in are weighted and summed together and bias term is added and activation.
Linear regression with non-linear mapping by an βactivation functionβ
83
https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
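Here is what a single node's computation looks like in NumPy (a sketch; the weights and input values are arbitrary numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -1.3, 0.5])   # values coming into the node
w = np.array([0.4, 0.1, -0.7])   # one weight per incoming connection
b = 0.1                          # bias term

a = sigmoid(np.dot(w, x) + b)    # weighted sum + bias, then activation
print(a)
```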
Activation function?
The activation function determines whether information moves forward from that specific node.
This is the step that allows for nonlinearity in these algorithms, without activation all we would be doing is linear algebra!
84
So how do we train a network?
Training the network is merely determining the weights "w" and biases/offsets "b", with the addition of a nonlinear activation function.
85
Goal is to determine the best function so that the output is as correct as possible; typically involves choosing "weights"
How do we define βbestβ function?
This is where domain scientists can be very helpful!
You know the data and the goal you're working towards, so you know best which loss function to use.
Basic MSE or MAE works well for regression tasks!
86
Cost/Loss
Function
Find a quantity you want to minimize (the βlossβ) to help determine the weights
87
loss/cost = f(ANN prediction, truth/actual)
How do we find the minima in real life?
Let's start with an easy linear example
88
An example loss function (MSE)
Gradient Descent
89
When hiking in Colorado, if the path up isn't clear you choose the steepest incline, and you will find the top. Gradient descent is the same principle but reversed.
An example loss function (MSE)
Gradient descent is a technique to find the weights that minimize the loss function.
This is done by starting at a random point and calculating the gradient (the black lines) at that point. The negative of that gradient is then followed to the next point, and so on. This is repeated until the minimum is reached.
https://www.ibm.com/cloud/learn/gradient-descent
Gradient descent mathematically
90
So, let's consider our example loss function J. The gradient descent formula tells us that the next location depends on the negative gradient of J multiplied by the learning rate α: aₖ₊₁ = aₖ − α∇J(aₖ).
Since the RMSE (our loss function) depends on the linear function and its weights a₀ and a₁, the gradient is calculated as partial derivatives with respect to the weights.
https://www.ibm.com/cloud/learn/gradient-descent
Partial derivatives for RMSE (based on linear regression)
91
https://www.ibm.com/cloud/learn/gradient-descent
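For the closely related MSE loss $J = \frac{1}{N}\sum_i (a_0 + a_1 x_i - y_i)^2$, which has the same minimizer as the RMSE, the partial derivatives with respect to the weights are (shown here for reference):

```latex
\frac{\partial J}{\partial a_0} = \frac{2}{N}\sum_{i}\left(a_0 + a_1 x_i - y_i\right), \qquad
\frac{\partial J}{\partial a_1} = \frac{2}{N}\sum_{i}\left(a_0 + a_1 x_i - y_i\right) x_i
```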
Gradient descent mathematically
After taking the derivatives, the rest is easy!
92
The only other thing one must pay attention to is the learning rate α (how big of a step to take). Too small and finding the right weights takes forever, too big and you might miss the minimum.
https://www.ibm.com/cloud/learn/gradient-descent
Code it up!
93
Code it up!
94
# of iterations through the data
fit gets better with number of epochs
Code it up!
95
# of iterations through the data
You can do this for polynomials too!
fit gets better with number of epochs
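A minimal sketch of what such a "code it up" cell could look like, written directly from the equations above (toy data; this is not the workshop notebook's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = 1.5 * x - 0.5 + rng.normal(0, 0.2, 100)   # toy linear data

a0, a1 = 0.0, 0.0        # initial guess for the weights
alpha = 0.05             # learning rate
for epoch in range(400): # iterations through the data
    resid = a0 + a1 * x - y
    grad_a0 = 2 * np.mean(resid)       # dJ/da0 for the MSE loss
    grad_a1 = 2 * np.mean(resid * x)   # dJ/da1 for the MSE loss
    a0 -= alpha * grad_a0              # step along the negative gradient
    a1 -= alpha * grad_a1
print(a0, a1)            # the fit gets better with the number of epochs
```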
So far so good - this all looks super easy! Whatβs the catch?
We calculated the gradients by hand because we knew the functional form we wanted to fit (a polynomial).
96
So far so good - this all looks super easy! Whatβs the catch?
We calculated the gradients by hand because we knew the functional form we wanted to fit (a polynomial). But the whole point of an ANN is that you do not need to assume the optimal functional form of the equation ahead of time - it's probably really complex!
So what do we do now…
97
We train the model!
98
Backpropagation
99
What we need to know!
The partial derivative of the PREDICTED y with respect to the weights w.
Backpropagation
100
computational graph
Relies on chain rule
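To make the chain rule concrete, here is backpropagation "by hand" for a one-node network, checked against a finite-difference estimate (all numbers are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0                        # one input, one truth value
w1, b1, w2, b2 = 0.3, 0.1, -0.4, 0.2   # arbitrary weights and biases

h = sigmoid(w1 * x + b1)               # hidden node
yhat = w2 * h + b2                     # network output
loss = (yhat - y) ** 2

# Chain rule: dL/dw1 = dL/dyhat * dyhat/dh * dh/dz * dz/dw1
dL_dw1 = 2 * (yhat - y) * w2 * h * (1 - h) * x

# Finite-difference check: nudge w1 and see how the loss changes
eps = 1e-6
loss_eps = (w2 * sigmoid((w1 + eps) * x + b1) + b2 - y) ** 2
print(dL_dw1, (loss_eps - loss) / eps)  # the two values should nearly match
```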
Backpropagation
"by hand"
101
We encourage practicing backpropagation with pen and paper to help with visualizing the "black box" of ML calculations
want this
You hopefully never have to do that again!
Hurray for computers!
102
Great! Now you know how an ANN works!
Well…
Next we will talk about ANN architecture and decisions related to how to design an ANN
103
Some terminology
WHAT ML CALLS IT
Number of samples: number of individual realizations you have
Batch size: number of samples in a βchunkβ of training data that is iterated before updating the weights and biases
Epoch: number of times you train on all batches (number of times you go through your entire training set)
Deep Learning: training an ANN with two or more layers
104
Choices you need to make...
Feed-forward ANNs require choices by the user:
105
Lions, and Tigers, and Parameters… Oh My!
While this may seem like a lot of parameters, we deal with parameter choices every day…
ANN considerations
106
Train on the data the right number of times (epochs).
If we iterate too few times, our model will not have time to find the optimal fit. If we iterate too many times, we will overfit the training data.
Choosing the right number of layers and nodes for the given job is crucial.
Fewer layers/nodes allow for easier interpretability. More layers/nodes allow for more nonlinearity.
Choose an activation function most appropriate for your solution.
ReLU is a popular option.
Network Architectures
The way we put the ingredients, or pieces, together is called our network βarchitectureβ.
107
Regression Problem
Network Architectures
The way we put the ingredients, or pieces, together is called our network βarchitectureβ.
108
Classification Problem
https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
Softmax: Let's make the output a probability!
109
https://vitalflux.com/what-softmax-function-why-needed-machine-learning/
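The softmax itself is a one-liner: it exponentiates the raw outputs (logits) and normalizes them to sum to 1 (a sketch, with arbitrary example logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # -> approx [0.66, 0.24, 0.10]
```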
Gradient Descent in ML, Learning rate
Identify the right learning rate.
A learning rate too large may cause the solution to skip between local minima.
A learning rate too small may get stuck in one local minimum, and may also take longer to learn.
110
https://www.jeremyjordan.me/nn-learning-rate/
Gradient Descent in ML, Learning rate
111
https://cs231n.github.io/neural-networks-3/
Gradient Descent in ML, choices
112
Figure by Alec Radford
Gradient Descent in ML, choices
113
https://www.ibm.com/cloud/learn/gradient-descent
Let's Talk About Overfitting...
An ANN often has thousands (or millions) of weights to adjust
114
inputs
hidden layer 1
hidden layer 2
two outputs
each connection = a weight that must be "learned"
Let's Talk About Overfitting...
An ANN often has thousands (or millions) of weights to adjust
This often leads to egregious overfitting
Two common remedies:
115
The black line fits the data well, whereas the green one is overfitting.
https://elitedatascience.com/overfitting-in-machine-learning
Regularization:
L2 aka Ridge Regression
A regression tool to help prevent overfitting.
Adds a term to the loss/error function that penalizes the weights if they are too large.
Keeps the weights small.
116
Ridge Regression ("L2 Regularization")
Shrinkage term: forces the individual coefficients to be small
>> sharing of weight across coefficients
Ridge parameter: how much to penalize large weights
will force coefficients to share importance
Regularization:
L1 aka LASSO Regression
A regression tool to help prevent overfitting.
Adds a term to the loss/error function that penalizes the weights if they are too large.
Keeps the weights small.
117
LASSO Regression ("L1 Regularization") (least absolute shrinkage and selection operator)
Shrinkage term: punishes high values but actually sets them to exactly zero if not important
>> helpful for determining which features are most important
LASSO parameter: how much to penalize large weights
will force some coefficients to zero
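Side by side, the two penalized loss functions are (using λ for the ridge/LASSO parameter):

```latex
L_{\text{ridge}} = \frac{1}{N}\sum_{n}\left(\hat{y}_n - y_n\right)^2 + \lambda \sum_{j} w_j^{2},
\qquad
L_{\text{lasso}} = \frac{1}{N}\sum_{n}\left(\hat{y}_n - y_n\right)^2 + \lambda \sum_{j} \lvert w_j \rvert
```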
Dropout
During training (and training only!), randomly remove or βdrop outβ nodes, requiring that the ANN learn to still make accurate predictions in spite of losing these nodes. Dropout has the effect of forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.
118
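In Keras, dropout is a single layer inserted between the layers it should affect (a minimal sketch; the layer sizes and the 0.3 rate are arbitrary choices):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dropout(0.3),   # randomly zeroes 30% of nodes during training only
    tf.keras.layers.Dense(1),
])
```

Keras disables dropout automatically at prediction time, matching the "training only" caveat above.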
At last, testing
Training an ANN is all about updating the weights and biases.
This is done iteratively by comparing to the βtruthβ.
Once this is done and you are happy with the performance of your model, it's time for testing!
Testing then involves freezing the weights, and using the ANN as a predictor function.
119
ANN Coding Examples
Two examples
120
Predicting a Sine Wave with an ANN
121
We are going to train an ANN to predict a sine wave
The function we are predicting is the solid line, and the points we use are indicated in purple
Predicting a Sine Wave with an ANN
122
Now open the code in google colab, and find the cell titled Set some neural network parameters
Predicting a Sine Wave with an ANN
123
Now open the code in google colab, and find the cell titled Set some neural network parameters
This is where we set our architecture and hyper-parameters. To start with, enter the following values
lr = 0.01
batch_size = 32
n_epochs = 400
activation = 'sigmoid'
hiddens = [10, 100]
loss = 'mae'
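Those settings plug into a model roughly like the sketch below (an illustrative reconstruction, not the notebook's exact cell):

```python
import tensorflow as tf

lr, batch_size, n_epochs = 0.01, 32, 400
activation, hiddens, loss = "sigmoid", [10, 100], "mae"

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(n, activation=activation) for n in hiddens]
    + [tf.keras.layers.Dense(1)]   # single regression output
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss=loss)
# model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs,
#           validation_data=(x_val, y_val))
```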
Predicting a Sine Wave with an ANN
124
Run the code!
Predicting a Sine Wave with an ANN
125
How does the model do?
Predicting a Sine Wave with an ANN
126
How does the model do?
Predicting a Sine Wave with an ANN
127
How does the model do?
Predicting a Sine Wave with an ANN
128
Play with the network parameters
Predicting a Sine Wave with an ANN
129
A fun example of the effect of auto-correlation between training, validation, and testing
Predicting a Sine Wave with an ANN
130
A fun example of the effect of auto-correlation between training, validation, and testing
Predicting a Sine Wave with an ANN
131
A fun example of the effect of auto-correlation between training, validation, and testing
Change this so that
randomselect = 0
Predicting a Sine Wave with an ANN
132
A fun example of the effect of auto-correlation between training, validation, and testing
lr = 0.01
batch_size = 36
n_epochs = 400
activation = 'sigmoid'
hiddens = [10, 100]
loss = 'mae'
Predicting a Sine Wave with an ANN
133
A fun example of the effect of auto-correlation between training, validation, and testing
lr = 0.01
batch_size = 36
n_epochs = 400
activation = 'sigmoid'
hiddens = [10, 100]
loss = 'mae'
Predicting a Sine Wave with an ANN
134
A fun example of the effect of auto-correlation between training, validation, and testing
lr = 0.01
batch_size = 36
n_epochs = 400
activation = 'relu'
hiddens = [10, 100]
loss = 'mae'
Predicting a Sine Wave with an ANN
135
A fun example of the effect of auto-correlation between training, validation, and testing
lr = 0.01
batch_size = 36
n_epochs = 400
activation = 'relu'
hiddens = [10, 100]
loss = 'mae'
Predicting ENSO with an ANN
Another example we know the answer to.
136
Defining ENSO...
137
Toms et al., JAMES (2020)
Figure courtesy of NCAR Climate Data Guide
Niño 3.4 index
ENSO is commonly defined according to average sea-surface temperatures within the central tropical Pacific.
ENSO + Neural Networks
138
Toms et al., JAMES (2020)
Predicting ENSO with an ANN
139
Include only samples where an El Niño or La Niña event is occurring (i.e., Niño3.4 > 0.5 or Niño3.4 < −0.5)
Split into train, validation, and test sets by date:
Predicting ENSO with an ANN
140
Now open the code in google colab, and find the cell titled Set some neural network parameters
Once, you have set the parameters, run the code!
To start with, use the following values:
hiddens = [12, 12]
ridgepen = 1
lr = 1e-3
n_epochs = 20
batch_size = 32
activation = 'relu'
loss = 'categorical_crossentropy'
Predicting ENSO with an ANN
141
How does the model do?
Predicting ENSO with an ANN
142
How does the model do on test data?
Predicting ENSO with an ANN
143
How does the model do on test data?
Confusion matrix:
Predicting ENSO with an ANN
144
How does the model do on test data?
Predicting ENSO with an ANN
145
Now go back to the neural network parameters.
Re-train the network using different parameters.
How do the predictions change?
Advanced ANN Techniques
146
Convolutional NN
Cat or Dog?
147
Convolutional Layer
Benefits:
Convolutional NN
Cat or Dog?
148
It has cat ears - it's a cat
Convolutional Layer
Convolutional NN
Used for image detection
Initial Training:
• Edges, Corners, Shapes (circles, squares)
More Training:
• Objects (Ears, Eyes)
Even More Training:
• Complex objects (cats, dogs)
149
Filters
150
Filters
151
Convolving
152
Convolving
153
CNN layer for a single multi-channel kernel
Perform convolution across channels
Sum across channels
Add bias term
Zero padding in action
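These steps can be written out directly in NumPy (a sketch of "same" convolution with zero padding and odd-sized kernels; note that CNN libraries actually implement cross-correlation, as here):

```python
import numpy as np

def conv2d_same(channel, kernel):
    """'Same'-size 2D cross-correlation of one channel with one kernel slice."""
    kH, kW = kernel.shape
    padded = np.pad(channel, ((kH // 2, kH // 2), (kW // 2, kW // 2)))  # zero padding
    H, W = channel.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kH, j:j + kW] * kernel)
    return out

def cnn_single_kernel(image, kernels, bias=0.0):
    """image: (C, H, W); kernels: (C, kH, kW).
    Convolve each channel, sum across channels, add the bias term."""
    return sum(conv2d_same(image[c], kernels[c])
               for c in range(image.shape[0])) + bias
```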
Connecting CNNs to Fully Connected Networks
156
Example courtesy of Imme Ebert-Uphoff
Pooling Layer
157
Seasonal ENSO prediction
158
Task: Predict Niño3.4 index __ months into the future using maps of the ocean state
Figure from Ham et al. (2019)
Advantages of CNNs
159
Satellite Image -> Synthetic Radar Image
https://journals.ametsoc.org/view/journals/apme/60/1/jamc-d-20-0084.1.xml
Kyle Hilburn
[CSU/CIRA]
U-Net Architecture
Inputs + Labels
Results!
Truth
CNN Prediction
Image Segmentation with a CNN
Before training…
After training…
ClimateNet - using experts to label
https://gmd.copernicus.org/articles/14/107/2021/
Image Segmentation w/ ClimateNet
https://gmd.copernicus.org/articles/14/107/2021/
Truth (Labeled)
Predicted by Trained CNN (trained on ClimateNet)
Autoencoders
170
https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
Even more advanced techniques
171
If your activation function is linear, this becomes the PCs of the leading EOFs
Generative Adversarial Networks
172
Figure by Atienza, Rowel. Advanced Deep Learning with Keras: Apply deep learning techniques, autoencoders, GANs, variational autoencoders, deep reinforcement learning, policy gradients, and more. Packt Publishing Ltd, 2018.
Even more advanced techniques
Visualizing Climate Change & Associated Hazards
Style GANs generated these faces. They are not real.
Cycle GAN
Cycle GAN
GANs for Nowcasting
Fig. 1: Model overview and case study of performance on a challenging precipitation event starting 24 June 2019 at 16:15 UK time, showing convective cells over eastern Scotland.
Instructions for installing Tensorflow on an Apple M1 Chip
178
https://makeoptim.com/en/deep-learning/tensorflow-metal
https://developer.apple.com/metal/tensorflow-plugin/
https://github.com/apple/tensorflow_macos
https://github.com/conda-forge/miniforge
In a nutshell: install Miniforge, then follow the tensorflow-metal plugin instructions at the links above.
Ethical Use of AI in Earth Science
179
A fun motivational story
180
Sims (1994) https://www.karlsims.com/papers/alife94.pdf
A fun motivational story
181
Lehman et al. (2020) https://doi.org/10.1162/artl_a_00319
Using AI to hire new workers
Goal: Feed an AI all resumes and have the AI return the top candidates
182
Using AI to hire new workers
Goal: Feed an AI all resumes and have the AI return the top candidates
Problem: AI had a gender bias no matter how it was trained
183
AI Ethics: using AI methods in an ethical manner so as not to cause harm to others. Accomplished through careful thought in data processing, AI development, and deployment.
184
185
186
Example of biased data in Earth science
Radar coverage is unequal
187
McGovern et al. 2021 https://doi.org/10.1175/BAMS-D-18-0195.1
Bias can be introduced unintentionally!
188
Chase and McGovern (2022) "Deep Learning Parameter Considerations When Using Radar and Satellite Measurements"
What can you do?
189
Implementation and Assessment of an ANN
190
Different regularization techniques
191
L2 Regularization: an example
192
Barnes et al. (2020): https://doi.org/10.1029/2020MS002195
Performance Measures
Confusion Matrix: summary of prediction results
193
*Note: these are only used for classification problems!
Assessment Scores:
Precision = TP / (TP + FP); (between 0 and 1)
Accuracy of positive predictions
Recall = TP/ (TP + FN); (between 0 and 1)
Ratio of positive instances correctly detected by classifier
F1 score: harmonic mean of precision and recall; (between 0 and 1)
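scikit-learn computes all of these directly from the prediction/truth pairs (a small sketch with made-up labels):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # made-up truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # 2PR / (P + R)
```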
Performance Measures
Receiver Operating Characteristic (ROC)
194
Performance Measures
Receiver Operating Characteristic (ROC)
195
Assessment Scores: measure the Area Under the Curve (AUC)
*remember, you (the scientist) decide what "better" means!
Performance Measures
Loss & Accuracy Assessment: User-Defined Loss Function
Overfit Model: training outperforms validation
196
Performance Measures
Loss & Accuracy Assessment: User-Defined Loss Function
Overfit Model: training outperforms validation
Generalized Model: training matches validation!
197
Permutation Importance
198
A way to detect features that are more important for accurate predictions, Breiman (2001)
For each feature, the feature values are shuffled, and new predictions made (with the shuffled feature values included).
If the model error increases after the shuffling, this feature is considered important. If there is no change in the model error, the feature in question adds no value to the model prediction.
https://christophm.github.io/interpretable-ml-book/feature-importance.html
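A single-pass version of this shuffle-and-rescore loop can be sketched as follows (it assumes model has a predict method and metric returns an error to be minimized):

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Single-pass permutation importance: error increase after shuffling each feature."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, i])   # shuffle only feature i, keep the rest intact
            scores.append(metric(y, model.predict(Xp)))
        importances[i] = np.mean(scores) - baseline   # big increase = important feature
    return importances
```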
Permutation Importance
199
What if x is highly correlated with another predictor? → Multi-pass permutation
Freeze the initially permuted predictor that degrades model performance the most, then re-permute the other predictors to select the second frozen predictor; continue until all features are permuted and frozen
In comparison to the baseline model skill, can determine which predictors impact skill the most
Rank predictors based on error increase after permutation
Lakshmanan et al. (2015)
McGovern et al. (2019)
Time to code!
https://github.com/eabarnes1010/ml_tutorial_csu
From Github, open ann_ozone_joshuatree_metrics.ipynb
200
Methods of Explainable AI (XAI) for ANNs
201
Overview
1) Introduction to XAI:
i) Motivation for XAI
ii) The general idea of how XAI works
iii) Opportunities that XAI brings
iv) Representative methods and categories of XAI
2) Popular XAI methods:
i) Gradient
ii) Input*Gradient
iii) Layer-wise Relevance Propagation
iv) SHAP - SHapley Additive exPlanations
3) Benchmarking XAI:
i) Motivation - General idea
ii) Examples
4) Summary
202
Introduction to XAI
Introduction to XAI
Why is XAI necessary?
204
Scientists need to understand what the AI model is doing; what the decision-making process is.
Linear model: inherently interpretable
Neural Network: not inherently interpretable
Why is XAI necessary?
205
From Adebayo et al. (2020)
Methods of eXplainable Artificial Intelligence (XAI) aim to explain how a Neural Network makes predictions, i.e., what the decision strategy is.
XAI methods highlight which features in the input space are important for the prediction: They produce the so-called explanation/relevance heatmaps.
Explanation (or relevance) Heatmap
Why is XAI necessary?
206
tabby cat
white wolf
ram
black widow
Explanation (or relevance) Heatmap
Network Input
Any questions?
207
XAI: A potential game changer for prediction in Earth Sciences
XAI may help accelerate establishing new science, like investigating new climate teleconnections and gaining new insights.
From Mamalakis et al. (2022)
XAI helps calibrate model trust and physically interpret the network, which is a necessity in many applications in Earth Sciences.
XAI may help fine-tune and optimize the strategy of a flawed model
208
XAI methods and categories
AI models
Interpretable models
Post-hoc Explainable models
Global XAI methods
Local XAI methods
sensitivity
attribution
(e.g., optimal input, permutation importance)
(e.g., Gradient)
(e.g., LRP, SHAP)
(e.g., linear models, decision trees)
From Samek et al. (2021)
Introduction to XAI
Popular XAI Methods
210
Gradient (sensitivity)
The relevance of feature i for prediction n is the partial derivative of the network output F with respect to the value of feature i, evaluated at sample n:
R_i^n = (∂F/∂x_i)|_{x = x^n}
211
Analogy in the simple pendulum setting: the ball's velocity and the ball's acceleration are derivatives, i.e., sensitivities of one quantity (position, velocity) with respect to another (time).
Input*Gradient (attribution)
The relevance of feature i for prediction n is the input times the gradient:
R_i^n = x_i^n · (∂F/∂x_i)|_{x = x^n}
212
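With a trained Keras model, both maps come from one GradientTape pass (a sketch; model, x_sample, and class_idx are hypothetical placeholders for your own network, input batch, and class of interest):

```python
import tensorflow as tf

# model: a trained Keras classifier; x_sample: one batch of inputs (hypothetical)
x = tf.convert_to_tensor(x_sample, dtype=tf.float32)
class_idx = 0   # index of the output class to explain (assumption)

with tf.GradientTape() as tape:
    tape.watch(x)                     # track gradients with respect to the input
    preds = model(x)
    class_score = preds[:, class_idx]

grad = tape.gradient(class_score, x)  # Gradient (sensitivity) heatmap
input_x_grad = x * grad               # Input*Gradient (attribution) heatmap
```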
LRP: Layer-wise Relevance Propagation (attribution)
LRP-z rule: R_i^(l) = Σ_j [ z_ij / Σ_i′ z_i′j ] · R_j^(l+1), where R_i^(l) is the relevance of neuron i in layer l and z_ij is the preactivation from neuron i to neuron j.
Other popular LRP rules exist (e.g., LRP-ε, LRP-αβ).
From Bach et al. (2015)
Any questions?
213
SHAP: SHapley Additive exPlanations (attribution)
Consider the general class of explanation models:
g(z′) = φ₀ + Σᵢ φᵢ z′ᵢ    (1)
Any XAI method that can be represented as in Eq. (1), we will say it is an additive feature attribution method.
LRP and other popular XAI methods (e.g., LIME, DeepLIFT) are essentially different solutions to Eq. (1).
In other words, the best solution to (1) is to use Shapley values; LRP is not the best solution to (1).
Here φᵢ is the attribution to feature i, given the network's prediction for an input.
From Lundberg and Lee (2017)
214
SHAP (attribution)
215
The Shapley value of feature i averages its marginal contribution over all subsets S of the other features:
φᵢ = Σ_{S ⊆ F\{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] · [ f(S ∪ {i}) − f(S) ]
SHAP (attribution)
216
SHAP (attribution)
217
SHAP (attribution)
For problems with high dimensions (e.g., more than 10 features), the number of times one would need to retrain and evaluate the model to calculate the Shapley values is extremely high. This is computationally inefficient, so the SHAP method uses an approximate algorithm (Deep SHAP), specifically designed for deep neural networks. Deep SHAP is similar to LRP, except that instead of propagating the relevance, it propagates the Shapley values.
Also, there is no model retraining in Deep SHAP. When a specific feature (or neuron output) needs to be withheld (i.e., considered missing), it is replaced with a background value, which is usually the average value in the training dataset.
218
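With the shap package this looks like the following (a sketch; model, X_train, and X_test are hypothetical placeholders for a trained Keras network and its data):

```python
import shap

background = X_train[:100]                        # background values stand in for "missing" features
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(X_test[:10])  # one attribution map per output class
```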
SHAP Example on MNIST
CNN
Class 0
Class 1
Class 2
.
.
.
Class 8
Class 9
Input
Class 0
Class 1
Class 2
Class 3
Class 4
Class 5
Class 6
Class 7
Class 8
Class 9
Any questions?
Introduction to XAI
Benchmarking XAI
220
XAI method
The need for objectivity in assessing XAI
Issues : 1) No ground truth to assess the estimated explanations.
Which input features were important for this classification?
XAI heatmap
Debunking the phrase: "The explanation looks reasonable"
Remember: The human perception of the explanation alone is NOT a solid criterion for its trustworthiness.
From Adebayo et al. (2020)
221
Issues : 1) No ground truth to assess the estimated explanations.
2) Different methods provide different answers.
This is problematic: the uncertainty in how the network decides leads to limited trust when using neural networks in environmental problems.
Many Different XAI methods
We need objective frameworks to rigorously assess XAI methods and gain insights about relative strengths and weaknesses.
From Adebayo et al. (2020)
The need for objectivity in assessing XAI
Which input features were important for this classification?
222
From Mamalakis et al. (2021)
Attribution benchmarks for XAI
223
From Mamalakis et al. (2021)
Regression Benchmark - Fully Connected Network
Any questions?
224
From Mamalakis et al. (2022)
Classification Benchmark - Convolutional Network
225
From Mamalakis et al. (2022)
Classification Benchmark - Convolutional Network
Best practices of XAI
226
From Mamalakis et al. (2022)
Introduction to XAI
Summary
Key take home messages
References
229
J. Adebayo et al. (2020) "Sanity checks for saliency maps," arXiv preprint, https://arxiv.org/abs/1810.03292.
Samek et al. (2021) "Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications," Proceedings of the IEEE, vol. 109, no. 3, pp. 247-278, March 2021, doi: 10.1109/JPROC.2021.3060483.
Bach et al. (2015) "On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation," PLOS ONE, https://doi.org/10.1371/journal.pone.0130140.
Other Resources
INNVESTIGATE
SHAP
Mamalakis, A., I. Ebert-Uphoff and E.A. Barnes (2021) "Neural Network Attribution Methods for Problems in Geoscience: A Novel Synthetic Benchmark Dataset," arXiv preprint, https://arxiv.org/abs/2103.10005, accepted in Environmental Data Science.
Mamalakis, A., I. Ebert-Uphoff, E.A. Barnes "Explainable Artificial Intelligence in Meteorology and Climate Science: Model Fine-Tuning, Calibrating Trust and Learning New Science," in Beyond Explainable Artificial Intelligence, Holzinger et al. (Eds.), Springer Lecture Notes on Artificial Intelligence, open access at: https://link.springer.com/chapter/10.1007/978-3-031-04083-2_16
Lundberg, S. M. and S. I. Lee (2017) "A unified approach to interpreting model predictions," Proc. Adv. Neural Inf. Process. Syst., pp. 4768-4777.
My GitHub page
Mamalakis, A., E.A. Barnes, I. Ebert-Uphoff (2022) "Investigating the Fidelity of Explainable Artificial Intelligence Methods for Applications of Convolutional Neural Networks in Geoscience," arXiv preprint, https://arxiv.org/abs/2202.03407, accepted in Artificial Intelligence for the Earth Systems.
Simple Uncertainty Quantification
Can a network predict how right it is?
230
Uncertainty Quantification
231
For prediction problems in the Earth sciences, we usually want some sort of uncertainty or likelihood associated with the prediction.
E.g. in forecasting we generate ensembles
Image from https://www.metoffice.gov.uk/research/weather/ensemble-forecasting/decision-making
Uncertainty Quantification
232
This allows us to quantify the likelihood of an event, as well as build trust in our models
Image from https://www.metoffice.gov.uk/research/weather/ensemble-forecasting/decision-making
Uncertainty Quantification in ML
233
We want to build uncertainty quantification into machine learning models for the same reasons:
There will be 2 cm of snow this weekend!
There will be 2 Β± 2 cm of snow this weekend!
Uncertainty Quantification in ML
234
We want to build uncertainty quantification into machine learning models for the same reasons:
Here we will go through one way of adding uncertainty to NNs but there are all sorts of other methods you may encounter
235
Input Layer
Hidden Layer(s)
Output Layer
Consider a simple regression taskβ¦
Predictors, xᵢ
Prediction, a specific number, ŷ
236
Input Layer
Hidden Layer(s)
Output Layer
Consider a simple regression taskβ¦
Based on ocean heat informationβ¦
β¦predict SST anomaly at a point in the ocean
237
Consider a simple regression taskβ¦
In a regression model, we want to train a network to minimize the error between its predictions and the truth.
Loss function could be mean absolute error, or mean squared error, e.g., ℓ = |ŷ − y|
And we can make a scatter plot of the truth (y) vs. predictions (ŷ)
238
Consider a simple regression taskβ¦
Obviously some predictions are better than others.
Wouldn't it be great if there were some way of knowing how good a prediction is, i.e., to get the network to estimate an uncertainty range?
Adding Uncertainty to Regression Tasks
Rather than the network outputting a single number as its estimate…
We want the network to output an estimate and an uncertainty range
239
ŷ: prediction
σ: associated uncertainty
Adding Uncertainty to Regression Tasks
Rather than the network outputting a single number as its estimate…
We want the network to output an estimate and an uncertainty range
240
We train the neural network to predict conditional distributions, i.e., each prediction is the set of parameters of a probability distribution
Predicted anomaly
Uncertainty range
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
Adding Uncertainty to Regression Tasks
Rather than the network outputting a single number as its estimate…
We want the network to output an estimate and an uncertainty range
241
The simplest version of this is predicting a Gaussian: predicting a mean (μ) and a standard deviation (σ)
Predicted anomaly
Uncertainty range
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
Adding Uncertainty to Regression Tasks
242
μ: predicted anomaly, e.g., 1.3
σ: uncertainty, e.g., 1.2
Predicted μ and σ are used to construct a normal distribution
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
Adding Uncertainty to Regression Tasks
243
μ: predicted anomaly, e.g., 1.3
σ: uncertainty, e.g., 1.2
Evaluate p, the probability density function, at the true value of the anomaly (y true)
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
Adding Uncertainty to Regression Tasks
244
μ: predicted anomaly, e.g., 1.3
σ: uncertainty, e.g., 1.2
Loss function is defined as ℓ = −log(p), where p is evaluated at the true value (y true)
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
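One way to implement this loss in TensorFlow is to let the network output two numbers, interpreted as μ and log σ (a sketch consistent with the slides; the paper's exact parameterization may differ):

```python
import numpy as np
import tensorflow as tf

def gaussian_nll(y_true, y_pred):
    """Negative log-likelihood of y_true under N(mu, sigma);
    the network's two outputs are interpreted as mu and log(sigma)."""
    mu = y_pred[:, 0]
    sigma = tf.math.exp(y_pred[:, 1])   # exponentiate so sigma stays positive
    y = tf.squeeze(y_true)
    logp = (-0.5 * np.log(2.0 * np.pi) - tf.math.log(sigma)
            - 0.5 * tf.square((y - mu) / sigma))
    return -tf.reduce_mean(logp)        # minimize -log(p)

# model.compile(optimizer="adam", loss=gaussian_nll)
```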
Adding Uncertainty to Regression Tasks
245
Loss function is defined as ℓ = −log(p)
To minimize the loss, the network attempts to maximize p (the density at the true value, y true)
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
Adding Uncertainty to Regression Tasks
246
The network must hence learn to make good anomaly predictions, but also reasonable uncertainty estimates
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
Adding Uncertainty to Regression Tasks
247
Now predictions have an uncertainty range
Here we plot the 1σ range associated with each prediction
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
A Fun and Important Point!
248
Loss function: ℓ = −log(p)
This is simply a (negative) log likelihood!
The predicted distribution does not have to be a Gaussian!!
Consider for example the sinh-arcsinh (SHASH) normal distribution…
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
A Fun and Important Point!
249
The SHASH normal distribution
Examples of SHASH with tail weight set to 1 (τ = 1)
Jones and Pewsey: doi.org/10.1111/j.1740-9713.2019.01245.x
A Fun and Important Point!
250
The SHASH normal distribution
Note if skewness=0, then we have a normal distribution!
Jones and Pewsey: doi.org/10.1111/j.1740-9713.2019.01245.x
Using SHASH
251
Barnes, Barnes and Gordillo: https://arxiv.org/abs/2109.07250
Application:
Predicting SST on decadal timescales
252
Predicting SST on decadal timescales
253
Input layer: OHC to 100 m, OHC to 300 m, OHC to 700 m
Neural network: hidden layers of size (60, 4)
Output layer: μ and σ
Gordon and Barnes: essoar.org/doi/abs/10.1002/essoar.10510836.1
Predicting SST on decadal timescales
254
Output layer: μ and σ
Prediction of SST anomaly with uncertainty at some point in the ocean, 1-5 years later, e.g. North Atlantic Subpolar Gyre
53N, 35W
Identifying State-Dependent Predictability
255
The neural network identifies more predictability by assigning lower uncertainty values to input patterns that result in more predictable outcomes
For low uncertainty predictions, the ANN is more confident its prediction is closer to the truth
Prediction error (difference between truth and predicted anomaly) decreases as we sort by increasing confidence
More confident = lower uncertainty
What can we learn from this?
We train a neural network to predict SST in 1-5 years at every grid point in the ocean (i.e. one* neural network per grid point).
We can then compare the skill of our predictions across all samples in the testing set…
And we can leverage the uncertainty (σ) predictions to identify increased predictability
256
Resources if you want more UQ
257
Code! There is a coding example in the GitHub repo to go through in your own time.
Papers! We have a quick explainer on this method
And I have a paper in review on adapting this in my own (Emily Gordon) research
Main Takeaways
END