1 of 26

Controlled Abstention Networks (CAN)

Prof. Elizabeth A. Barnes & Prof. Randal J. Barnes

presentation to AI2ES

May 2021

2 of 26

Controlled Abstention Networks (CAN)

The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging:

We cannot expect to make perfect predictions all of the time.

3 of 26

The abstention loss works by incorporating uncertainty in the network’s prediction to identify the more confident samples and abstain (say “I don’t know”) on the less confident samples.

...the abstention loss is applied during training to preferentially learn from the more confident samples.

4 of 26

Launching point

Had the idea of abstention networks back in 2018 (i.e. “IDK” networks) but had no clue how to do it.

Never would have made it this far without the great PhD work of Dr. Sunil Thulasidasan!

Our results build on the groundwork laid by his paper and dissertation for classification tasks.

5 of 26

Manuscripts submitted + arXiv

6 of 26

General Idea

  1. Estimate the uncertainty of each prediction during training
    • Classification: simple, just use the softmax output
    • Regression: we need a way to predict uncertainty
  2. A loss function that identifies the more confident predictions and learns them better
    • Classification: we introduce the NotWrong loss
    • Regression: we introduce a modified negative log-likelihood
  3. Compare to baseline methods that filter out samples post-training
    • While the baseline methods perform very well, we find that the abstention method outperforms the baselines for a variety of tasks

7 of 26

Adding uncertainty to regression tasks

* write-up coming soon...

8 of 26

Uncertainty for regression tasks

  • Want a neural network output layer that makes a prediction and estimates the uncertainty

Barnes and Barnes (in prep)

10 of 26

Uncertainty for regression tasks

  • Want a neural network output layer that makes a prediction and estimates the uncertainty
  • Predict the parameters of a known distribution/PDF (e.g., a Gaussian)
  • Use maximum likelihood estimation (i.e., the negative log-likelihood) as the loss function (a minimal sketch follows below)

Barnes and Barnes (2021)

Barnes and Barnes (in prep)

Powerful Baseline Approach
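A minimal sketch of this kind of distributional output layer and loss, written here in TensorFlow/Keras. The two-unit output (mean and log standard deviation), the helper name gaussian_nll, and the small architecture are illustrative assumptions, not the exact implementation from the paper or repository:

```python
import tensorflow as tf

def gaussian_nll(y_true, y_pred):
    """Negative log-likelihood of y_true under a Gaussian whose parameters are
    predicted by the network: y_pred[:, 0] is the mean (mu) and y_pred[:, 1] is
    the log of the standard deviation (sigma). The constant 0.5*log(2*pi) is
    dropped because it does not affect training."""
    mu = y_pred[:, 0]
    sigma = tf.exp(y_pred[:, 1])  # exponentiate so sigma is always positive
    y = tf.squeeze(y_true)
    return tf.reduce_mean(tf.math.log(sigma) + 0.5 * tf.square((y - mu) / sigma))

n_features = 10  # placeholder input size for illustration

# Ordinary dense network; the only change is the two-unit linear output (mu, log sigma).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="linear"),
])
model.compile(optimizer="adam", loss=gaussian_nll)
```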

11 of 26

Adding abstention

12 of 26

Abstention During Training

  • Abstention loss is very similar for both classification and regression
  • Regression loss is a modified log loss, weighted by the “prediction weight” determined by the uncertainty sigma (a sketch of this structure follows below)
  • An additional term penalizes abstention
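The exact functional form is given in Barnes and Barnes (2021); the sketch below only illustrates the structure described above, assuming the Gaussian negative log-likelihood from earlier as the baseline loss. The particular weight function, the data-specific scale `spread`, and the penalty coefficient `penalty` are illustrative choices, not the paper's definitions:

```python
import tensorflow as tf

def abstention_regression_loss(spread=1.0, penalty=0.5):
    """Illustrative abstention-style regression loss (not the exact formula from
    Barnes and Barnes 2021). A per-sample "prediction weight" shrinks toward zero
    as the predicted sigma grows relative to a data-specific scale (`spread`), so
    confident samples dominate the gradient; a second term penalizes abstention so
    the network cannot simply inflate sigma everywhere."""
    def loss(y_true, y_pred):
        mu = y_pred[:, 0]
        sigma = tf.exp(y_pred[:, 1])
        y = tf.squeeze(y_true)

        # baseline -log(p): Gaussian negative log-likelihood
        nll = tf.math.log(sigma) + 0.5 * tf.square((y - mu) / sigma)

        # prediction weight in (0, 1]: near 1 for confident samples
        weight = spread**2 / (spread**2 + tf.square(sigma))

        # weighted loss plus a term that controls the amount of abstention
        return tf.reduce_mean(weight * nll - penalty * tf.math.log(weight + 1e-7))
    return loss
```

In this sketch, a larger `penalty` makes abstention more expensive (less abstention), while a smaller value lets the network abstain on more samples.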

14 of 26

Abstention During Training

  • Abstention loss is very similar for both classification and regression
  • Regression loss is a modified log loss, weighted by the “prediction weight” determined by the uncertainty sigma
  • An additional term penalizes abstention
  • alpha: the abstention fraction can be set by a PID controller, or the user can have the network predict the best abstention fraction (a PID sketch follows below)

(Figure: the abstention loss equation, annotated with the prediction weight, the baseline -log(p) term, the term that controls the amount of abstention, and the data-specific scale.)
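A minimal sketch of the PID idea: after each epoch, adjust the abstention-penalty coefficient so that the observed abstention fraction tracks the chosen setpoint. The class name, gains, and update rule are illustrative assumptions, not the controller settings used in the paper:

```python
class AbstentionPIDController:
    """Tune the abstention-penalty coefficient alpha so that the fraction of
    abstained samples tracks a user-chosen setpoint (illustrative sketch)."""

    def __init__(self, setpoint, kp=1.0, ki=0.1, kd=0.0, alpha0=0.5):
        self.setpoint = setpoint          # target abstention fraction, e.g. 0.7
        self.kp, self.ki, self.kd = kp, ki, kd
        self.alpha = alpha0
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed_fraction):
        """Call once per epoch with the fraction of samples the network abstained on."""
        # Abstaining more than the setpoint gives a positive error, which raises
        # alpha and penalizes abstention harder on the next epoch (and vice versa).
        error = observed_fraction - self.setpoint
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.alpha = max(0.0, self.alpha
                         + self.kp * error
                         + self.ki * self.integral
                         + self.kd * derivative)
        return self.alpha
```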

15 of 26

A simple 1D example

Barnes and Barnes (2021)

(Figure annotations: 20% of the data; 80% of the data.)

19 of 26

A more complex example

20 of 26

Synthetic Climate Data

  • Created by CSU postdoc Dr. Antonios Mamalakis
  • Each sample is one global map of “SSTs” computed from real-world spatial covariances
  • Different piecewise-linear functions determine how all of the grid points (pixels) combine to give a single “y” label (a sketch of this construction follows below)

Mamalakis, Ebert-Uphoff and Barnes (2021)

(Figure: an example synthetic “SST” map with label y = -0.019.)
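A minimal NumPy sketch of the construction described above. The knot counts, random parameters, and the i.i.d. example map are illustrative placeholders; the actual benchmark of Mamalakis, Ebert-Uphoff and Barnes (2021) draws maps with realistic spatial covariance and uses its own piecewise-linear functions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_grid = 100   # number of grid points ("pixels") per map -- placeholder size
n_knots = 6    # breakpoints per piecewise-linear response function -- placeholder

# Each grid point i gets its own piecewise-linear function F_i, defined here by
# random knot locations and knot values; the label is the sum of F_i(x_i).
knot_x = np.sort(rng.uniform(-3.0, 3.0, size=(n_grid, n_knots)), axis=1)
knot_y = rng.normal(size=(n_grid, n_knots))

def label_from_map(sst_map):
    """Combine all grid-point values of one "SST" map into a single y label."""
    contributions = [np.interp(sst_map[i], knot_x[i], knot_y[i])
                     for i in range(n_grid)]
    return float(np.sum(contributions))

# Example: one map drawn i.i.d. here purely for illustration.
sample_map = rng.normal(size=n_grid)
y = label_from_map(sample_map)
```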

21 of 26

Synthetic Climate Data

  • Created by CSU postdoc Dr. Antonios Mamalakis
  • Each sample is one global map of “SSTs” computed from real-world spatial covariances
  • Different piecewise linear functions determine how all of the grid points (pixels) combine to give a single “y” label
  • Network Task: predict the value “y” for each sample

Mamalakis, Ebert-Uphoff and Barnes (2021)

Barnes and Barnes (2021)

(Figure: schematic of a fully connected network with an input layer, two hidden layers, and an output layer; example input map with label y = -0.019.)

22 of 26

Synthetic Climate Data

  • Created by CSU postdoc Dr. Antonios Mamalakis
  • Each sample is one global map of “SSTs” computed from real-world spatial covariances
  • Different piecewise linear functions determine how all of the grid points (pixels) combine to give a single “y” label
  • Network Task: predict the value “y” for each sample

EXPERIMENT: Forecasts of Opportunity

    • All ENSO+ samples (average over the ENSO region > 0.5) are left untouched
    • 100% of the other samples are corrupted (shuffled); see the sketch below
    • Result: 29% of the samples are untouched, 71% are corrupt

Mamalakis, Ebert-Uphoff and Barnes (2021)

Barnes and Barnes (2021)

(Figure: the same network schematic and example map (y = -0.019); a sample is a forecast of opportunity when the average in the ENSO box is > 0.5.)
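A minimal sketch of this corruption step, assuming that “corrupted (shuffled)” means the labels of the non-ENSO+ samples are shuffled among themselves (one plausible reading of this slide); the function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_non_enso_labels(y, enso_index, threshold=0.5):
    """Shuffle the labels of every sample that is not an ENSO+ event.

    `enso_index` holds each sample's average over the ENSO box; samples with
    enso_index > threshold keep their true labels (forecasts of opportunity),
    while all others get labels randomly shuffled among themselves, destroying
    any learnable input-label relationship for those samples."""
    y_corrupt = np.asarray(y).copy()
    enso_index = np.asarray(enso_index)
    corrupt_mask = enso_index <= threshold            # ~71% of samples on this slide
    y_corrupt[corrupt_mask] = rng.permutation(y_corrupt[corrupt_mask])
    return y_corrupt
```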

23 of 26

Abstention outperforms baseline

  • Train abstention network for different abstention setpoints

  • The best CAN models are always better (lower error) than the best baseline ANN (an evaluation sketch follows below)

Barnes and Barnes (2021)
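For reference, a minimal sketch of how the baseline comparison can be evaluated: the baseline ANN is trained with the Gaussian log-likelihood from earlier, and at evaluation time only the most confident fraction of samples (smallest predicted sigma) is kept. The MAE metric and function names are illustrative assumptions:

```python
import numpy as np

def error_at_coverage(y_true, mu_pred, sigma_pred, coverage):
    """Post-training filtering baseline: keep the `coverage` fraction of samples
    with the smallest predicted sigma (the most confident predictions) and
    compute the error on just those samples (illustrative MAE metric)."""
    n_keep = max(1, int(round(coverage * len(y_true))))
    keep = np.argsort(sigma_pred)[:n_keep]
    return float(np.mean(np.abs(np.asarray(y_true)[keep] - np.asarray(mu_pred)[keep])))

# Compare the CAN and the baseline ANN at the same coverage, e.g. keeping 50%:
# err_can  = error_at_coverage(y_test, mu_can,  sigma_can,  coverage=0.5)
# err_base = error_at_coverage(y_test, mu_base, sigma_base, coverage=0.5)
```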

25 of 26

Abstention outperforms baseline

  • Train abstention network for different abstention setpoints

  • The best CAN models are always better (lower error) than the best baseline ANN

(Figure: results for four experiments: specific labels corrupted (structured noise), shuffled sample labels (arbitrary label noise), forecasts of opportunity (skillful predictions), and corrupted inputs (input data cleaner). CAN outperforms baseline networks.)

Barnes and Barnes (2021a)

Barnes and Barnes (2021b)

26 of 26

Take home ideas

  • Predicting the local parameters of a probability distribution is a simple way to add uncertainty to regression tasks (we are preparing a short write-up generalizing this to heteroscedastic, asymmetric uncertainty)
  • The abstention loss for regression and classification allows networks to identify opportunities for skillful prediction, and then learn them better during training
  • Implementation of the abstention loss is straightforward in most network architectures

Prof. Elizabeth A. Barnes

eabarnes@colostate.edu

https://barnes.atmos.colostate.edu

@atmosbarnes

https://github.com/eabarnes1010/controlled_abstention_networks