1 of 53

Deep learning:

tips and tricks for getting it right

Francesco Vaselli

tCSC Machine Learning 2025, Malmo

2 of 53

What Pisa can teach us about Deep Learning

People tried to make a very tall tower

Turns out it’s not enough to stack one floor on the other if the foundations are not solid

It’s the same thing with Deep Learning:

Simply stacking many neurons is not enough!


3 of 53

Outline

A few things you may already know, revisited to motivate them and to understand why we do them the way we do

  • Bias-variance tradeoff
  • Basic ML recipe
  • Training and Bias
  • Testing and Variance
  • Loss functions


4 of 53

Bias-variance tradeoff

Encountered in most statistical models

We have bias when the model is too simple and does not capture the true relationship between x and y = h*(x)

e.g. a linear model cannot reproduce a quadratic one


Image credit: https://cs229.stanford.edu/main_notes.pdf

5 of 53

Bias-variance tradeoff

Encountered in most statistical models

When the model is too expressive, we risk fitting the accidental patterns of our small, finite training sample: this is the variance part of the error

e.g. a 5th-degree polynomial will overfit a small sample from a quadratic model
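As a concrete illustration of both failure modes, here is a minimal numpy sketch (not from the slides; sample size, noise level and degrees are illustrative choices): fitting polynomials of degree 1, 2 and 5 to a few noisy samples of a quadratic function.

    # Sketch: under- vs over-fitting a quadratic with polynomials of different degree.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 8)                       # small, finite training sample
    y = 2.0 * x**2 + 0.1 * rng.normal(size=x.size)  # quadratic truth + noise

    x_test = np.linspace(-1, 1, 200)                # dense grid of unseen points
    y_test = 2.0 * x_test**2

    for degree in (1, 2, 5):
        coeffs = np.polyfit(x, y, degree)           # least-squares polynomial fit
        train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

Typically, degree 1 does poorly on both sets (bias), degree 5 fits the training points almost perfectly but does worse on the unseen grid (variance), and degree 2 generalizes best.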


Image credit: https://cs229.stanford.edu/main_notes.pdf

6 of 53

Bias-variance tradeoff defines an optimal complexity

We must strike the right balance between a simple, highly biased model and a complex, variance-sensitive one

Deep learning requires careful tuning and experimentation!


7 of 53

How do we address this? Basic ML recipe


  • Training dataset (checks for bias): how well are we modelling the process producing the training data?
  • Test dataset (checks for variance): how well do we generalize to previously unseen instances of the data?
  • Deploy!


10 of 53

Data matters!!


Image credit: https://www.benchling.com/blog/building-a-strong-data-foundation-to-get-machine-learning-and-automation-right


12 of 53

Scale drives deep learning progress!

Assuming:

  1. You can fit the training set pretty well.
  2. The training set performance generalizes pretty well to the test set.


13 of 53

Fortunately, we have an abundance of data!


14 of 53

We still need to be careful about how we handle the data

Understand your data

  • Why is it relevant to your problem?
  • Is some important physical information missing?
  • Is the data correctly labelled?
  • Is the data introducing some unwanted correlations?
  • Is the data still relevant to your problem?


15 of 53

The art of Training


What can I do if the training performance is lacking?

It means that we are not learning the underlying statistical model

[Plot: loss vs. epochs]

Image credit: https://www.javatpoint.com/overfitting-in-machine-learning

16 of 53

The art of Training


What can I do if the training performance is lacking?

It means that we are not learning the underlying statistical model:

  • Train longer
  • Bigger network (more capacity)
  • The gradient/loss landscape and better learning algorithms (optimizers)
  • Vanishing/exploding gradients

17 of 53

Moving around the loss space can be tricky!


Image credit https://www.cs.umd.edu/~tomg/projects/landscapes

18 of 53

Basic gradient descent is limited

The naive stochastic/batch gradient descent has many limitations in how it navigates complex loss spaces (a bare-bones update step is sketched below):

  • it is sensitive to noise in the current sample/batch
  • it can easily get trapped in a local minimum
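For reference, a minimal numpy-style sketch of the plain update step (grad_fn and batch are placeholder names for whatever computes the mini-batch gradient; not from the slides):

    # Plain (stochastic) gradient descent: follow the noisy batch gradient
    # with one global learning rate, and no memory of previous steps.
    def sgd_step(w, grad_fn, batch, lr=0.01):
        g = grad_fn(w, batch)   # gradient estimated on the current mini-batch only
        return w - lr * g       # hence the sensitivity to batch noise and local minima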


https://optimization.cbe.cornell.edu/index.php?title=Momentum

19 of 53

Momentum helps overcome these limitations

Momentum is an extension to the algorithm that builds inertia in a search direction to overcome local minima and oscillation of noisy gradients.
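A minimal sketch of the momentum update (the names and the value of the momentum coefficient beta, typically around 0.9, are illustrative):

    # SGD with momentum: a velocity term accumulates past gradients,
    # smoothing out batch noise and helping to roll through shallow local minima.
    def momentum_step(w, v, grad_fn, batch, lr=0.01, beta=0.9):
        g = grad_fn(w, batch)
        v = beta * v - lr * g   # build inertia along consistent descent directions
        return w + v, v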


https://optimization.cbe.cornell.edu/index.php?title=Momentum

20 of 53

State-of-the-art: Adaptive Moment Estimation (Adam)

Adam is an adaptive learning-rate algorithm (one update step is sketched below):

  • It uses momentum (a running estimate of the gradient's first moment)
  • It dynamically adjusts the learning rate for each individual parameter within a model (via a running estimate of the gradient's second moment), rather than using a single global learning rate
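A minimal sketch of one Adam step, with the default hyperparameters from the original paper (function and variable names are mine):

    import numpy as np

    # Adam: per-parameter step sizes from running estimates of the gradient's
    # first moment (m, momentum-like) and second moment (v, element-wise scale).
    def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g               # update first-moment estimate
        v = beta2 * v + (1 - beta2) * g**2            # update second-moment estimate
        m_hat = m / (1 - beta1**t)                    # bias correction (t starts at 1)
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter effective step
        return w, m, v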


21 of 53

Vanishing/Exploding gradients


Remember that the gradients of the loss depend on:

  • the derivative of the activation functions
  • the derivative of each layer’s output with respect to its inputs (i.e. the weights)

Image credits: https://towardsdatascience.com/neural-networks-backpropagation-by-dr-lihi-gur-arie-27be67d8fdce

22 of 53

The gradients can vanish for small derivatives

Some choices of activation functions have small derivatives

This can lead to a chain of multiplications of small numbers!


23 of 53

The gradients can vanish for small derivatives

This can lead to a chain of multiplications of small numbers, making the gradients of the initial layers effectively zero and preventing learning

Mitigated through initialization and change of activation functions
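A tiny numerical illustration (a sketch, not from the slides): the sigmoid derivative is at most 0.25, so the chain-rule product across many layers collapses towards zero even in the best case.

    import numpy as np

    # Backpropagation multiplies one activation derivative per layer (times the weights).
    # With a sigmoid, sigma'(z) <= 0.25, so the product shrinks geometrically with depth.
    def sigmoid_grad(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)

    grad_factor = 1.0
    for layer in range(20):                  # 20 hidden layers
        grad_factor *= sigmoid_grad(0.0)     # 0.25 at z = 0, the largest possible value
    print(grad_factor)                       # ~9e-13: the first layers barely see a gradient

This is why ReLU-like activations (derivative 1 on the active side) and careful weight initialization help.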


Image 2 credits https://www.jefkine.com/general/2018/05/21/2018-05-21-vanishing-and-exploding-gradient-problems/

24 of 53

The gradients can explode for large weights

The weights can have a norm >>1

Multiplication of large numbers will result in exploding gradients for the weights of the first layers

Can be mitigated with regularization and with clipping of the gradients (or of the weights)
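A common form of clipping is to rescale the gradient when its norm is too large; a minimal numpy sketch (max_norm is an illustrative choice):

    import numpy as np

    # Gradient clipping by global norm: if the gradient is too large, rescale it
    # so that a single huge step cannot blow up the early-layer weights.
    def clip_by_norm(g, max_norm=1.0):
        norm = np.linalg.norm(g)
        if norm > max_norm:
            g = g * (max_norm / norm)
        return g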


Image 1 credit https://www.superannotate.com/blog/activation-functions-in-neural-networks

Image 2 credit: Deep Learning, Goodfellow et al

25 of 53

The art of Testing or Regularization


What if the testing performance is poor after training?

It means that we are modelling the specific fluctuations (the variance) of our training dataset

[Plot: training and test loss vs. epochs]

26 of 53

The art of Testing or Regularization


What if the testing performance is poor after training?

It means that we are modelling the specific fluctuations (the variance) of our training dataset:

  • Use more training data
  • Regularization techniques
  • Preprocessing of data

27 of 53

Regularization I: Weight decay

We add a term to the loss function proportional to the (squared) L2-norm of the weights

Penalizes large weights

But why do smaller weights correspond to simpler models?


from: Deep Learning, Goodfellow et al

28 of 53

Regularization I: Weight decay

Limits model complexity and non-linearity: think of an N-layer network as an Nth-degree polynomial in one input feature when the others are fixed.

y = f_n(W_n * ( ... f_1(W_1 * x + b_1) ... ) + b_n)

We are reducing the coefficients of such a polynomial, hence making it less expressive!
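In practice weight decay is just an extra term added to the training loss; a minimal sketch (the value of lam is illustrative):

    import numpy as np

    # L2 weight decay: total loss = task loss + lam * sum of squared weights.
    # Its gradient adds 2 * lam * w to each weight's gradient, shrinking it at every step.
    def l2_penalty(weights, lam=1e-4):
        return lam * sum(np.sum(w**2) for w in weights)

    def l2_grad(w, lam=1e-4):
        return 2.0 * lam * w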


from: Deep Learning, Goodfellow et al

29 of 53

Regularization II: Batch Normalization


Increases robustness by normalizing each feature with the mean and variance of the current, randomly drawn batch
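A minimal sketch of the batch-norm transformation at training time (gamma and beta are the standard learnable scale and shift; eps avoids division by zero):

    import numpy as np

    # Batch normalization (training mode): standardize each feature over the
    # current mini-batch, then re-scale and shift with learnable gamma and beta.
    def batch_norm(x, gamma, beta, eps=1e-5):
        mean = x.mean(axis=0)                    # per-feature mean over the batch
        var = x.var(axis=0)                      # per-feature variance over the batch
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

At inference time, running averages of the mean and variance collected during training are used instead of the batch statistics.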

30 of 53

Batch Normalization may help in the following case


Train: [images of dogs (y=1) and non-dogs (y=0)]

31 of 53

Batch Normalization may help in the following case


Train: [images of dogs (y=1) and non-dogs (y=0)]

Test: [new images of dogs (y=1) and non-dogs (y=0)]

32 of 53

Batch Normalization may reduce covariate shift



Covariate shift refers to changes in the input distribution between training and testing, affecting model performance.

Batch normalization helps by normalizing layer inputs across mini-batches, reducing internal covariate shift and stabilizing learning, thus improving generalization across varying distributions.


33 of 53

(Data) Regularization III: Normalization of Inputs


It’s standard practice to normalize the input of the network to have mean 0 and variance 1

This helps make the loss landscape more regular and speeds up training
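A minimal sketch; the key point is that the mean and standard deviation are computed on the training set only and then reused unchanged for validation and test data (function names are mine):

    import numpy as np

    # Standardize inputs to zero mean and unit variance using training-set statistics.
    def fit_scaler(x_train):
        return x_train.mean(axis=0), x_train.std(axis=0) + 1e-12   # avoid division by zero

    def transform(x, mean, std):
        return (x - mean) / std

    # Usage: mean, std = fit_scaler(x_train); x_test_norm = transform(x_test, mean, std)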

Image credits: https://heytech.tistory.com/

34 of 53

Regularization IV: Dropout

We don’t want to rely on single input features, so we spread the information out across many “sub-networks”

At inference time, all of the sub-networks are activated and contribute to the output (“wisdom of the crowd”)
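A minimal sketch of “inverted” dropout at training time, the common implementation that needs no rescaling at inference (keep_prob is an illustrative choice):

    import numpy as np

    # Inverted dropout (training time): randomly zero activations and rescale the rest,
    # so the expected activation is unchanged and inference can use the full network as-is.
    def dropout(a, keep_prob=0.8, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(a.shape) < keep_prob   # keep each unit with probability keep_prob
        return a * mask / keep_prob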


Image credit https://towardsdatascience.com/dropout-in-neural-networks-47a162d621d9

35 of 53

Data preprocessing can help with generalization


Image credits: https://learnai1.home.blog/2020/08/15/data-preprocessing-for-neural-networks/

36 of 53

Now hopefully things work ok!


Both the train and test errors are reasonable

We can deploy our models in the wild!

[Plot: training and test loss vs. epochs]

37 of 53

It all depends on loss functions! Can we trust them?

Goodhart's law

“When a measure becomes a target, it ceases to be a good measure.”

Related to overfitting. See also Kerr, 1975, “On the folly of rewarding A, while hoping for B”


Image credit https://sohl-dickstein.github.io/2022/11/06/strong-Goodhart.html


38 of 53

Let’s discuss loss functions

The existence of suitable loss functions is what makes the entire game of ML possible

Usually defined according to our tasks

e.g. MSE for regression, cross-entropy for classification, and so on.
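For concreteness, minimal numpy sketches of the two workhorse losses mentioned above:

    import numpy as np

    # Mean squared error, the standard regression loss.
    def mse(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    # Binary cross-entropy for classification; y_pred are predicted probabilities.
    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        y_pred = np.clip(y_pred, eps, 1.0 - eps)   # numerical safety
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))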

So far, we've mostly implied 'supervised' learning (like predicting house prices from features, or cat vs. dog from images).

But Machine Learning is broader! The main types depend on the data and the goal!


39 of 53

Supervised vs. Unsupervised Learning

Supervised: Labeled data (input X, output y)

Classification, regression

Unsupervised: Unlabeled data (input X only).

Discover hidden patterns, structure, or representations in the data.

Actually there’s more! Learning from the environment (RL, see Verena and Michael’s lectures) or…


https://freedium.cfd/b903bc09430e

40 of 53

Can we spot the two differences?



42 of 53

What have we learned?


We actually know how to tell Van Gogh’s style apart from all other styles

In comparing two similar paintings, we’ve actually learned something about the space of all paintings

How do we repeat this for ML?

43 of 53

We need a new way to structure our loss

We need a new type of loss function, one that builds a representation space by comparing two examples

We want similar points to end up close together and different points far apart

At the core of foundation models, see Sofia’s lectures
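One classic example of such a pairwise objective is the contrastive (margin) loss; a minimal sketch on a single pair of embeddings (names and the margin value are illustrative):

    import numpy as np

    # Contrastive loss: pull embeddings of similar pairs together,
    # push embeddings of dissimilar pairs at least `margin` apart.
    def contrastive_loss(z1, z2, similar, margin=1.0):
        d = np.linalg.norm(z1 - z2)                # distance in the embedding space
        if similar:
            return d ** 2                          # similar pair: minimize the distance
        return max(0.0, margin - d) ** 2           # dissimilar pair: enforce a margin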


44 of 53

The end?


45 of 53

A new trend in frontier AI/DL: the “bitter lesson”

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. [...]

Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. [...]

And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. [...] --- Rich Sutton


46 of 53

Two famous examples


Deep Blue and AlphaGo leveraged massive, brute force search strategies and then self-play

At the time, this was looked upon with dismay by the majority of computer-chess/go researchers who had pursued methods that leveraged human understanding of the special structure of chess/go!

47 of 53


Pre-2017 Deep Learning: every domain had its own specialized architecture

  • RL: BC/GAIL
  • Computer Vision: Convolutional NNs (+ResNets) [1]
  • Natural Lang. Proc.: Recurrent NNs (+LSTMs) [2]
  • Science: Graph NNs
  • Speech: Deep Belief Nets, i.e. stacks of (G)RBMs (+non-DL) [3]

Slide from Lucas Beyer lbeyer@google.com
[1] CNN image CC-BY-SA by Aphex34 for Wikipedia https://commons.wikimedia.org/wiki/File:Typical_cnn.png
[2] RNN image CC-BY-SA by GChe for Wikipedia https://commons.wikimedia.org/wiki/File:The_LSTM_Cell.svg
[3] By NickDiCicco - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=119932650

48 of 53


Current trends: one general architecture, the Transformer, shared across all domains

  • Computer Vision
  • Natural Lang. Proc.
  • Translation
  • Speech
  • Reinf. Learning
  • Graphs/Science

Slide from Lucas Beyer lbeyer@google.com. Transformer image source: "Attention Is All You Need" paper

49 of 53

So in the end, we’re back to stacking layers??

Image credit: Ilya Sutskever


Not quite!

Everything we’ve seen in this lecture is the backbone of scaling Deep(er) Networks and making them converge

Human/physical domain knowledge is still extremely helpful and impactful in the short/medium term, both when training time and when scale are limited

And most of the time we are dealing with finite resources for training and deployment

50 of 53

Physics is NOT industry

Different scales and necessities

Speed is a crucial requirement, often with no clear equivalent in industry

Not all architectures are suited to every application


51 of 53

Conclusions

From a simple set of linear algebra operations, we can construct incredible tools called Deep Neural Networks

Basic neurons are not enough! We need to introduce a set of algorithms and data preprocessing to make learning easier, more stable, and more generalizable

Different ways of using loss functions and of leveraging scale seem very promising today… but who knows what the future holds!


52 of 53

backup


53 of 53

We can do the math if needed
