1 of 53

Deep learning:

tips and tricks for getting it right

Francesco Vaselli

tCSC Machine Learning 2025, Malmo

2 of 53

What Pisa can teach us about Deep Learning

People tried to make a very tall tower

Turns out it’s not enough to stack one floor on the other if the foundations are not solid

It’s the same thing with Deep Learning:

Simply stacking many neurons is not enough!


3 of 53

Outline

A few things you may already know, revisited to motivate them and to understand why we do them the way we do

  • Bias-variance tradeoff
  • Basic ML recipe
  • Training and Bias
  • Testing and Variance
  • Loss functions


4 of 53

Bias-variance tradeoff

Encountered in most statistical models

We have bias when the model is too simple and does not capture the true relationship between x and y = h*(x)

e.g. a linear model cannot reproduce a quadratic one


Image credit: https://cs229.stanford.edu/main_notes.pdf

5 of 53

Bias-variance tradeoff

Encountered in most statistical models

When the model is too expressive, we risk fitting the accidental patterns of our small, finite training sample: this is the variance part of the error

e.g. a 5th-degree polynomial will overfit a small sample from a quadratic model
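As a concrete illustration of both failure modes, here is a minimal numpy sketch (not from the slides; sample size, noise level and degrees are illustrative choices): fitting polynomials of degree 1, 2 and 5 to a few noisy samples of a quadratic function.

    # Sketch: under- vs over-fitting a quadratic with polynomials of different degree.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 8)                       # small, finite training sample
    y = 2.0 * x**2 + 0.1 * rng.normal(size=x.size)  # quadratic truth + noise

    x_test = np.linspace(-1, 1, 200)                # dense grid of unseen points
    y_test = 2.0 * x_test**2

    for degree in (1, 2, 5):
        coeffs = np.polyfit(x, y, degree)           # least-squares polynomial fit
        train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

Typically, degree 1 does poorly on both sets (bias), degree 5 fits the training points almost perfectly but does worse on the unseen grid (variance), and degree 2 generalizes best.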


Image credit: https://cs229.stanford.edu/main_notes.pdf

6 of 53

Bias-variance tradeoff defines an optimal complexity

We must strike the right balance between a simple, highly biased model and a complex, variance-sensitive one

Deep learning requires careful tuning and experimentation!


7 of 53

How do we address this? Basic ML recipe


  • Training dataset (checks for bias): how well are we modelling the process producing the training data?
  • Test dataset (checks for variance): how well do we generalize to previously unseen instances of the data?
  • Deploy!


10 of 53

Data matters!!


Image credit: https://www.benchling.com/blog/building-a-strong-data-foundation-to-get-machine-learning-and-automation-right


12 of 53

Scale drives deep learning progress!

Assuming:

  1. You can fit the training set pretty well.
  2. The training set performance generalizes pretty well to the test set.


13 of 53

Fortunately, we have an abundance of data!


14 of 53

We still need to be careful about how we handle the data

Understand your data

  • Why is it relevant to your problem?
  • Is some important physical information missing?
  • Is the data correctly labelled?
  • Is the data introducing some unwanted correlations?
  • Is the data still relevant to your problem?


15 of 53

The art of Training


What can I do if the training performance is lacking?

It means that we are not learning the underlying statistical model

[Plot: loss vs. epochs]

Image credit: https://www.javatpoint.com/overfitting-in-machine-learning

16 of 53

The art of Training


What can I do if the training performance is lacking?

It means that we are not learning the underlying statistical model:

  • Train longer
  • Bigger network (more capacity)
  • The gradient/loss landscape and better learning algorithms (optimizers)
  • Vanishing/exploding gradients

17 of 53

Moving around the loss space can be tricky!


Image credit https://www.cs.umd.edu/~tomg/projects/landscapes

18 of 53

Basic gradient descent is limited

The naive stochastic/batch gradient descent has many limitations in how it navigates complex loss spaces (a bare-bones update step is sketched below):

  • it is sensitive to noise in the current sample/batch
  • it can easily get trapped in a local minimum
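For reference, a minimal numpy-style sketch of the plain update step (grad_fn and batch are placeholder names for whatever computes the mini-batch gradient; not from the slides):

    # Plain (stochastic) gradient descent: follow the noisy batch gradient
    # with one global learning rate, and no memory of previous steps.
    def sgd_step(w, grad_fn, batch, lr=0.01):
        g = grad_fn(w, batch)   # gradient estimated on the current mini-batch only
        return w - lr * g       # hence the sensitivity to batch noise and local minima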


https://optimization.cbe.cornell.edu/index.php?title=Momentum

19 of 53

Momentum helps overcome these limitations

Momentum is an extension to the algorithm that builds inertia in a search direction to overcome local minima and oscillation of noisy gradients.
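A minimal sketch of the momentum update (the names and the value of the momentum coefficient beta, typically around 0.9, are illustrative):

    # SGD with momentum: a velocity term accumulates past gradients,
    # smoothing out batch noise and helping to roll through shallow local minima.
    def momentum_step(w, v, grad_fn, batch, lr=0.01, beta=0.9):
        g = grad_fn(w, batch)
        v = beta * v - lr * g   # build inertia along consistent descent directions
        return w + v, v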


https://optimization.cbe.cornell.edu/index.php?title=Momentum

20 of 53

State-of-the-art: Adaptive Moment Estimation (Adam)

Adam is an adaptive learning-rate algorithm (one update step is sketched below):

  • It uses momentum (a running estimate of the gradient's first moment)
  • It dynamically adjusts the learning rate for each individual parameter within a model (via a running estimate of the gradient's second moment), rather than using a single global learning rate
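A minimal sketch of one Adam step, with the default hyperparameters from the original paper (function and variable names are mine):

    import numpy as np

    # Adam: per-parameter step sizes from running estimates of the gradient's
    # first moment (m, momentum-like) and second moment (v, element-wise scale).
    def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g               # update first-moment estimate
        v = beta2 * v + (1 - beta2) * g**2            # update second-moment estimate
        m_hat = m / (1 - beta1**t)                    # bias correction (t starts at 1)
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter effective step
        return w, m, v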


21 of 53

Vanishing/Exploding gradients


Remember that the gradients of the loss depend on:

  • the derivative of the activation functions
  • the derivative of each layer’s output with respect to its inputs (i.e. the weights)

Image credits: https://towardsdatascience.com/neural-networks-backpropagation-by-dr-lihi-gur-arie-27be67d8fdce

22 of 53

The gradients can vanish for small derivatives

Some choices of activation functions have small derivatives

This can lead to a chain of multiplications of small numbers!


23 of 53

The gradients can vanish for small derivatives

This can lead to a chain of multiplications of small numbers, making the gradients of the initial layers effectively zero and preventing learning

Mitigated through initialization and change of activation functions
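A tiny numerical illustration (a sketch, not from the slides): the sigmoid derivative is at most 0.25, so the chain-rule product across many layers collapses towards zero even in the best case.

    import numpy as np

    # Backpropagation multiplies one activation derivative per layer (times the weights).
    # With a sigmoid, sigma'(z) <= 0.25, so the product shrinks geometrically with depth.
    def sigmoid_grad(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)

    grad_factor = 1.0
    for layer in range(20):                  # 20 hidden layers
        grad_factor *= sigmoid_grad(0.0)     # 0.25 at z = 0, the largest possible value
    print(grad_factor)                       # ~9e-13: the first layers barely see a gradient

This is why ReLU-like activations (derivative 1 on the active side) and careful weight initialization help.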


Image 2 credits https://www.jefkine.com/general/2018/05/21/2018-05-21-vanishing-and-exploding-gradient-problems/

24 of 53

The gradients can explode for large weights

The weights can have a norm >>1

Multiplication of large numbers will result in exploding gradients for the weights of the first layers

Can be mitigated with regularization and with clipping of the gradients (or of the weights)
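A common form of clipping is to rescale the gradient when its norm is too large; a minimal numpy sketch (max_norm is an illustrative choice):

    import numpy as np

    # Gradient clipping by global norm: if the gradient is too large, rescale it
    # so that a single huge step cannot blow up the early-layer weights.
    def clip_by_norm(g, max_norm=1.0):
        norm = np.linalg.norm(g)
        if norm > max_norm:
            g = g * (max_norm / norm)
        return g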


Image 1 credit https://www.superannotate.com/blog/activation-functions-in-neural-networks

Image 2 credit: Deep Learning, Goodfellow et al

25 of 53

The art of Testing or Regularization


What if the testing performance is poor after training?

It means that we are modelling the specific fluctuations (the variance) of our training dataset

[Plot: training and test loss vs. epochs]

26 of 53

The art of Testing or Regularization


What if the testing performance is poor after training?

It means that we are modelling the specific fluctuations (the variance) of our training dataset:

  • Use more training data
  • Regularization techniques
  • Preprocessing of data

27 of 53

Regularization I: Weight decay

We add a term to the loss function proportional to the (squared) L2-norm of the weights

Penalizes large weights

But why do smaller weights correspond to simpler models?


from: Deep Learning, Goodfellow et al

28 of 53

Regularization I: Weight decay

Limits model complexity and non-linearity: think of an N-layer network as an Nth-degree polynomial in one input feature when the others are fixed.

y = f_n(W_n * ( ... f_1(W_1 * x + b_1) ... ) + b_n)

We are reducing the coefficients of such a polynomial, hence making it less expressive!
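In practice weight decay is just an extra term added to the training loss; a minimal sketch (the value of lam is illustrative):

    import numpy as np

    # L2 weight decay: total loss = task loss + lam * sum of squared weights.
    # Its gradient adds 2 * lam * w to each weight's gradient, shrinking it at every step.
    def l2_penalty(weights, lam=1e-4):
        return lam * sum(np.sum(w**2) for w in weights)

    def l2_grad(w, lam=1e-4):
        return 2.0 * lam * w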


from: Deep Learning, Goodfellow et al

29 of 53

Regularization II: Batch Normalization


Increases robustness by normalizing each feature with the mean and variance of the current, randomly drawn batch
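A minimal sketch of the batch-norm transformation at training time (gamma and beta are the standard learnable scale and shift; eps avoids division by zero):

    import numpy as np

    # Batch normalization (training mode): standardize each feature over the
    # current mini-batch, then re-scale and shift with learnable gamma and beta.
    def batch_norm(x, gamma, beta, eps=1e-5):
        mean = x.mean(axis=0)                    # per-feature mean over the batch
        var = x.var(axis=0)                      # per-feature variance over the batch
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

At inference time, running averages of the mean and variance collected during training are used instead of the batch statistics.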

30 of 53

Batch Normalization may help in the following case


Train: [images of dogs (y=1) and non-dogs (y=0)]

31 of 53

Batch Normalization may help in the following case


Train: [images of dogs (y=1) and non-dogs (y=0)]

Test: [new images of dogs (y=1) and non-dogs (y=0)]

32 of 53

Batch Normalization may reduce covariate shift



Covariate shift refers to changes in the input distribution between training and testing, affecting model performance.

Batch normalization helps by normalizing layer inputs across mini-batches, reducing internal covariate shift and stabilizing learning, thus improving generalization across varying distributions.


33 of 53

(Data) Regularization III: Normalization of Inputs


It’s standard practice to normalize the input of the network to have mean 0 and variance 1

This helps make the loss landscape more regular and speeds up training
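A minimal sketch; the key point is that the mean and standard deviation are computed on the training set only and then reused unchanged for validation and test data (function names are mine):

    import numpy as np

    # Standardize inputs to zero mean and unit variance using training-set statistics.
    def fit_scaler(x_train):
        return x_train.mean(axis=0), x_train.std(axis=0) + 1e-12   # avoid division by zero

    def transform(x, mean, std):
        return (x - mean) / std

    # Usage: mean, std = fit_scaler(x_train); x_test_norm = transform(x_test, mean, std)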

Image credits: https://heytech.tistory.com/

34 of 53

Regularization IV: Dropout

We don’t want to rely on single input features, so we spread the information out across many “sub-networks”

At inference time, all of the sub-networks are activated and contribute to the output (“wisdom of the crowd”)
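A minimal sketch of “inverted” dropout at training time, the common implementation that needs no rescaling at inference (keep_prob is an illustrative choice):

    import numpy as np

    # Inverted dropout (training time): randomly zero activations and rescale the rest,
    # so the expected activation is unchanged and inference can use the full network as-is.
    def dropout(a, keep_prob=0.8, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(a.shape) < keep_prob   # keep each unit with probability keep_prob
        return a * mask / keep_prob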


Image credit https://towardsdatascience.com/dropout-in-neural-networks-47a162d621d9

35 of 53

Data preprocessing can help with generalization


Image credits: https://learnai1.home.blog/2020/08/15/data-preprocessing-for-neural-networks/

36 of 53

Now hopefully things work ok!


Both the train and test errors are reasonable

We can deploy our models in the wild!

[Plot: training and test loss vs. epochs]

37 of 53

It all depends on loss functions! Can we trust them?

Goodhart's law

“When a measure becomes a target, it ceases to be a good measure.”

Related to overfitting. See also Kerr, 1975, “On the folly of rewarding A, while hoping for B”


Image credit https://sohl-dickstein.github.io/2022/11/06/strong-Goodhart.html


38 of 53

Let’s discuss loss functions

The existence of suitable loss functions is what makes the entire game of ML possible

Usually defined according to our tasks

e.g. MSE for regression, cross-entropy for classification, and so on.
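For concreteness, minimal numpy sketches of the two workhorse losses mentioned above:

    import numpy as np

    # Mean squared error, the standard regression loss.
    def mse(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    # Binary cross-entropy for classification; y_pred are predicted probabilities.
    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        y_pred = np.clip(y_pred, eps, 1.0 - eps)   # numerical safety
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))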

So far, we've mostly implied 'supervised' learning (like predicting house prices from features, or cat vs. dog from images).

But Machine Learning is broader! The main types depend on the data and the goal!


39 of 53

Supervised vs. Unsupervised Learning

Supervised: Labeled data (input X, output y)

Classification, regression

Unsupervised: Unlabeled data (input X only).

Discover hidden patterns, structure, or representations in the data.

Actually there’s more! Learning from the environment (RL, see Verena and Michael’s lectures) or…


https://freedium.cfd/b903bc09430e

40 of 53

Can we spot the two differences?



42 of 53

What have we learned?


We actually know how to tell Van Gogh’s style apart from all other styles

In comparing two similar paintings, we’ve actually learned something about the space of all paintings

How do we repeat this for ML?

43 of 53

We need a new way to structure our loss

We need a new type of loss function, one that builds a representation space by comparing two examples

We want similar points to end up close together and different points far apart

At the core of foundation models, see Sofia’s lectures
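One classic example of such a pairwise objective is the contrastive (margin) loss; a minimal sketch on a single pair of embeddings (names and the margin value are illustrative):

    import numpy as np

    # Contrastive loss: pull embeddings of similar pairs together,
    # push embeddings of dissimilar pairs at least `margin` apart.
    def contrastive_loss(z1, z2, similar, margin=1.0):
        d = np.linalg.norm(z1 - z2)                # distance in the embedding space
        if similar:
            return d ** 2                          # similar pair: minimize the distance
        return max(0.0, margin - d) ** 2           # dissimilar pair: enforce a margin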


44 of 53

The end?


45 of 53

A new trend in frontier AI/DL: the “bitter lesson”

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. [...]

Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. [...]

And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. [...] --- Rich Sutton


46 of 53

Two famous examples


Deep Blue and AlphaGo leveraged massive, brute force search strategies and then self-play

At the time, this was looked upon with dismay by the majority of computer-chess/go researchers who had pursued methods that leveraged human understanding of the special structure of chess/go!

47 of 53


Pre-2017 Deep Learning: every domain had its own specialized architecture

  • RL: BC/GAIL
  • Computer Vision: Convolutional NNs (+ResNets) [1]
  • Natural Lang. Proc.: Recurrent NNs (+LSTMs) [2]
  • Science: Graph NNs
  • Speech: Deep Belief Nets, i.e. stacks of (G)RBMs (+non-DL) [3]

Slide from Lucas Beyer lbeyer@google.com
[1] CNN image CC-BY-SA by Aphex34 for Wikipedia https://commons.wikimedia.org/wiki/File:Typical_cnn.png
[2] RNN image CC-BY-SA by GChe for Wikipedia https://commons.wikimedia.org/wiki/File:The_LSTM_Cell.svg
[3] By NickDiCicco - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=119932650

48 of 53


Current trends: one general architecture, the Transformer, shared across all domains

  • Computer Vision
  • Natural Lang. Proc.
  • Translation
  • Speech
  • Reinf. Learning
  • Graphs/Science

Slide from Lucas Beyer lbeyer@google.com. Transformer image source: "Attention Is All You Need" paper

49 of 53

So in the end, we’re back to stacking layers??

Image credit: Ilya Sutskever


Not quite!

Everything we’ve seen in this lecture is the backbone of scaling Deep(er) Networks and making them converge

Human/physical domain knowledge is still extremely helpful and impactful in the short/medium term, both when training time and when scale are limited

And most of the time we are dealing with finite resources for training and deployment

50 of 53

Physics is NOT industry

Different scales and necessities

Speed is a crucial requirement, often with no clear equivalent in industry

Not all architectures are suited to every application


51 of 53

Conclusions

From a simple set of linear algebra operations, we can construct incredible tools called Deep Neural Networks

Basic neurons are not enough! We need to introduce a set of algorithms and data preprocessing to make learning easier, more stable, and more generalizable

Different ways of using loss functions and of leveraging scale seem very promising today… but who knows what the future holds!


52 of 53

backup


53 of 53

We can do the math if needed
