Deep learning:
tips and tricks for getting it right
Francesco Vaselli
tCSC Machine Learning 2025, Malmo
What Pisa can teach us about Deep Learning
People tried to make a very tall tower
Turns out it’s not enough to stack one floor on top of another if the foundations are not solid
It’s the same thing with Deep Learning:
Simply stacking many neurons is not enough!
2
Outline
A few things you may already know, this time motivated: we try to understand why we do them that way
3
Bias-variance tradeoff
Encountered in most statistical models
We have bias when the model is too simple and does not capture the true relationship between x and y = h*(x)
e.g. a linear model cannot reproduce a quadratic one
4
Image credit: https://cs229.stanford.edu/main_notes.pdf
Bias-variance tradeoff
Encountered in most statistical models
When the model is too expressive, we risk fitting spurious patterns of our small, finite training sample -- the variance part of the error
e.g. a 5th-degree polynomial will overfit a small sample from a quadratic model
5
Image credit: https://cs229.stanford.edu/main_notes.pdf
Bias-variance tradeoff defines an optimal complexity
We must strike the right balance between a simple, highly biased model and a complex, variance-sensitive one
Deep learning requires careful tuning and experimentation!
6
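To make this concrete, here is a minimal sketch (assuming only NumPy; the toy quadratic model h*(x) = 1 + 2x² and the sample sizes are made up for illustration) that fits polynomials of increasing degree to a small noisy sample: the linear fit underfits (bias), while the high-degree fits start chasing the noise (variance).

```python
# Minimal sketch, assuming NumPy: underfitting vs. overfitting on a toy quadratic model.
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: 1.0 + 2.0 * x**2              # the "true" relationship h*(x)

x_train = rng.uniform(-1, 1, size=15)             # small, finite training sample
y_train = true_fn(x_train) + rng.normal(0, 0.1, size=15)
x_test = rng.uniform(-1, 1, size=500)             # large held-out sample
y_test = true_fn(x_test) + rng.normal(0, 0.1, size=500)

for degree in (1, 2, 5, 9):
    coeffs = np.polyfit(x_train, y_train, degree)                  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# degree 1: both errors high (bias); degree 9: low train error, higher test error (variance)
```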
How do we address this? Basic ML recipe
7
Training dataset
How well we are modelling the process that produces the training data
Checks for bias
Test dataset
How well we generalize to previously unseen instances of the data
Checks for variance
Deploy!
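As a minimal sketch of the recipe (NumPy only, with made-up synthetic data and a plain least-squares model standing in for the network): hold out a test set, read bias off the training error and variance off the train/test gap.

```python
# Minimal sketch, assuming NumPy: train/test split and how to read the two errors.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, size=1000)

split = 800                                              # e.g. 80% train / 20% test
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)    # fit only on the training set
train_mse = np.mean((X_train @ w - y_train) ** 2)        # high -> bias (underfitting)
test_mse = np.mean((X_test @ w - y_test) ** 2)           # much higher than train -> variance

print(f"train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```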
Data matters!!
10
Image credit: https://www.benchling.com/blog/building-a-strong-data-foundation-to-get-machine-learning-and-automation-right
Scale drives deep learning progress!
Assuming:
12
Fortunately, we have an abundance of data!
13
We still need to be careful in how we handle them
Understand your data
14
The art of Training
15
What can I do if the training performance is lacking?
It means that we are not learning the underlying statistical model
[Plot: loss vs. epochs]
Image credit: https://www.javatpoint.com/overfitting-in-machine-learning
The art of Training
16
What can I do if the training performance is lacking?
It means that we are not learning the underlying statistical model:
Moving around the loss space can be tricky!
17
Image credit https://www.cs.umd.edu/~tomg/projects/landscapes
Basic gradient descent is limited
The naive stochastic/batch gradient descent has many limitations in how it navigates complex loss spaces:
18
https://optimization.cbe.cornell.edu/index.php?title=Momentum
Momentum helps overcome these limitations
Momentum is an extension of gradient descent that builds inertia along the search direction, helping to overcome local minima and the oscillations caused by noisy gradients.
19
https://optimization.cbe.cornell.edu/index.php?title=Momentum
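A minimal sketch of the idea, assuming the standard heavy-ball update v ← βv − η∇L, w ← w + v (the toy quadratic loss and the `grad_fn` placeholder are illustrative, not taken from the slides):

```python
# Minimal sketch, NumPy only: gradient descent with momentum on a toy quadratic loss.
import numpy as np

def grad_fn(w):
    return w                         # gradient of the toy loss L(w) = 0.5 * ||w||^2

w = np.array([5.0, -3.0])
v = np.zeros_like(w)                 # the "velocity" accumulating past gradients
lr, beta = 0.1, 0.9

for _ in range(200):
    v = beta * v - lr * grad_fn(w)   # inertia: mostly keep the previous direction
    w = w + v                        # helps to roll through noisy or flat regions

print(w)                             # ends up close to the minimum at the origin
```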
State-of-the-art: Adaptive Moment Estimation (Adam)
Adam is an adaptive learning rate algorithm:
20
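For reference, a minimal NumPy sketch of the Adam update with its bias-corrected first and second moments (the toy gradient and the hyperparameter values are just illustrative defaults); in practice one simply calls `torch.optim.Adam(model.parameters(), lr=...)`.

```python
# Minimal sketch, NumPy only: the Adam update rule.
import numpy as np

def grad_fn(w):
    return w                                   # gradient of a toy quadratic loss

w = np.array([5.0, -3.0])
m = np.zeros_like(w)                           # first moment: running mean of gradients
v = np.zeros_like(w)                           # second moment: running mean of squared gradients
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)                 # correct the bias of the zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step size

print(w)
```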
Vanishing/Exploding gradients
21
Remember that, via the chain rule of backpropagation, the gradients of the loss contain products of the activation-function derivatives and the weights of every downstream layer
Image credits: https://towardsdatascience.com/neural-networks-backpropagation-by-dr-lihi-gur-arie-27be67d8fdce
The gradients can vanish for small derivatives
Some choices of activation functions have small derivatives
This can lead to a long chain of multiplications of small numbers!
22
The gradients can vanish for small derivatives
This chain of multiplications of small numbers makes the gradients of the initial layers effectively 0, preventing learning
Mitigated through careful weight initialization and a different choice of activation function
23
Image 2 credits https://www.jefkine.com/general/2018/05/21/2018-05-21-vanishing-and-exploding-gradient-problems/
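A minimal PyTorch sketch (network sizes and data are made up) comparing the gradient reaching the first layer of a deep sigmoid MLP with that of a ReLU MLP; the sigmoid version illustrates the chain of small derivatives:

```python
# Minimal sketch, assuming PyTorch: per-layer gradient norms, sigmoid vs. ReLU.
import torch
import torch.nn as nn

def make_mlp(act, depth=16, width=64):
    layers, in_dim = [], 10
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), act()]
        in_dim = width
    layers += [nn.Linear(in_dim, 1)]
    return nn.Sequential(*layers)

torch.manual_seed(0)
x, y = torch.randn(128, 10), torch.randn(128, 1)

for name, act in [("sigmoid", nn.Sigmoid), ("relu", nn.ReLU)]:
    model = make_mlp(act)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    first = model[0].weight.grad.norm().item()    # gradient reaching the first layer
    last = model[-1].weight.grad.norm().item()    # gradient at the output layer
    print(f"{name}: first-layer grad norm {first:.2e}, last-layer {last:.2e}")
```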
The gradients can explode for large weights
The weights can have a norm >> 1
The chained multiplication of large numbers then makes the gradients of the first layers explode
Can be mitigated with regularization and with gradient (or weight) clipping
24
Image 1 credit https://www.superannotate.com/blog/activation-functions-in-neural-networks
Image 2 credit
Deep Learning, Goodfellow et al
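One common concrete mitigation is to clip the global gradient norm inside the training loop; a minimal PyTorch sketch (the model, data, and max_norm value are arbitrary choices for illustration):

```python
# Minimal sketch, assuming PyTorch: clip the gradient norm before the optimizer step.
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if ||g|| > 1
optimizer.step()
```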
The art of Testing or Regularization
25
What if the testing performance is poor after training?
It means that we are modelling the specific statistical fluctuations of our training dataset
[Plot: loss vs. epochs]
The art of Testing or Regularization
26
What if the testing performance is poor after training?
It means that we are modelling the specific statistical fluctuations of our training dataset:
Regularization I: Weight decay
We add a term to the loss function proportional to the squared L2-norm of the weights
Penalizes large weights
But why do smaller weights correspond to simpler models?
27
from: Deep Learning, Goodfellow et al
Regularization I: Weight decay
Limits model complexity and non-linearity: think of an N-layer network as an Nth-degree polynomial in one input feature when the others are held fixed.
y = f_N(W_N · ... f_1(W_1 x + b_1) ... + b_N)
We are shrinking the coefficients of such a polynomial, hence making it less expressive!
28
from: Deep Learning, Goodfellow et al
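A minimal PyTorch sketch of weight decay, shown two common ways: as an explicit penalty λ·Σ‖W‖² added to the data loss, or via the optimizer's weight_decay argument (the two differ in details, e.g. a factor of 2 and whether biases are penalized); the model, data and λ value are placeholders.

```python
# Minimal sketch, assuming PyTorch: two ways to apply L2 weight decay.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(32, 10), torch.randn(32, 1)
lam = 1e-4

# Option 1: explicit L2 penalty added to the data loss
data_loss = nn.functional.mse_loss(model(x), y)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + lam * l2_penalty

# Option 2: let the optimizer shrink the weights at every step
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=lam)
```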
Regularization II: Batch Normalization
29
Increases robustness by subtracting a “batch-random” mean and dividing by a “batch-random” standard deviation
Batch Normalization may help in the following case
30
[Images: a training set labelled Dog (y=1) and Not Dog (y=0)]
Batch Normalization may help in the following case
31
[Images: a training set and a test set, each labelled Dog (y=1) and Not Dog (y=0)]
Batch Normalization may reduce covariate shift
32
Covariate shift refers to changes in the data distribution between training and testing, affecting model performance.
Batch normalization helps by normalizing features across mini-batches, reducing internal covariate shift and stabilizing learning, thus improving generalization across varying distributions.
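A minimal PyTorch sketch of where Batch Normalization sits in a network (layer sizes are arbitrary): during training it normalizes each feature with the statistics of the current mini-batch, while at evaluation time it falls back on running estimates.

```python
# Minimal sketch, assuming PyTorch: BatchNorm1d between linear layers.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),   # normalize the 64 activations over the batch dimension
    nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(32, 10)
model.train()             # batch statistics are used (and running stats are updated)
out_train = model(x)
model.eval()              # running mean/variance are used instead
out_eval = model(x)
```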
(Data) Regularization III: Normalization of Inputs
33
It’s standard practice to normalize the input of the network to have mean 0 and variance 1
This helps make the gradient landscape more regular and speeds up training
Image credits: https://heytech.tistory.com/
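A minimal NumPy sketch (with made-up data): compute the mean and standard deviation on the training set only, and reuse them for the test set, so that no test information leaks into training.

```python
# Minimal sketch, assuming NumPy: standardize inputs to zero mean and unit variance.
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(loc=50.0, scale=10.0, size=(800, 4))
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 4))

mean = X_train.mean(axis=0)               # statistics from the training set only
std = X_train.std(axis=0) + 1e-8          # guard against zero-variance features

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std       # same transformation applied to the test set
```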
Regularization IV: Dropout
We don’t want to rely on single input features, so we spread out the information through many “sub-networks”
At inference time, all of the sub-networks are activated and contribute to the output (“wisdom of the crowd”)
34
Image credit https://towardsdatascience.com/dropout-in-neural-networks-47a162d621d9
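A minimal PyTorch sketch (layer sizes and the dropout rate p=0.5 are just illustrative): Dropout is only active in training mode, where it zeroes activations at random and rescales the survivors; in eval mode all “sub-networks” contribute.

```python
# Minimal sketch, assuming PyTorch: Dropout in training vs. evaluation mode.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),    # drop half of the hidden activations at random (training only)
    nn.Linear(64, 1),
)

x = torch.randn(8, 10)
model.train()
out1, out2 = model(x), model(x)    # different random masks -> different outputs
model.eval()
out3, out4 = model(x), model(x)    # dropout disabled -> identical outputs
assert torch.allclose(out3, out4)
```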
Data preprocessing can help with generalization
35
Image credits: https://learnai1.home.blog/2020/08/15/data-preprocessing-for-neural-networks/
Now hopefully things work ok!
36
Both the train and test errors are reasonable
We can deploy our models in the wild!
[Plot: loss vs. epochs]
It all depends on loss functions! Can we trust them?
Goodhart's law
“When a measure becomes a target, it ceases to be a good measure".
Related to overfitting. See also Kerr, 1975, “On the folly of rewarding A, while hoping for B”
37
Image credit https://sohl-dickstein.github.io/2022/11/06/strong-Goodhart.html
Let’s discuss loss functions
The existence of suitable loss functions is what makes the entire game of ML possible
Usually defined according to our tasks
e.g. MSE for regression, cross-entropy for classification, and so on (a minimal sketch follows below).
So far, we've mostly implied 'supervised' learning (like predicting house prices from features, or cat vs. dog from images).
But Machine Learning is broader! The main types depend on the data and the goal!
38
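A minimal PyTorch sketch of the two workhorse losses mentioned above (random tensors stand in for real predictions and targets); note that nn.CrossEntropyLoss expects raw logits and integer class labels.

```python
# Minimal sketch, assuming PyTorch: the two most common task losses.
import torch
import torch.nn as nn

# Regression: mean squared error between predictions and continuous targets
mse = nn.MSELoss()
pred = torch.randn(16, 1)
target = torch.randn(16, 1)
print(mse(pred, target))

# Classification: cross-entropy on raw logits (no softmax) and integer class labels
ce = nn.CrossEntropyLoss()
logits = torch.randn(16, 3)                 # 3 classes
labels = torch.randint(0, 3, (16,))
print(ce(logits, labels))
```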
Supervised vs. Unsupervised Learning
Supervised: Labeled data (input X, output y)
Classification, regression
Unsupervised: Unlabeled data (input X only).
Discover hidden patterns, structure, or representations in the data.
Actually there’s more! Learning from the environment (RL, see Verena and Michael’s lectures) or…
39
https://freedium.cfd/b903bc09430e
Can we spot the two differences?
40
What have we learned?
42
We actually know how to spot Van Gogh’s style from all other styles
In comparing two similar paintings, we’ve actually learned something about the space of all paintings
How do we repeat this for ML?
We need a new way to structure our loss
We need a new type of loss function that builds a representation space by comparing two examples (a minimal sketch follows below)
We want similar points close together and different points further apart
At the core of foundation models, see Sofia’s lectures
43
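As a minimal PyTorch sketch of one such pairwise loss (the function name `contrastive_loss`, the margin and the embedding sizes are illustrative choices, not the specific loss used in the lectures): similar pairs are pulled together, dissimilar pairs are pushed at least a margin apart.

```python
# Minimal sketch, assuming PyTorch: a pairwise contrastive loss on embeddings.
import torch

def contrastive_loss(z1, z2, same, margin=1.0):
    """z1, z2: batches of embeddings; same: 1 if the pair is similar, else 0."""
    d = torch.norm(z1 - z2, dim=1)                              # distance in embedding space
    pull = same * d.pow(2)                                      # similar pairs: shrink distance
    push = (1 - same) * torch.clamp(margin - d, min=0).pow(2)   # dissimilar: enforce a margin
    return (pull + push).mean()

z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
same = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(z1, z2, same))
```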
The end?
44
A new trend in frontier AI/DL: the “bitter lesson”
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. [...]
Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. [...]
And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. [...] --- Rich Sutton
45
Two famous examples
46
Deep Blue and AlphaGo leveraged massive, brute-force search strategies and, in AlphaGo’s case, self-play
At the time, this was looked upon with dismay by the majority of computer-chess/go researchers who had pursued methods that leveraged human understanding of the special structure of chess/go!
47
Pre-2017 Deep Learning: a different architecture for each domain
RL: BC/GAIL
Computer Vision: Convolutional NNs (+ResNets)
Natural Lang. Proc.: Recurrent NNs (+LSTMs)
Science: Graph NNs
Speech: Deep Belief Nets (+non-DL)
Slide from Lucas Beyer lbeyer@google.com
[1] CNN image CC-BY-SA by Aphex34 for Wikipedia https://commons.wikimedia.org/wiki/File:Typical_cnn.png
[2] RNN image CC-BY-SA by GChe for Wikipedia https://commons.wikimedia.org/wiki/File:The_LSTM_Cell.svg
[3] By NickDiCicco - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=119932650
48
[Figure: Computer Vision, Natural Lang. Proc., Translation, Speech, Reinf. Learning, and Graphs/Science all converging on the Transformer architecture]
Slide from Lucas Beyer lbeyer@google.com. Transformer image source: "Attention Is All You Need" paper
Current trends
So in the end, we’re back to stacking layers??
Image credit: Ilya Sutskever
49
Not quite!
Everything we’ve seen in this lecture is the backbone that lets us scale deep(er) networks and make them converge
Human/physical knowledge is still extremely helpful and impactful in the short/medium term, both in terms of time and of scale
And most of the time we are dealing with finite resources for training and deployment
Physics is NOT industry
Different scales and different needs
Speed is a crucial requirement, often with no clear equivalent in industry
Not all architectures are suited to every application
50
Conclusions
From a simple set of linear algebra operations, we can construct incredible tools called Deep Neural Networks
Basic neurons are not enough! We need to introduce a set of algorithms and data preprocessing to make learning easier, more stable, and more generalizable
Different ways of using loss functions and leveraging scale seem very promising today… but who knows what the future holds!
51
backup
52
We can do the math if needed
53