1 of 37

Deep Learning (DEEP-0001)

10 – Regularization

2 of 37

Regularization

  • Why is there a generalization gap between training and test data?
    • Overfitting (the model fits statistical peculiarities of the training data)
    • The model is unconstrained in regions with no training examples
  • Regularization = methods to reduce the generalization gap
  • Technically means adding extra terms to the loss function
  • But colloquially means any method (hack) that reduces the gap

3 of 37

Regularization

  • Explicit regularization
  • Early stopping
  • Ensembling
  • Dropout
  • Adding noise
  • Bayesian approaches
  • Transfer learning, multi-task learning, self-supervised learning
  • Data augmentation

4 of 37

Explicit regularization

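The equation on this slide was lost in extraction. In the standard formulation (notation assumed here), explicit regularization adds an extra term g[φ], weighted by λ, to the training loss L[φ]:

```latex
\hat{\boldsymbol{\phi}} = \operatorname*{argmin}_{\boldsymbol{\phi}}
\Bigl[\, L[\boldsymbol{\phi}] + \lambda\, g[\boldsymbol{\phi}] \,\Bigr]
```

The term g[φ] encodes a preference over parameter values; λ controls how strongly that preference is enforced.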

5 of 37

Explicit regularization


6 of 37

Explicit regularization


7 of 37

Explicit regularization

8 of 37

Explicit regularization

9 of 37

Explicit regularization

10 of 37

Probabilistic interpretation

  • Maximum likelihood:

  • Regularization is equivalent to adding a prior over the parameters

    (a prior = what you know about the parameters before seeing the data)
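The formulas on this slide were lost in extraction. In standard notation (assumed here, with training pairs (xᵢ, yᵢ) and parameters φ), maximum likelihood and the prior-augmented (maximum a posteriori) criteria read:

```latex
\hat{\boldsymbol{\phi}} = \operatorname*{argmax}_{\boldsymbol{\phi}}
\prod_{i} \Pr(y_i \mid x_i, \boldsymbol{\phi})
\qquad\text{and}\qquad
\hat{\boldsymbol{\phi}} = \operatorname*{argmax}_{\boldsymbol{\phi}}
\left[\prod_{i} \Pr(y_i \mid x_i, \boldsymbol{\phi})\right] \Pr(\boldsymbol{\phi})
```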

11 of 37

Equivalence

  • Explicit regularization:

  • Probabilistic interpretation:

  • Mapping:
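A plausible reconstruction of the lost mapping (notation assumed): taking the negative log of the maximum a posteriori criterion turns it into a regularized loss,

```latex
-\log\!\left[\prod_i \Pr(y_i \mid x_i, \boldsymbol{\phi})\, \Pr(\boldsymbol{\phi})\right]
= L[\boldsymbol{\phi}] \;-\; \log \Pr(\boldsymbol{\phi}),
```

so the regularizer corresponds to λ·g[φ] = −log Pr(φ), i.e. the prior is Pr(φ) ∝ exp(−λ·g[φ]).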

12 of 37


13 of 37

L2 Regularization

  • The regularization term can only express very general preferences about the parameters
  • Most common is L2 regularization
  • Favors smaller parameters

  • Also called Tikhonov regularization, ridge regression
  • In neural networks, usually just for weights and called weight decay
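As a concrete illustration (a minimal sketch, not from the slides): gradient descent on a 1-D least-squares fit, with the L2 penalty λw² added to the loss. The penalty pulls the fitted weight toward zero.

```python
# Sketch (assumed setup): gradient descent on a 1-D least-squares problem,
# with an L2 penalty lambda_ * w**2 added to the loss ("weight decay").
def fit(xs, ys, lambda_=0.0, lr=0.01, steps=2000):
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # gradient of the mean squared error ...
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # ... plus the gradient of lambda_ * w**2
        grad += 2 * lambda_ * w
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]              # data generated by w = 2 exactly
w_plain = fit(xs, ys)             # converges to ~2.0
w_reg = fit(xs, ys, lambda_=1.0)  # shrunk toward zero by the penalty
```

In frameworks this shows up as the optimizer's weight-decay option rather than an explicit loss term, but the effect is the same shrinkage of the weights.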

14 of 37

Why does L2 regularization help?

  • Discourages slavish adherence to the data (overfitting)
  • Encourages smoothness between datapoints

15 of 37

L2 regularization

16 of 37


17 of 37

Early stopping

  • If we stop training early, weights don’t have time to overfit to noise
  • Weights start small, don’t have time to get large
  • Reduces effective model complexity
  • Known as early stopping
  • Don’t have to re-train
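The bullets above can be sketched as a loop (hypothetical training interface, assumed here): stop once the validation loss has not improved for `patience` consecutive checks, and keep the best step seen.

```python
# Sketch of early stopping: halt when validation loss stops improving.
def train_with_early_stopping(train_step, val_loss, max_steps=1000, patience=5):
    best_loss, best_step, bad_checks = float("inf"), 0, 0
    for step in range(max_steps):
        train_step(step)           # one optimization step (assumed callable)
        loss = val_loss(step)      # validation loss after this step
        if loss < best_loss:
            best_loss, best_step, bad_checks = loss, step, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:   # validation stopped improving
                break
    return best_step, best_loss

# toy validation curve: improves until step 10, then starts to overfit
best_step, best_loss = train_with_early_stopping(
    lambda s: None, lambda s: abs(s - 10) + 1.0)
```

In practice one would also checkpoint the weights at the best step, so no re-training is needed.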

18 of 37

19 of 37


20 of 37

Ensembling

  • Average together several models – an ensemble
  • Can take mean or median
  • Different initializations / different models
  • Different subsets of the data resampled with replacement (bagging)
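The bagging bullet can be sketched as follows (a toy illustration with an assumed "model" that just memorizes the mean of its resample; real bagging trains a full model per resample):

```python
import random

# Sketch: bagging -- each ensemble member is trained on a bootstrap resample
# of the data (drawn with replacement); the ensemble averages their outputs.
def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]   # resample with replacement

def bagged_prediction(data, n_models, rng):
    # toy "model": each member just predicts the mean of its own resample
    member_preds = [sum(s) / len(s)
                    for s in (bootstrap(data, rng) for _ in range(n_models))]
    return sum(member_preds) / len(member_preds)   # average the ensemble

rng = random.Random(0)
pred = bagged_prediction([1.0, 2.0, 3.0, 4.0], n_models=200, rng=rng)
```

Averaging over members trained on different resamples reduces the variance of the prediction.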

21 of 37

22 of 37


23 of 37

Dropout

24 of 37

Dropout

Can eliminate kinks in the function that are far from the data and don’t contribute to the training loss
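The mechanism itself can be sketched in a few lines (a minimal version of "inverted" dropout, assumed here): at training time each hidden unit is zeroed with probability p and survivors are rescaled, so expected activations match test time, where units pass through unchanged.

```python
import random

# Sketch of inverted dropout: zero each unit with probability p during
# training and scale survivors by 1/(1-p); do nothing at test time.
def dropout(activations, p, rng, training=True):
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

rng = random.Random(0)
h = [1.0] * 1000
out = dropout(h, 0.5, rng)   # roughly half zeroed, survivors scaled to 2.0
```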

25 of 37


26 of 37

Adding noise

  • to inputs
  • to weights
  • to outputs (labels)
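The first bullet (noise on inputs) can be sketched as follows (a minimal example with assumed list-of-lists inputs): zero-mean Gaussian noise is added each time an example is used, so the model never sees exactly the same input twice.

```python
import random

# Sketch: perturb training inputs with zero-mean Gaussian noise.
def noisy_inputs(xs, sigma, rng):
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in xs]

rng = random.Random(0)
batch = [[1.0, 2.0], [3.0, 4.0]]
noisy = noisy_inputs(batch, sigma=0.1, rng=rng)
```

Noise on weights and on labels (e.g. label smoothing) follows the same pattern, applied to a different part of the pipeline.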

27 of 37


28 of 37

Bayesian approaches

  • There are many parameters compatible with the data
  • Can find a probability distribution over them

  • Take all possible parameter values into account when making a prediction (weighted by prior info about the parameters)
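A hedged reconstruction of the standard Bayesian formulas this slide likely showed (notation assumed): the posterior over parameters combines likelihood and prior, and predictions integrate over it:

```latex
\Pr(\boldsymbol{\phi} \mid \{x_i, y_i\})
= \frac{\prod_i \Pr(y_i \mid x_i, \boldsymbol{\phi})\, \Pr(\boldsymbol{\phi})}
       {\int \prod_i \Pr(y_i \mid x_i, \boldsymbol{\phi}')\, \Pr(\boldsymbol{\phi}')\, d\boldsymbol{\phi}'},
\qquad
\Pr(y \mid x, \text{data})
= \int \Pr(y \mid x, \boldsymbol{\phi})\, \Pr(\boldsymbol{\phi} \mid \text{data})\, d\boldsymbol{\phi}
```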

29 of 37


30 of 37

Bayesian approaches

31 of 37


32 of 37

  • Transfer learning

  • Multi-task learning

  • Self-supervised learning

33 of 37


34 of 37


35 of 37


36 of 37

Data augmentation
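As a minimal sketch of the idea (assumed toy representation: an image as a list of rows): label-preserving transformations such as a horizontal flip or a one-pixel shift generate extra training examples with the same label.

```python
# Sketch of data augmentation on a tiny 2-D "image" (list of rows).
def hflip(img):
    return [row[::-1] for row in img]       # mirror left-right

def shift_right(img, fill=0):
    return [[fill] + row[:-1] for row in img]   # shift one pixel right

def augment(img, label):
    # each transformed copy keeps the original label
    return [(img, label), (hflip(img), label), (shift_right(img), label)]

augs = augment([[1, 2], [3, 4]], "cat")
```

Real pipelines apply such transformations randomly on the fly, so each epoch sees different variants of every example.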

37 of 37

Regularization overview