1 of 37

Deep Learning (DEEP-0001)

10 – Regularization

2 of 37

Regularization

  • Why is there a generalization gap between training and test data?
    • Overfitting (the model fits statistical peculiarities of the training data)
    • The model is unconstrained in regions with no training examples
  • Regularization = methods to reduce the generalization gap
  • Technically means adding extra terms to the loss function
  • But colloquially means any method (hack) that reduces the gap

3 of 37

Regularization

  • Explicit regularization
  • Early stopping
  • Ensembling
  • Dropout
  • Adding noise
  • Bayesian approaches
  • Transfer learning, multi-task learning, self-supervised learning
  • Data augmentation

4 of 37

Explicit regularization

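The equation on this slide was lost in extraction. In the standard formulation (notation assumed here), explicit regularization adds an extra term g[φ], weighted by λ, to the training loss L[φ]:

```latex
\hat{\boldsymbol{\phi}} = \operatorname*{argmin}_{\boldsymbol{\phi}}
\Bigl[\, L[\boldsymbol{\phi}] + \lambda\, g[\boldsymbol{\phi}] \,\Bigr]
```

The term g[φ] encodes a preference over parameter values; λ controls how strongly that preference is enforced.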

5 of 37

Explicit regularization


6 of 37

Explicit regularization


7 of 37

Explicit regularization

8 of 37

Explicit regularization

9 of 37

Explicit regularization

10 of 37

Probabilistic interpretation

  • Maximum likelihood:

  • Regularization is equivalent to adding a prior over the parameters

    (a prior = what you know about the parameters before seeing the data)
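The formulas on this slide were lost in extraction. In standard notation (assumed here, with training pairs (xᵢ, yᵢ) and parameters φ), maximum likelihood and the prior-augmented (maximum a posteriori) criteria read:

```latex
\hat{\boldsymbol{\phi}} = \operatorname*{argmax}_{\boldsymbol{\phi}}
\prod_{i} \Pr(y_i \mid x_i, \boldsymbol{\phi})
\qquad\text{and}\qquad
\hat{\boldsymbol{\phi}} = \operatorname*{argmax}_{\boldsymbol{\phi}}
\left[\prod_{i} \Pr(y_i \mid x_i, \boldsymbol{\phi})\right] \Pr(\boldsymbol{\phi})
```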

11 of 37

Equivalence

  • Explicit regularization:

  • Probabilistic interpretation:

  • Mapping:
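A plausible reconstruction of the lost mapping (notation assumed): taking the negative log of the maximum a posteriori criterion turns it into a regularized loss,

```latex
-\log\!\left[\prod_i \Pr(y_i \mid x_i, \boldsymbol{\phi})\, \Pr(\boldsymbol{\phi})\right]
= L[\boldsymbol{\phi}] \;-\; \log \Pr(\boldsymbol{\phi}),
```

so the regularizer corresponds to λ·g[φ] = −log Pr(φ), i.e. the prior is Pr(φ) ∝ exp(−λ·g[φ]).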

12 of 37


13 of 37

L2 Regularization

  • The regularization term can only express very general preferences about the parameters
  • Most common is L2 regularization
  • Favors smaller parameters

  • Also called Tikhonov regularization, ridge regression
  • In neural networks, usually just for weights and called weight decay
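As a concrete illustration (a minimal sketch, not from the slides): gradient descent on a 1-D least-squares fit, with the L2 penalty λw² added to the loss. The penalty pulls the fitted weight toward zero.

```python
# Sketch (assumed setup): gradient descent on a 1-D least-squares problem,
# with an L2 penalty lambda_ * w**2 added to the loss ("weight decay").
def fit(xs, ys, lambda_=0.0, lr=0.01, steps=2000):
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # gradient of the mean squared error ...
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # ... plus the gradient of lambda_ * w**2
        grad += 2 * lambda_ * w
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]              # data generated by w = 2 exactly
w_plain = fit(xs, ys)             # converges to ~2.0
w_reg = fit(xs, ys, lambda_=1.0)  # shrunk toward zero by the penalty
```

In frameworks this shows up as the optimizer's weight-decay option rather than an explicit loss term, but the effect is the same shrinkage of the weights.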

14 of 37

Why does L2 regularization help?

  • Discourages slavish adherence to the data (overfitting)
  • Encourages smoothness between datapoints

15 of 37

L2 regularization

16 of 37


17 of 37

Early stopping

  • If we stop training early, weights don’t have time to overfit to noise
  • Weights start small, don’t have time to get large
  • Reduces effective model complexity
  • Known as early stopping
  • Don’t have to re-train
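The bullets above can be sketched as a loop (hypothetical training interface, assumed here): stop once the validation loss has not improved for `patience` consecutive checks, and keep the best step seen.

```python
# Sketch of early stopping: halt when validation loss stops improving.
def train_with_early_stopping(train_step, val_loss, max_steps=1000, patience=5):
    best_loss, best_step, bad_checks = float("inf"), 0, 0
    for step in range(max_steps):
        train_step(step)           # one optimization step (assumed callable)
        loss = val_loss(step)      # validation loss after this step
        if loss < best_loss:
            best_loss, best_step, bad_checks = loss, step, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:   # validation stopped improving
                break
    return best_step, best_loss

# toy validation curve: improves until step 10, then starts to overfit
best_step, best_loss = train_with_early_stopping(
    lambda s: None, lambda s: abs(s - 10) + 1.0)
```

In practice one would also checkpoint the weights at the best step, so no re-training is needed.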

18 of 37

19 of 37


20 of 37

Ensembling

  • Average together several models – an ensemble
  • Can take mean or median
  • Different initializations / different models
  • Different subsets of the data resampled with replacement (bagging)
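The bagging bullet can be sketched as follows (a toy illustration with an assumed "model" that just memorizes the mean of its resample; real bagging trains a full model per resample):

```python
import random

# Sketch: bagging -- each ensemble member is trained on a bootstrap resample
# of the data (drawn with replacement); the ensemble averages their outputs.
def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]   # resample with replacement

def bagged_prediction(data, n_models, rng):
    # toy "model": each member just predicts the mean of its own resample
    member_preds = [sum(s) / len(s)
                    for s in (bootstrap(data, rng) for _ in range(n_models))]
    return sum(member_preds) / len(member_preds)   # average the ensemble

rng = random.Random(0)
pred = bagged_prediction([1.0, 2.0, 3.0, 4.0], n_models=200, rng=rng)
```

Averaging over members trained on different resamples reduces the variance of the prediction.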

21 of 37

22 of 37


23 of 37

Dropout

24 of 37

Dropout

Can eliminate kinks in the function that are far from the data and don’t contribute to the training loss
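The mechanism itself can be sketched in a few lines (a minimal version of "inverted" dropout, assumed here): at training time each hidden unit is zeroed with probability p and survivors are rescaled, so expected activations match test time, where units pass through unchanged.

```python
import random

# Sketch of inverted dropout: zero each unit with probability p during
# training and scale survivors by 1/(1-p); do nothing at test time.
def dropout(activations, p, rng, training=True):
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

rng = random.Random(0)
h = [1.0] * 1000
out = dropout(h, 0.5, rng)   # roughly half zeroed, survivors scaled to 2.0
```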

25 of 37


26 of 37

Adding noise

  • to inputs
  • to weights
  • to outputs (labels)
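The first bullet (noise on inputs) can be sketched as follows (a minimal example with assumed list-of-lists inputs): zero-mean Gaussian noise is added each time an example is used, so the model never sees exactly the same input twice.

```python
import random

# Sketch: perturb training inputs with zero-mean Gaussian noise.
def noisy_inputs(xs, sigma, rng):
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in xs]

rng = random.Random(0)
batch = [[1.0, 2.0], [3.0, 4.0]]
noisy = noisy_inputs(batch, sigma=0.1, rng=rng)
```

Noise on weights and on labels (e.g. label smoothing) follows the same pattern, applied to a different part of the pipeline.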

27 of 37


28 of 37

Bayesian approaches

  • There are many parameters compatible with the data
  • Can find a probability distribution over them

  • Take all possible parameter values into account when making a prediction (weighted by prior info about the parameters)
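A hedged reconstruction of the standard Bayesian formulas this slide likely showed (notation assumed): the posterior over parameters combines likelihood and prior, and predictions integrate over it:

```latex
\Pr(\boldsymbol{\phi} \mid \{x_i, y_i\})
= \frac{\prod_i \Pr(y_i \mid x_i, \boldsymbol{\phi})\, \Pr(\boldsymbol{\phi})}
       {\int \prod_i \Pr(y_i \mid x_i, \boldsymbol{\phi}')\, \Pr(\boldsymbol{\phi}')\, d\boldsymbol{\phi}'},
\qquad
\Pr(y \mid x, \text{data})
= \int \Pr(y \mid x, \boldsymbol{\phi})\, \Pr(\boldsymbol{\phi} \mid \text{data})\, d\boldsymbol{\phi}
```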

29 of 37


30 of 37

Bayesian approaches

31 of 37


32 of 37

  • Transfer learning

  • Multi-task learning

  • Self-supervised learning

33 of 37


34 of 37


35 of 37


36 of 37

Data augmentation
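As a minimal sketch of the idea (assumed toy representation: an image as a list of rows): label-preserving transformations such as a horizontal flip or a one-pixel shift generate extra training examples with the same label.

```python
# Sketch of data augmentation on a tiny 2-D "image" (list of rows).
def hflip(img):
    return [row[::-1] for row in img]       # mirror left-right

def shift_right(img, fill=0):
    return [[fill] + row[:-1] for row in img]   # shift one pixel right

def augment(img, label):
    # each transformed copy keeps the original label
    return [(img, label), (hflip(img), label), (shift_right(img), label)]

augs = augment([[1, 2], [3, 4]], "cat")
```

Real pipelines apply such transformations randomly on the fly, so each epoch sees different variants of every example.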

37 of 37

Regularization overview