1 of 37

Deep Learning (DEEP-0001)

9 – Performance

2 of 37

Measuring performance

  • MNIST1D dataset model and performance
  • Noise, bias, and variance
  • Reducing variance
  • Reducing bias & bias-variance trade-off
  • Double descent
  • Curse of dimensionality & weird properties of high dimensional space
  • Choosing hyperparameters

3 of 37

MNIST Dataset

4 of 37

MNIST 1D Dataset

5 of 37

Network

  • 40 inputs
  • 10 outputs
  • 4000 training examples (~400 per class)
  • Two hidden layers
    • 100 hidden units each
  • SGD with batch size 100, learning rate 0.1
  • 6000 steps (150 epochs)
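
Below is a minimal PyTorch sketch of this setup. It is a hypothetical stand-in: random tensors replace the real MNIST-1D data, whose loading (e.g., via the mnist1d package) is omitted.

```python
import torch
import torch.nn as nn

# Stand-ins for the MNIST-1D data: 4000 examples, 40 inputs, 10 classes.
x_train = torch.randn(4000, 40)
y_train = torch.randint(0, 10, (4000,))

# Two hidden layers of 100 units each, 10 output logits.
model = nn.Sequential(
    nn.Linear(40, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# 6000 steps of batch size 100 over 4000 examples = 150 epochs.
for step in range(6000):
    idx = torch.randint(0, 4000, (100,))   # random batch of 100
    loss = loss_fn(model(x_train[idx]), y_train[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```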

6 of 37

Results

7 of 37

Need to use separate test data

8 of 37

Need to use separate test data

The model has not generalized well to the new data
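
Continuing the sketch above, a minimal way to expose this gap is to score the same model on data it never saw (x_test and y_test below are hypothetical held-out stand-ins):

```python
import torch

x_test = torch.randn(1000, 40)           # stand-in for held-out inputs
y_test = torch.randint(0, 10, (1000,))   # stand-in for held-out labels

@torch.no_grad()
def accuracy(model, x, y):
    pred = model(x).argmax(dim=1)        # most likely class per example
    return (pred == y).float().mean().item()

# On the real MNIST-1D data this gap is the point of the slide:
# near-perfect training accuracy, much lower test accuracy.
print("train accuracy:", accuracy(model, x_train, y_train))
print("test accuracy: ", accuracy(model, x_test, y_test))
```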

9 of 37

Measuring performance

  • MNIST1D dataset model and performance
  • Noise, bias, and variance
  • Reducing variance
  • Reducing bias & bias-variance trade-off
  • Double descent
  • Curse of dimensionality & weird properties of high dimensional space
  • Choosing hyperparameters

10 of 37

Regression example

11 of 37

Toy model

  • K hidden units
  • First layer fixed so “joints” divide interval evenly
  • Second layer trained
  • But… the model is now linear in h
    • so the cost function is convex
    • the best solution can be found in closed form (see the sketch below)
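
A short NumPy sketch of this closed-form fit (the target function and noise level are illustrative assumptions):

```python
import numpy as np

def design(x, K):
    """Bias column plus K ReLU ramps whose joints divide [0, 1] evenly."""
    joints = np.arange(K) / K
    h = np.maximum(0.0, x[:, None] - joints[None, :])   # (N, K) activations
    return np.column_stack([np.ones_like(x), h])

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)  # noisy target

# The model is linear in the hidden activations, so ordinary least squares
# gives the optimal second-layer weights in closed form.
beta, *_ = np.linalg.lstsq(design(x, K=6), y, rcond=None)
y_fit = design(x, K=6) @ beta
```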

12 of 37

Noise, bias, and variance

  • Noise in measurements
  • Some variables not observed
  • Data mislabeled

13 of 37

Noise, bias, and variance

14 of 37

Noise, bias, and variance

15 of 37

Noise, bias, and variance

  • Variance is the uncertainty in the fitted model due to the choice of training set
  • Bias is the systematic deviation of the average fitted model from the true underlying function, due to limitations in the model
  • Noise is the inherent uncertainty in the true mapping from input to output
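
These three terms can be estimated empirically. The sketch below (a toy polynomial model and a known true function, both assumptions for illustration) fits the same model to many resampled training sets: the spread of the fits is the variance, the gap between their mean and the true function is the bias, and the label noise is fixed by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)     # known true function (toy)
sigma = 0.2                                  # noise standard deviation
x_grid = np.linspace(0, 1, 100)

# Fit the same cubic model to many independently sampled training sets.
fits = []
for trial in range(200):
    x = rng.uniform(0, 1, 15)
    y = true_f(x) + sigma * rng.standard_normal(15)
    fits.append(np.polyval(np.polyfit(x, y, deg=3), x_grid))
fits = np.array(fits)

variance = fits.var(axis=0).mean()                          # spread of fits
bias_sq = ((fits.mean(axis=0) - true_f(x_grid))**2).mean()  # systematic gap
print(f"bias^2 = {bias_sq:.4f}  variance = {variance:.4f}  "
      f"noise = {sigma**2:.4f}")
```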

16 of 37

Measuring performance

  • MNIST1D dataset model and performance
  • Noise, bias, and variance
  • Reducing variance
  • Reducing bias & bias-variance trade-off
  • Double descent
  • Curse of dimensionality & weird properties of high dimensional space
  • Choosing hyperparameters

17 of 37

Variance

18 of 37

Variance

19 of 37

Variance

Can reduce variance by adding more samples
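
The sketch below repeats the experiment above with growing training sets (the sizes are illustrative); the variance of the fitted curve shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)
x_grid = np.linspace(0, 1, 100)

for n in (10, 100, 1000):
    fits = []
    for trial in range(200):                     # resample the training set
        x = rng.uniform(0, 1, n)
        y = true_f(x) + 0.2 * rng.standard_normal(n)
        fits.append(np.polyval(np.polyfit(x, y, deg=3), x_grid))
    var = np.array(fits).var(axis=0).mean()      # variance across fits
    print(f"n = {n:4d}  variance of fitted curve = {var:.5f}")
```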

20 of 37

Measuring performance

  • MNIST1D dataset model and performance
  • Noise, bias, and variance
  • Reducing variance
  • Reducing bias & bias-variance trade-off
  • Double descent
  • Curse of dimensionality & weird properties of high dimensional space
  • Choosing hyperparameters

21 of 37

Reducing bias

22 of 37

Reducing bias

23 of 37

Why does variance increase? Overfitting

The more flexible model describes the training data better, but not the true underlying function (black curve)

[Figure: fitted model with three regions vs. fitted model with ten regions]

24 of 37

Bias and variance trade-off

[Figure: bias and variance as a function of model capacity (number of hidden units / linear regions in the range of the data)]
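
The trade-off can be reproduced with the toy model from earlier; the sketch below (K values, noise, and sample sizes are assumptions) sweeps the number of hidden units and compares train and test error.

```python
import numpy as np

def design(x, K):
    joints = np.arange(K) / K                    # evenly spaced joints
    h = np.maximum(0.0, x[:, None] - joints[None, :])
    return np.column_stack([np.ones_like(x), h])

rng = np.random.default_rng(3)
true_f = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(0, 1, 15)
y_tr = true_f(x_tr) + 0.2 * rng.standard_normal(15)
x_te = rng.uniform(0, 1, 500)
y_te = true_f(x_te) + 0.2 * rng.standard_normal(500)

# Low K: high bias, both errors high.  High K: low bias but high variance,
# so train error falls while test error rises again.
for K in (2, 4, 8, 14):
    beta, *_ = np.linalg.lstsq(design(x_tr, K), y_tr, rcond=None)
    tr = np.mean((design(x_tr, K) @ beta - y_tr) ** 2)
    te = np.mean((design(x_te, K) @ beta - y_te) ** 2)
    print(f"K = {K:2d}  train MSE = {tr:.4f}  test MSE = {te:.4f}")
```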

25 of 37

Measuring performance

  • MNIST1D dataset model and performance
  • Noise, bias, and variance
  • Reducing variance
  • Reducing bias & bias-variance trade-off
  • Double descent
  • Curse of dimensionality & weird properties of high dimensional space
  • Choosing hyperparameters

26 of 37

Number of datapoints

27 of 37

Double descent

28 of 37

29 of 37

  • Note that the training error is very close to zero.
  • Whatever is happening isn’t happening at the training data points
  • So it must be happening between the data points
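
One way to poke at this, a hedged sketch rather than the slide's own experiment: push the toy model past its interpolation threshold. np.linalg.lstsq returns the minimum-norm solution when the system is underdetermined, which acts as an implicit preference for smoother fits, so the test error can fall again after the classical peak.

```python
import numpy as np

def design(x, K):
    joints = np.arange(K) / K
    h = np.maximum(0.0, x[:, None] - joints[None, :])
    return np.column_stack([np.ones_like(x), h])

rng = np.random.default_rng(4)
true_f = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(0, 1, 15)
y_tr = true_f(x_tr) + 0.2 * rng.standard_normal(15)
x_te = np.linspace(0, 1, 500)

# Below K ~ 14 the model behaves classically; beyond it, every model
# interpolates the training data (train MSE ~ 0) and the minimum-norm
# solution often generalizes better again as K grows.
for K in (8, 14, 30, 100, 400):
    beta, *_ = np.linalg.lstsq(design(x_tr, K), y_tr, rcond=None)
    tr = np.mean((design(x_tr, K) @ beta - y_tr) ** 2)
    te = np.mean((design(x_te, K) @ beta - true_f(x_te)) ** 2)
    print(f"K = {K:3d}  train MSE = {tr:.5f}  test MSE = {te:.5f}")
```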

30 of 37

Potential explanation:

    • more hidden units can make smoother functions
    • being smooth between the data points is a reasonable thing to do

But why?

31 of 37

  • All of these solutions are equivalent in terms of loss.
  • Why should the model choose the smooth one?
  • The tendency of a model to choose one solution over another is called its inductive bias

32 of 37

Measuring performance

  • MNIST1D dataset model and performance
  • Noise, bias, and variance
  • Reducing variance
  • Reducing bias & bias-variance trade-off
  • Double descent
  • Curse of dimensionality & weird properties of high dimensional space
  • Choosing hyperparameters

33 of 37

Curse of dimensionality

  • As dimensionality increases, the volume of space grows so fast that the amount of data needed to densely sample it increases exponentially. This phenomenon is known as the curse of dimensionality.
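
Back-of-the-envelope arithmetic for this claim: placing just 10 sample points along each axis needs 10^d points for a d-dimensional grid, so even the 40-dimensional MNIST-1D input space is hopeless to sample densely.

```python
# Samples needed for a grid with 10 points per axis in d dimensions.
for d in (1, 2, 3, 10, 40):
    print(f"d = {d:2d}: 10^{d} = {float(10**d):.1e} samples")
```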

34 of 37

Weird properties of high-dimensional space

  • As the distance from the center increases, the probability decreases, but the volume of space at that radius (i.e., the area between adjacent evenly spaced circles) increases.
  • These factors trade off so that the histogram of distances of samples from the center has a pronounced peak.
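
In symbols, for a standard normal distribution in d dimensions (this is the chi distribution), the distance r of a sample from the center has density

$$p(r) \;\propto\; r^{\,d-1}\, e^{-r^2/2},$$

where the factor r^{d-1} is the growing volume of the shell at radius r, e^{-r^2/2} is the falling probability density, and their product peaks at r = sqrt(d-1).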

35 of 37

Weird properties of high-dimensional space

  • In higher dimensions, this effect becomes more extreme, and the probability of observing a sample close to the mean becomes vanishingly small. Although the most likely point is at the mean of the distribution, the typical samples are found in a relatively narrow shell.
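
A short NumPy sketch of this concentration (sample sizes and dimensions are illustrative): distances of standard-normal samples from the center cluster in a shell near sqrt(d), with nearly constant width.

```python
import numpy as np

rng = np.random.default_rng(5)
for d in (1, 10, 100, 1000):
    x = rng.standard_normal((10000, d))      # 10000 samples in d dimensions
    r = np.linalg.norm(x, axis=1)            # distance from the center
    print(f"d = {d:4d}  mean distance = {r.mean():7.2f}  "
          f"std = {r.std():.2f}  sqrt(d) = {np.sqrt(d):.2f}")
```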

36 of 37

Measuring performance

  • MNIST1D dataset model and performance
  • Noise, bias, and variance
  • Reducing variance
  • Reducing bias & bias-variance trade-off
  • Double descent
  • Curse of dimensionality & weird properties of high dimensional space
  • Choosing hyperparameters

37 of 37

Choosing hyperparameters

  • Don’t know bias or variance
  • Don’t know how much capacity to add
  • How do we choose capacity in practice?
    • Or model structure
    • Or training algorithm
    • Or learning rate
  • Third data set – validation set
    • Train models with different hyperparameters on training set
    • Choose best hyperparameters with validation set
    • Test once with test set
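
A hedged sketch of this protocol (the widths, step counts, and random stand-in data are assumptions): sweep one hyperparameter on the training set, select on the validation set, and touch the test set exactly once.

```python
import torch
import torch.nn as nn

def make_model(width):
    return nn.Sequential(nn.Linear(40, width), nn.ReLU(),
                         nn.Linear(width, 10))

def fit(model, x, y, steps=500):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):                  # full-batch gradient steps
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

# Stand-in splits; real code would split the MNIST-1D data the same way.
splits = {name: (torch.randn(n, 40), torch.randint(0, 10, (n,)))
          for name, n in [("train", 3000), ("val", 500), ("test", 500)]}

best_width, best_acc = None, -1.0
for width in (10, 50, 100, 200):            # candidate hyperparameter values
    model = fit(make_model(width), *splits["train"])
    acc = accuracy(model, *splits["val"])   # select on validation only
    if acc > best_acc:
        best_width, best_acc, best_model = width, acc, model

print(f"chosen width = {best_width}, val acc = {best_acc:.3f}, "
      f"test acc = {accuracy(best_model, *splits['test']):.3f}")  # test once
```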