
Lecture 8:

Training Neural Networks

Part II

Erik Learned-Miller and TAs
Adapted from slides of Fei-Fei Li & Andrej Karpathy & Justin Johnson


Problem Set #2

  1. Will be online today. Due on October 15.


Projects: Overview

  • Parts: Proposal, Milestone, Paper, Review Period
  • Proposal: due Oct. 10
    • You can’t use an extension for this!


Projects: Ideas

  • Hyperthyroidism, acromegaly recognition (possible paper)
  • Efficient hyperparameter search
  • Fish recognition using colorization (difficult)
  • Other methods of unsupervised or self-supervised learning
    • Exemplar CNN
    • SphereFace, ArcFace
  • Improved detection using automatically mined hard examples
  • NLP stuff
  • Image captioning
  • Reinforcement Learning
  • Autonomous driving
  • Bias in face recognition
  • Look at projects on the CS231n web site, and check out this page: http://cs231n.stanford.edu/slides/2019/cs231n_2019_section02.pdf


Overview

  • One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
  • Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
  • Evaluation: model ensembles


Data Preprocessing


Step 1: Preprocess the data

(Assume X is an [N x D] data matrix, with one example per row.)
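As a minimal numpy sketch of the two standard steps (zero-centering, then normalizing each feature); the array shapes and values are illustrative:

    import numpy as np

    # Toy data matrix X: N examples as rows, D features as columns.
    N, D = 100, 3
    X = np.random.randn(N, D) * [5.0, 0.1, 20.0] + [10.0, -2.0, 0.5]

    # Zero-center: subtract the per-feature mean (computed over the rows).
    X -= np.mean(X, axis=0)

    # Normalize: divide by the per-feature standard deviation.
    X /= np.std(X, axis=0)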


Preprocessing: Why are we doing this?

  • Subtracting off the mean
    • Avoids gradients that can only point into two orthants: with all-positive inputs, the gradient components for a layer’s weights all share the same sign, so updates zig-zag.
  • Normalizing the magnitude
    • Kilometers vs. millimeters…
      • Gives invariance to the specific *units* of the inputs.


Step 1: Preprocess the data

In practice, you may also see PCA and whitening of the data:

  • PCA (decorrelation): rotate the data so it has a diagonal covariance matrix.
  • Whitening: additionally rescale each dimension so the covariance matrix becomes the identity.
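A sketch of the PCA / whitening recipe in numpy (following the standard CS231n-notes version; the 1e-5 is a small fudge factor to avoid division by zero):

    import numpy as np

    X = np.random.randn(100, 3)            # toy data matrix, N x D
    X -= np.mean(X, axis=0)                # zero-center first

    cov = np.dot(X.T, X) / X.shape[0]      # D x D covariance matrix
    U, S, _ = np.linalg.svd(cov)           # U: eigenvectors, S: eigenvalues

    Xrot = np.dot(X, U)                    # decorrelated data (diagonal covariance)
    Xwhite = Xrot / np.sqrt(S + 1e-5)      # whitened data (identity covariance)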


In practice for images: center only

e.g. consider CIFAR-10, with [32,32,3] images:

  • Subtract the mean image (e.g. AlexNet): the mean image is a [32,32,3] array
  • Subtract the per-channel mean (e.g. VGGNet): the mean along each channel is just 3 numbers

It is not common to normalize the variance, or to do PCA or whitening, for images.
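A sketch of both options on a CIFAR-10-shaped array (the random data here is just a stand-in):

    import numpy as np

    # Stand-in for CIFAR-10 training images: N x 32 x 32 x 3, values in [0, 255].
    X_train = np.random.randint(0, 256, size=(5000, 32, 32, 3)).astype(np.float32)

    # AlexNet-style: subtract the full mean image (a [32, 32, 3] array).
    mean_image = X_train.mean(axis=0)
    X_alex = X_train - mean_image

    # VGGNet-style: subtract one mean per color channel (just 3 numbers).
    channel_mean = X_train.mean(axis=(0, 1, 2))
    X_vgg = X_train - channel_mean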


Another idea for a project

  • The probability integral transform
    • Complicated name for a simple thing.
    • Replace features with the percentile in their own distribution.
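A rough sketch of the idea on a training set (how ties and unseen test values are handled is left open; the function name is made up for illustration):

    import numpy as np

    def percentile_transform(X):
        """Replace each value with its percentile within its own feature's
        empirical distribution (column-wise probability integral transform)."""
        N, D = X.shape
        out = np.empty_like(X, dtype=np.float64)
        for d in range(D):
            ranks = np.argsort(np.argsort(X[:, d]))   # rank of each value, 0..N-1
            out[:, d] = (ranks + 1) / N               # percentiles in (0, 1]
        return out

    X = np.random.randn(1000, 5) * [1.0, 10.0, 100.0, 0.01, 5.0]
    X_pit = percentile_transform(X)   # each column is now ~uniform on (0, 1]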


Weight Initialization


  • Q: what happens when W=0 init is used?


  • First idea: Small random numbers

(Gaussian with zero mean and 1e-2 standard deviation)
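In numpy this corresponds to something like the following (D and H are illustrative layer sizes):

    import numpy as np

    D, H = 4096, 500
    W = 0.01 * np.random.randn(D, H)   # zero-mean Gaussian, std = 1e-2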



Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.


Let’s look at some activation statistics.

E.g. a 10-layer net with 500 neurons in each layer, using tanh non-linearities, and initialized as described on the previous slide.
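A sketch reconstructing this experiment (the exact constants and random data are assumptions, but they match the usual setup of this demo):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(1000, 500)          # random input data: 1000 examples, 500 dims
    hidden_sizes = [500] * 10               # 10 layers of 500 neurons each

    H = X
    for i, fan_out in enumerate(hidden_sizes):
        fan_in = H.shape[1]
        W = 0.01 * np.random.randn(fan_in, fan_out)   # init from the previous slide
        H = np.tanh(np.dot(H, W))                     # forward through one tanh layer
        print('layer %2d: mean %+.6f, std %.6f' % (i + 1, H.mean(), H.std()))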


All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about backward pass for a W*X gate.


With the weights scaled by *1.0 instead of *0.01, almost all neurons become completely saturated, at either -1 or 1, so the gradients will all be zero.


“Xavier initialization”

[Glorot et al., 2010]

A reasonable initialization.

(The mathematical derivation assumes linear activations.)
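The rule amounts to scaling the Gaussian by 1/sqrt(fan_in), e.g. (a sketch; layer sizes illustrative):

    import numpy as np

    fan_in, fan_out = 500, 500
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)   # "Xavier" initialization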


But when using the ReLU nonlinearity, it breaks: the activation distributions collapse toward zero in the deeper layers.


He et al., 2015

(note the additional /2: divide by sqrt(fan_in / 2) instead of sqrt(fan_in), accounting for ReLU zeroing out half of the activations)
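In the same style as above, the He et al. rule for ReLU layers looks like (a sketch):

    import numpy as np

    fan_in, fan_out = 500, 500
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)   # note the extra /2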




Proper initialization is an active area of research…

  • Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
  • Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
  • Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
  • All you need is a good init, Mishkin and Matas, 2015


Batch Normalization

“you want unit Gaussian activations? just make them so.”

[Ioffe and Szegedy, 2015]

consider a batch of activations at some layer. To make each dimension unit Gaussian, apply:
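That is, per dimension k (following Ioffe and Szegedy, 2015):

    \hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\left[x^{(k)}\right]}}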

this is a vanilla differentiable function...


Batch Normalization

[Ioffe and Szegedy, 2015]

Consider the input X as an N x D matrix (N examples in the mini-batch, D feature dimensions):

  1. Compute the empirical mean and variance independently for each dimension.
  2. Normalize each dimension using those statistics.
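A minimal numpy sketch of these two steps for a batch X of shape (N, D); eps is a small constant for numerical stability (an addition, not on the slide):

    import numpy as np

    def batchnorm_forward(X, eps=1e-5):
        mu = X.mean(axis=0)                     # 1. per-dimension empirical mean, shape (D,)
        var = X.var(axis=0)                     #    per-dimension empirical variance, shape (D,)
        X_hat = (X - mu) / np.sqrt(var + eps)   # 2. normalize each dimension
        return X_hat

    X = np.random.randn(64, 100) * 3.0 + 5.0
    X_hat = batchnorm_forward(X)
    print(X_hat.mean(axis=0)[:3], X_hat.std(axis=0)[:3])   # ~0 and ~1 per dimension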


Batch Normalization

[Ioffe and Szegedy, 2015]

... → FC → BN → tanh → FC → BN → tanh → ...

Usually inserted after Fully Connected (or Convolutional, as we’ll see soon) layers, and before the nonlinearity.
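The slides are framework-agnostic; purely to illustrate this placement, a hypothetical PyTorch stack (an assumption, not something the lecture prescribes) might look like:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 500),     # FC
        nn.BatchNorm1d(500),     # BN right after the fully connected layer...
        nn.Tanh(),               # ...and before the nonlinearity
        nn.Linear(500, 500),
        nn.BatchNorm1d(500),
        nn.Tanh(),
        nn.Linear(500, 10),
    )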


Batch Normalization

[Ioffe and Szegedy, 2015]

Normalize:

    x_hat = (x - E[x]) / sqrt(Var[x])

And then allow the network to squash the range if it wants to:

    y = gamma * x_hat + beta

Note: the network can learn gamma = sqrt(Var[x]) and beta = E[x] to recover the identity mapping.
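Continuing the numpy sketch from above, the learnable scale and shift, and the identity-recovery setting, look roughly like this:

    import numpy as np

    X = np.random.randn(64, 100) * 3.0 + 5.0
    mu, var = X.mean(axis=0), X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + 1e-5)

    # Learnable per-dimension scale (gamma) and shift (beta): y = gamma * x_hat + beta.
    # Setting gamma = sqrt(Var[x]) and beta = E[x] recovers the identity mapping:
    gamma, beta = np.sqrt(var + 1e-5), mu
    y = gamma * X_hat + beta
    print(np.allclose(y, X))   # True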


Batch Normalization

[Ioffe and Szegedy, 2015]

  • Improves gradient flow through the network
  • Allows higher learning rates
  • Reduces the strong dependence on initialization
  • Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe


Batch Normalization

[Ioffe and Szegedy, 2015]

Note: at test time the BatchNorm layer functions differently:

The mean/std are not computed from the batch at hand. Instead, a single fixed empirical mean and std of the activations, estimated during training, are used.

(e.g. can be estimated during training with running averages)
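A sketch of the running-average bookkeeping (the momentum value and class structure are assumptions for illustration):

    import numpy as np

    class BatchNorm1D:
        def __init__(self, D, eps=1e-5, momentum=0.9):
            self.eps, self.momentum = eps, momentum
            self.running_mean = np.zeros(D)
            self.running_var = np.ones(D)

        def forward(self, X, train=True):
            if train:
                mu, var = X.mean(axis=0), X.var(axis=0)
                # update exponential running averages for use at test time
                m = self.momentum
                self.running_mean = m * self.running_mean + (1 - m) * mu
                self.running_var = m * self.running_var + (1 - m) * var
            else:
                # test time: use the fixed statistics estimated during training
                mu, var = self.running_mean, self.running_var
            return (X - mu) / np.sqrt(var + self.eps)

    bn = BatchNorm1D(D=100)
    for _ in range(50):
        bn.forward(np.random.randn(64, 100) * 3.0 + 5.0, train=True)     # training steps
    out = bn.forward(np.random.randn(64, 100) * 3.0 + 5.0, train=False)  # test-time use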
