
Lecture 11: Training Neural Networks, Part IV

Erik Learned-Miller and TAs, adapted from slides of Fei-Fei Li & Andrej Karpathy & Justin Johnson


19 & 31 Oct 2017


Evaluation: Model Ensembles


  1. Train multiple independent models
  2. At test time average their results

Enjoy 2% extra performance (?!!!)
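
A minimal NumPy sketch of this recipe at test time; the model list and the predict_proba(x) interface are illustrative assumptions, not something from the slides:

    import numpy as np

    def ensemble_predict(models, x):
        # Average the class probabilities of several independently trained
        # models, then pick the highest-scoring class.
        # `predict_proba` is an assumed interface returning a probability vector.
        probs = np.mean([m.predict_proba(x) for m in models], axis=0)
        return int(np.argmax(probs))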




Fun Tips/Tricks:

  • can also get a small boost from averaging multiple model checkpoints of a single model.
  • keep track of (and use at test time) a running average parameter vector:
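
A rough sketch (not from the slides) of one way to keep that running average: an exponential moving average of the weights maintained alongside SGD. The decay value and the compute_gradient helper are hypothetical:

    import numpy as np

    def sgd_with_weight_averaging(W, compute_gradient, lr=1e-3, decay=0.995, steps=1000):
        # Train W with plain SGD while maintaining an exponential moving
        # average W_avg of the parameters; use W_avg at test time.
        W_avg = W.copy()
        for _ in range(steps):
            W -= lr * compute_gradient(W)              # ordinary SGD step
            W_avg = decay * W_avg + (1 - decay) * W    # running average of parameters
        return W, W_avg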


Regularization (dropout)


Regularization: Dropout

“randomly set some neurons to zero in the forward pass”

[Srivastava et al., 2014]


Example forward pass with a 3-layer network using dropout
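
The code from the original slide is not reproduced here; the following is a minimal NumPy sketch in the same spirit, assuming a fully connected 3-layer net with ReLU units and drop probability p = 0.5:

    import numpy as np

    p = 0.5  # probability of keeping a unit active; higher = less dropout

    def train_step_forward(X, W1, b1, W2, b2, W3, b3):
        # Forward pass with dropout applied to both hidden layers.
        H1 = np.maximum(0, np.dot(W1, X) + b1)
        U1 = np.random.rand(*H1.shape) < p      # first dropout mask
        H1 *= U1                                # drop!
        H2 = np.maximum(0, np.dot(W2, H1) + b2)
        U2 = np.random.rand(*H2.shape) < p      # second dropout mask
        H2 *= U2                                # drop!
        out = np.dot(W3, H2) + b3
        return out
        # (backward pass and parameter update omitted)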


Waaaait a second…

How could this possibly be a good idea?


Forces the network to have a redundant representation.

[Figure: a “cat score” is computed from features such as “has an ear”, “has a tail”, “is furry”, “has claws”, and “mischievous look”; dropout crosses out a random subset of these features on each forward pass, so the score cannot rely on any single one.]



Training with occlusions?


Another interpretation:

Dropout is training a large ensemble of models (that share parameters).

Each binary mask is one model; it gets trained on only ~one datapoint.



At test time…

Ideally: want to integrate out all the noise.

Monte Carlo approximation: do many forward passes with different dropout masks, average all predictions.
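
A minimal sketch of that Monte Carlo approximation (the function name and the sample count of 100 are arbitrary); forward_with_dropout is any forward pass that keeps dropout turned on, e.g. the train-time sketch above:

    import numpy as np

    def mc_dropout_predict(forward_with_dropout, x, num_samples=100):
        # Run many stochastic forward passes, each with a fresh dropout mask,
        # and average the resulting predictions.
        preds = [forward_with_dropout(x) for _ in range(num_samples)]
        return np.mean(preds, axis=0)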


At test time…

Can in fact do this with a single forward pass! (approximately)

Leave all input neurons turned on (no dropout).

(This can be shown to be an approximation to evaluating the whole ensemble.)



Q: Suppose that with all inputs present at test time, the output of a neuron is x. What would its output be during training time, in expectation (e.g. if p = 0.5)?


Consider a single neuron a with inputs x and y and weights w0 and w1, with dropout (p = 0.5) applied to its inputs.

during test: a = w0*x + w1*y

during train, averaging over the four equally likely dropout masks:

E[a] = ¼ * (w0*0 + w1*0
          + w0*0 + w1*y
          + w0*x + w1*0
          + w0*x + w1*y)
     = ¼ * (2*w0*x + 2*w1*y)
     = ½ * (w0*x + w1*y)


With p=0.5, using all inputs in the forward pass would inflate the activations by 2x from what the network was “used to” during training!

=> Have to compensate by scaling the activations back down by ½



We can do something approximate analytically.

At test time all neurons are always active, so we must scale the activations so that for each neuron:

output at test time = expected output at training time


Dropout Summary

  • drop: randomly zero activations in the forward pass at training time
  • scale: multiply the activations by p at test time
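
The code from the original slide is not shown here; below is a minimal test-time ("predict") sketch matching the 3-layer example from earlier, with every hidden activation scaled by p so it equals its expected value during training:

    import numpy as np

    p = 0.5  # keep probability used during training

    def predict(X, W1, b1, W2, b2, W3, b3):
        # Test-time forward pass for vanilla dropout: no masks,
        # but each hidden activation is scaled by p.
        H1 = np.maximum(0, np.dot(W1, X) + b1) * p   # NOTE: scale the activations
        H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations
        out = np.dot(W3, H2) + b3
        return out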


More common: “Inverted dropout”

Scale by 1/p at training time instead, so test time is unchanged!
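
A minimal sketch of inverted dropout for one hidden layer (names are illustrative): the mask is divided by p during training, so the test-time pass needs no extra scaling:

    import numpy as np

    p = 0.5  # keep probability

    def train_layer(X, W1, b1):
        # Training time: drop and rescale by 1/p in the same step, so the
        # expected activation already matches the test-time activation.
        H1 = np.maximum(0, np.dot(W1, X) + b1)
        U1 = (np.random.rand(*H1.shape) < p) / p   # dropout mask, scaled by 1/p
        return H1 * U1

    def test_layer(X, W1, b1):
        # Test time: unchanged, no dropout and no extra scaling.
        return np.maximum(0, np.dot(W1, X) + b1)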


Siamese Networks (conjoined networks):

  1. Train a network on classification task.
  2. Chop off classification layer.
  3. Make two copies of the “beheaded network”.
  4. Put one image into each network.
  5. Calculate “distance” between two CNN vectors.
  6. If distance is greater than threshold, say images are “different”. Otherwise, “same”.
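
A minimal sketch of steps 4–6, assuming a hypothetical embed(image) function that is the shared “beheaded” network and an arbitrary threshold value:

    import numpy as np

    def same_or_different(embed, image_a, image_b, threshold=1.0):
        # Run the shared network on both images and compare the Euclidean
        # distance between the two feature vectors.
        # `embed` and `threshold` are illustrative; in practice the threshold
        # would be tuned on validation pairs.
        fa = embed(image_a)
        fb = embed(image_b)
        distance = np.linalg.norm(fa - fb)
        return "different" if distance > threshold else "same"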
