Lecture 11:
Training Neural Networks
Part IV
Erik Learned-Miller and TAs
Adapted from slides of Fei-Fei Li & Andrej Karpathy & Justin Johnson
19 & 31 Oct 2017
Evaluation:
Model Ensembles
Train multiple independent models and average their predictions at test time. Enjoy ~2% extra performance (?!!!)
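A minimal sketch of test-time ensembling, assuming each trained model exposes a hypothetical predict(X) that returns an (N, num_classes) array of class probabilities:

import numpy as np

def ensemble_predict(models, X):
    # Average the predicted class probabilities of several independently
    # trained models, then take the argmax as the ensemble's prediction.
    probs = np.mean([m.predict(X) for m in models], axis=0)
    return np.argmax(probs, axis=1)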
Fun Tips/Tricks: you can also get a small boost by averaging multiple model checkpoints of a single model.
Fun Tips/Tricks: keep track of (and use at test time) a running average of the parameter vector.
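A minimal sketch of this running-average trick inside a training loop; the dataset/network interface, learning_rate, and the 0.995 decay constant are illustrative assumptions:

# x: live parameter vector being optimized
# x_test: exponentially decaying running average, used at test time
x_test = x.copy()
while True:
    data_batch = dataset.sample_data_batch()
    loss = network.forward(data_batch)
    dx = network.backward()
    x += -learning_rate * dx              # vanilla SGD update
    x_test = 0.995 * x_test + 0.005 * x   # running average of parameters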
Regularization (dropout)
Regularization: Dropout
“randomly set some neurons to zero in the forward pass”
[Srivastava et al., 2014]
Example forward pass with a 3-layer network using dropout
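A sketch of such a forward pass in numpy, assuming weights W1..W3 and biases b1..b3 are defined elsewhere:

import numpy as np

p = 0.5  # probability of keeping a unit active; higher = less dropout

def train_step(X):
    """Forward pass of a 3-layer net with dropout."""
    H1 = np.maximum(0, np.dot(W1, X) + b1)   # first hidden layer, ReLU
    U1 = np.random.rand(*H1.shape) < p       # first dropout mask
    H1 *= U1                                 # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)  # second hidden layer, ReLU
    U2 = np.random.rand(*H2.shape) < p       # second dropout mask
    H2 *= U2                                 # drop!
    out = np.dot(W3, H2) + b3                # output scores
    return out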
Waaaait a second…
How could this possibly be a good idea?
Forces the network to have a redundant representation.
[Figure: a "cat score" output computed from features such as "has an ear", "has a tail", "is furry", "has claws", and "mischievous look"; dropout randomly removes (X) several of these features, so the score cannot rely on any single one.]
An analogy: training with occlusions, where random parts of the input are hidden, so the network cannot rely on any single feature.
Another interpretation:
Dropout is training a large ensemble of models (that share parameters).
Each binary mask is one model, and each such model gets trained on only ~one datapoint.
At test time….
Ideally we want to integrate out all of the dropout noise.
Monte Carlo approximation: do many forward passes with different dropout masks and average all of the predictions.
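A minimal sketch of this Monte Carlo averaging, reusing the train_step dropout forward pass sketched above:

def mc_dropout_predict(X, num_samples=100):
    # Run the stochastic (dropout-on) forward pass many times with fresh
    # random masks and average the outputs: a Monte Carlo estimate of the
    # ensemble prediction.
    return np.mean([train_step(X) for _ in range(num_samples)], axis=0)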
At test time….
Can in fact do this with a single forward pass! (approximately)
Leave all input neurons turned on (no dropout).
(this can be shown to be an approximation to evaluating the whole ensemble)
Q: Suppose that with all inputs present at test time the output of this neuron is x.
What would its output be during training time, in expectation? (e.g. if p = 0.5)
[Figure: a single neuron with output a, inputs x and y, and weights w0 and w1.]

During test: a = w0*x + w1*y
During training, each input is kept independently with probability p = 0.5, so the four possible masks are equally likely:
E[a] = ¼ * (w0*0 + w1*0)
     + ¼ * (w0*0 + w1*y)
     + ¼ * (w0*x + w1*0)
     + ¼ * (w0*x + w1*y)
     = ¼ * (2*w0*x + 2*w1*y)
     = ½ * (w0*x + w1*y)
With p = 0.5, using all inputs in the forward pass would inflate the activations by 2x compared with what the network was “used to” during training!
=> We have to compensate by scaling the activations back down by ½.
We can do something approximate analytically.
At test time all neurons are always active, so we must scale the activations so that, for each neuron:
output at test time = expected output at training time
Dropout Summary
drop in forward pass
scale at test time
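A sketch of the corresponding test-time pass for the 3-layer network above (same assumed W1..W3, b1..b3):

p = 0.5  # probability of keeping a unit active

def predict(X):
    # Test-time forward pass: no units are dropped, but each layer's
    # activations are scaled by p so they match their expected value
    # under training-time dropout.
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p   # NOTE: scale the activations!
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale the activations!
    out = np.dot(W3, H2) + b3
    return out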
More common: “Inverted dropout”
test time is unchanged!
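A sketch of inverted dropout for the same 3-layer network: the scaling by 1/p moves into the training-time mask, so the test-time code needs no changes.

p = 0.5  # probability of keeping a unit active

def train_step(X):
    # Inverted dropout: divide the mask by p at training time so that
    # activations already have the right expected scale.
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # first dropout mask; note the /p!
    H1 *= U1                                   # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p   # second dropout mask; note the /p!
    H2 *= U2                                   # drop!
    out = np.dot(W3, H2) + b3
    return out

def predict(X):
    # Test-time forward pass is unchanged: no masks, no scaling needed.
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out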
Siamese Networks (conjoined networks)