Lecture 8:
Training Neural Networks
Part II
Erik Learned-Miller and TAs. Adapted from slides of Fei-Fei Li & Andrej Karpathy & Justin Johnson
Problem Set #2
Projects: Overview
Projects: Ideas
Overview
activation functions, preprocessing, weight initialization, regularization, gradient checking
babysitting the learning process,
parameter updates, hyperparameter optimization
model ensembles
Data Preprocessing
Step 1: Preprocess the data
(Assume X [N x D] is the data matrix; each example is a row.)
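A minimal numpy sketch of the two usual steps, zero-centering and normalizing, assuming X is an N x D array with one example per row (the data here is a random stand-in):

```python
import numpy as np

N, D = 100, 3
X = np.random.randn(N, D) * 5.0 + 2.0   # stand-in data, one example per row

# Zero-center: subtract the per-dimension mean so each column has mean 0.
X -= np.mean(X, axis=0)

# Normalize: divide by the per-dimension std so each column has unit variance.
X /= np.std(X, axis=0)
```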
Preprocessing: Why are we doing this?
Step 1: Preprocess the data
In practice, you may also see PCA and whitening of the data:
PCA / decorrelation (the data's covariance matrix becomes diagonal)
Whitening (the covariance matrix becomes the identity matrix)
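A possible numpy sketch of both transforms on zero-centered data (array names illustrative; the small epsilon avoids division by zero for near-zero eigenvalues):

```python
import numpy as np

N, D = 100, 3
X = np.random.randn(N, D)                 # stand-in data, one example per row
X -= np.mean(X, axis=0)                   # zero-center first

cov = np.dot(X.T, X) / X.shape[0]         # D x D covariance matrix
U, S, _ = np.linalg.svd(cov)              # eigenbasis of the covariance

Xrot = np.dot(X, U)                       # decorrelated: diagonal covariance
Xwhite = Xrot / np.sqrt(S + 1e-5)         # whitened: ~identity covariance
```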
In practice for images: center only.
E.g., consider CIFAR-10, with [32, 32, 3] images. Either:
subtract the mean image (the mean image is a [32, 32, 3] array), or
subtract the per-channel mean (the mean along each channel: 3 numbers).
It is not common to normalize the variance, or to do PCA or whitening.
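For concreteness, the two centering options for a CIFAR-10-sized training set might look like this (the data here is random filler):

```python
import numpy as np

# Stand-in for a CIFAR-10-like training set: N images of shape [32, 32, 3].
X_train = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype(np.float32)

# Option 1: subtract the mean image (a [32, 32, 3] array).
mean_image = X_train.mean(axis=0)
X_option1 = X_train - mean_image

# Option 2: subtract the per-channel mean (3 numbers, one per color channel).
channel_mean = X_train.mean(axis=(0, 1, 2))    # shape (3,)
X_option2 = X_train - channel_mean
```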
Another idea for a project
Weight Initialization
(Gaussian with zero mean and 1e-2 standard deviation)
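As a minimal sketch of this "small random numbers" idea (the layer sizes D and H are illustrative):

```python
import numpy as np

D, H = 500, 500                      # illustrative fan-in and fan-out
W = 0.01 * np.random.randn(D, H)     # zero-mean Gaussian, std 0.01
```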
(Gaussian with zero mean and 1e-2 standard deviation)
Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.
Let’s look at some activation statistics
E.g., a 10-layer net with 500 neurons in each layer, using tanh nonlinearities, and initialized as described on the last slide.
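A small numpy reproduction of this experiment, under the same assumptions (10 tanh layers of 500 units, weights drawn as 0.01 * randn), would be roughly:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 500)       # a batch of unit-Gaussian inputs
hidden_sizes = [500] * 10            # 10 layers of 500 neurons each

h = X
for i, size in enumerate(hidden_sizes):
    fan_in = h.shape[1]
    W = 0.01 * np.random.randn(fan_in, size)   # init from the last slide
    h = np.tanh(h.dot(W))                      # tanh nonlinearity
    print('layer %2d: mean %+.6f, std %.6f' % (i + 1, h.mean(), h.std()))
```

Printing the per-layer mean and standard deviation shows the std collapsing toward zero with depth, which is what the activation histograms on the next slides illustrate.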
All activations become zero!
Q: Think about the backward pass. What do the gradients look like?
Hint: think about the backward pass for a W*X gate.
With *1.0 instead of *0.01 (i.e., a standard deviation of 1.0), almost all neurons are completely saturated at either -1 or 1, so the gradients will all be zero.
“Xavier initialization”
[Glorot et al., 2010]
Reasonable initialization.
(Mathematical derivation assumes linear activations)
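As a sketch, the form of Xavier initialization used here scales each weight matrix by 1/sqrt(fan_in), so the output variance roughly matches the input variance under the linear-activation assumption (sizes illustrative):

```python
import numpy as np

fan_in, fan_out = 500, 500   # illustrative layer sizes
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```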
But when using the ReLU nonlinearity, it breaks.
He et al., 2015
(note the additional factor of 2: the fan-in is divided by 2 inside the square root, compensating for ReLU zeroing out half of its inputs)
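A corresponding sketch of the He initialization, with the extra factor of 2 inside the square root (sizes illustrative):

```python
import numpy as np

fan_in, fan_out = 500, 500   # illustrative layer sizes
# Dividing fan_in by 2 compensates for ReLU zeroing out half of its inputs.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
```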
Proper initialization is an active area of research…
Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
All you need is a good init, Mishkin and Matas, 2015
…
Batch Normalization
“you want unit Gaussian activations? just make them so.”
[Ioffe and Szegedy, 2015]
Consider a batch of activations at some layer. To make each dimension unit Gaussian, apply
$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$
This is a vanilla differentiable function...
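A minimal numpy sketch of this normalization as a plain differentiable function (function name hypothetical; a small epsilon is added for numerical stability, as in the paper):

```python
import numpy as np

def batchnorm_normalize(x, eps=1e-5):
    """Make each dimension (column) of a batch x of shape (N, D)
    approximately unit Gaussian: zero mean, unit variance."""
    mu = x.mean(axis=0)                    # per-dimension empirical mean, shape (D,)
    var = x.var(axis=0)                    # per-dimension empirical variance, shape (D,)
    return (x - mu) / np.sqrt(var + eps)   # every operation here is differentiable
```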
Batch Normalization
“you want unit Gaussian activations? just make them so.”
[Ioffe and Szegedy, 2015]
(Consider the input X as an N x D matrix: N examples in the batch, each of dimension D.)
1. Compute the empirical mean and variance independently for each dimension.
2. Normalize.
Batch Normalization
[Ioffe and Szegedy, 2015]
FC -> BN -> tanh -> FC -> BN -> tanh -> ...
Usually inserted after fully connected (or convolutional, as we'll see soon) layers, and before the nonlinearity.
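A toy sketch of where such a layer sits in the stack, using a hypothetical helper that applies one FC -> BN -> tanh block:

```python
import numpy as np

def fc_bn_tanh(x, W, b, eps=1e-5):
    """One FC -> BN -> tanh block. BN here is training-mode only and
    omits the learnable gamma/beta to keep the sketch short."""
    a = x.dot(W) + b                                          # fully connected
    a = (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)   # batch norm
    return np.tanh(a)                                         # nonlinearity last
```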
Batch Normalization
[Ioffe and Szegedy, 2015]
Normalize:
$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$
And then allow the network to squash the range if it wants to:
$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
Note: the network can learn $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathrm{E}[x^{(k)}]$ to recover the identity mapping.
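Extending the earlier normalization sketch with the learnable per-dimension scale gamma and shift beta (function and parameter names are illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize x (shape (N, D)), then scale and shift with learnable
    per-dimension parameters gamma and beta (each of shape (D,))."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Setting gamma = sqrt(var) and beta = mu would undo the normalization,
    # i.e. the network can learn the identity mapping if that is better.
    return gamma * x_hat + beta
```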
Batch Normalization
[Ioffe and Szegedy, 2015]
Batch Normalization
[Ioffe and Szegedy, 2015]
Note: at test time the BatchNorm layer functions differently:
The mean and std are not computed from the current batch. Instead, a single fixed empirical mean and std of the activations, estimated during training (e.g., with running averages), is used.
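A sketch of the train/test distinction, keeping running averages in a small state dict (the function name, state layout, and momentum value are illustrative, not the paper's exact bookkeeping):

```python
import numpy as np

def batchnorm_apply(x, gamma, beta, state, mode='train',
                    eps=1e-5, momentum=0.9):
    """state is a dict with 'running_mean' and 'running_var',
    each an array of shape (D,), initialized e.g. to zeros and ones."""
    if mode == 'train':
        mu, var = x.mean(axis=0), x.var(axis=0)
        # Accumulate running estimates for use at test time.
        state['running_mean'] = momentum * state['running_mean'] + (1 - momentum) * mu
        state['running_var'] = momentum * state['running_var'] + (1 - momentum) * var
    else:
        # Test time: use the fixed statistics estimated during training.
        mu, var = state['running_mean'], state['running_var']
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```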