
Lecture 8:

Training Neural Networks

Part II

Erik Learned-Miller and TAs
Adapted from slides of Fei-Fei Li & Andrej Karpathy & Justin Johnson


Problem Set #2

  1. Will be online today. Due on October 15.


Projects: Overview

  • Parts: Proposal, Milestone, Paper, Review Period
  • Proposal: due Oct. 10
    • You can’t use an extension for this!


Projects: Ideas

  • Hyperthyroidism, acromegaly recognition (possible paper)
  • Efficient hyperparameter search
  • Fish recognition using colorization (difficult)
  • Other methods of unsupervised or self-supervised learning
    • Exemplar CNN
    • SphereFace, ArcFace
  • Improved detection using automatically mined hard examples
  • NLP stuff
  • Image captioning
  • Reinforcement Learning
  • Autonomous driving
  • Bias in face recognition
  • Look at projects on the CS231n web site, and check out this page: http://cs231n.stanford.edu/slides/2019/cs231n_2019_section02.pdf


Overview

  • One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
  • Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
  • Evaluation: model ensembles


Data Preprocessing


Step 1: Preprocess the data

(Assume X is an [N x D] data matrix, with one example per row.)
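As a minimal numpy sketch of the two standard steps (zero-centering, then normalizing each feature); the array shapes and values are illustrative:

    import numpy as np

    # Toy data matrix X: N examples as rows, D features as columns.
    N, D = 100, 3
    X = np.random.randn(N, D) * [5.0, 0.1, 20.0] + [10.0, -2.0, 0.5]

    # Zero-center: subtract the per-feature mean (computed over the rows).
    X -= np.mean(X, axis=0)

    # Normalize: divide by the per-feature standard deviation.
    X /= np.std(X, axis=0)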


Preprocessing: Why are we doing this?

  • Subtracting off the mean
    • Avoids gradients that can only point into two orthants: with all-positive inputs, the gradient components for a layer’s weights all share the same sign, so updates zig-zag.
  • Normalizing the magnitude
    • Kilometers vs. millimeters…
      • Gives invariance to the specific *units* of the inputs.


Step 1: Preprocess the data

In practice, you may also see PCA and whitening of the data:

  • PCA (decorrelation): rotate the data so it has a diagonal covariance matrix.
  • Whitening: additionally rescale each dimension so the covariance matrix becomes the identity.
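A sketch of the PCA / whitening recipe in numpy (following the standard CS231n-notes version; the 1e-5 is a small fudge factor to avoid division by zero):

    import numpy as np

    X = np.random.randn(100, 3)            # toy data matrix, N x D
    X -= np.mean(X, axis=0)                # zero-center first

    cov = np.dot(X.T, X) / X.shape[0]      # D x D covariance matrix
    U, S, _ = np.linalg.svd(cov)           # U: eigenvectors, S: eigenvalues

    Xrot = np.dot(X, U)                    # decorrelated data (diagonal covariance)
    Xwhite = Xrot / np.sqrt(S + 1e-5)      # whitened data (identity covariance)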


In practice for images: center only

e.g. consider CIFAR-10, with [32,32,3] images:

  • Subtract the mean image (e.g. AlexNet): the mean image is a [32,32,3] array
  • Subtract the per-channel mean (e.g. VGGNet): the mean along each channel is just 3 numbers

It is not common to normalize the variance, or to do PCA or whitening, for images.
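A sketch of both options on a CIFAR-10-shaped array (the random data here is just a stand-in):

    import numpy as np

    # Stand-in for CIFAR-10 training images: N x 32 x 32 x 3, values in [0, 255].
    X_train = np.random.randint(0, 256, size=(5000, 32, 32, 3)).astype(np.float32)

    # AlexNet-style: subtract the full mean image (a [32, 32, 3] array).
    mean_image = X_train.mean(axis=0)
    X_alex = X_train - mean_image

    # VGGNet-style: subtract one mean per color channel (just 3 numbers).
    channel_mean = X_train.mean(axis=(0, 1, 2))
    X_vgg = X_train - channel_mean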


Another idea for a project

  • The probability integral transform
    • Complicated name for a simple thing.
    • Replace features with the percentile in their own distribution.
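A rough sketch of the idea on a training set (how ties and unseen test values are handled is left open; the function name is made up for illustration):

    import numpy as np

    def percentile_transform(X):
        """Replace each value with its percentile within its own feature's
        empirical distribution (column-wise probability integral transform)."""
        N, D = X.shape
        out = np.empty_like(X, dtype=np.float64)
        for d in range(D):
            ranks = np.argsort(np.argsort(X[:, d]))   # rank of each value, 0..N-1
            out[:, d] = (ranks + 1) / N               # percentiles in (0, 1]
        return out

    X = np.random.randn(1000, 5) * [1.0, 10.0, 100.0, 0.01, 5.0]
    X_pit = percentile_transform(X)   # each column is now ~uniform on (0, 1]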


Weight Initialization


  • Q: what happens when W=0 init is used?


  • First idea: Small random numbers

(Gaussian with zero mean and 1e-2 standard deviation)
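In numpy this corresponds to something like the following (D and H are illustrative layer sizes):

    import numpy as np

    D, H = 4096, 500
    W = 0.01 * np.random.randn(D, H)   # zero-mean Gaussian, std = 1e-2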



Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.


Let’s look at some activation statistics.

E.g. a 10-layer net with 500 neurons in each layer, using tanh non-linearities, and initialized as described on the previous slide.
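A sketch reconstructing this experiment (the exact constants and random data are assumptions, but they match the usual setup of this demo):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(1000, 500)          # random input data: 1000 examples, 500 dims
    hidden_sizes = [500] * 10               # 10 layers of 500 neurons each

    H = X
    for i, fan_out in enumerate(hidden_sizes):
        fan_in = H.shape[1]
        W = 0.01 * np.random.randn(fan_in, fan_out)   # init from the previous slide
        H = np.tanh(np.dot(H, W))                     # forward through one tanh layer
        print('layer %2d: mean %+.6f, std %.6f' % (i + 1, H.mean(), H.std()))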


All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about backward pass for a W*X gate.


With the weights scaled by *1.0 instead of *0.01, almost all neurons become completely saturated, at either -1 or 1, so the gradients will all be zero.


“Xavier initialization”

[Glorot et al., 2010]

A reasonable initialization.

(The mathematical derivation assumes linear activations.)
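The rule amounts to scaling the Gaussian by 1/sqrt(fan_in), e.g. (a sketch; layer sizes illustrative):

    import numpy as np

    fan_in, fan_out = 500, 500
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)   # "Xavier" initialization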


But when using the ReLU nonlinearity, it breaks: the activation distributions collapse toward zero in the deeper layers.


He et al., 2015

(note the additional /2: divide by sqrt(fan_in / 2) instead of sqrt(fan_in), accounting for ReLU zeroing out half of the activations)
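In the same style as above, the He et al. rule for ReLU layers looks like (a sketch):

    import numpy as np

    fan_in, fan_out = 500, 500
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)   # note the extra /2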




Proper initialization is an active area of research…

  • Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
  • Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
  • Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
  • All you need is a good init, Mishkin and Matas, 2015


Batch Normalization

“you want unit Gaussian activations? just make them so.”

[Ioffe and Szegedy, 2015]

consider a batch of activations at some layer. To make each dimension unit Gaussian, apply:
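That is, per dimension k (following Ioffe and Szegedy, 2015):

    \hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\left[x^{(k)}\right]}}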

this is a vanilla differentiable function...


Batch Normalization

[Ioffe and Szegedy, 2015]

Consider the input X as an N x D matrix (N examples in the mini-batch, D feature dimensions):

  1. Compute the empirical mean and variance independently for each dimension.
  2. Normalize each dimension using those statistics.
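A minimal numpy sketch of these two steps for a batch X of shape (N, D); eps is a small constant for numerical stability (an addition, not on the slide):

    import numpy as np

    def batchnorm_forward(X, eps=1e-5):
        mu = X.mean(axis=0)                     # 1. per-dimension empirical mean, shape (D,)
        var = X.var(axis=0)                     #    per-dimension empirical variance, shape (D,)
        X_hat = (X - mu) / np.sqrt(var + eps)   # 2. normalize each dimension
        return X_hat

    X = np.random.randn(64, 100) * 3.0 + 5.0
    X_hat = batchnorm_forward(X)
    print(X_hat.mean(axis=0)[:3], X_hat.std(axis=0)[:3])   # ~0 and ~1 per dimension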


Batch Normalization

[Ioffe and Szegedy, 2015]

... → FC → BN → tanh → FC → BN → tanh → ...

Usually inserted after Fully Connected (or Convolutional, as we’ll see soon) layers, and before the nonlinearity.
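The slides are framework-agnostic; purely to illustrate this placement, a hypothetical PyTorch stack (an assumption, not something the lecture prescribes) might look like:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 500),     # FC
        nn.BatchNorm1d(500),     # BN right after the fully connected layer...
        nn.Tanh(),               # ...and before the nonlinearity
        nn.Linear(500, 500),
        nn.BatchNorm1d(500),
        nn.Tanh(),
        nn.Linear(500, 10),
    )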


Batch Normalization

[Ioffe and Szegedy, 2015]

Normalize:

    x_hat = (x - E[x]) / sqrt(Var[x])

And then allow the network to squash the range if it wants to:

    y = gamma * x_hat + beta

Note: the network can learn gamma = sqrt(Var[x]) and beta = E[x] to recover the identity mapping.
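Continuing the numpy sketch from above, the learnable scale and shift, and the identity-recovery setting, look roughly like this:

    import numpy as np

    X = np.random.randn(64, 100) * 3.0 + 5.0
    mu, var = X.mean(axis=0), X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + 1e-5)

    # Learnable per-dimension scale (gamma) and shift (beta): y = gamma * x_hat + beta.
    # Setting gamma = sqrt(Var[x]) and beta = E[x] recovers the identity mapping:
    gamma, beta = np.sqrt(var + 1e-5), mu
    y = gamma * X_hat + beta
    print(np.allclose(y, X))   # True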


Batch Normalization

[Ioffe and Szegedy, 2015]

  • Improves gradient flow through the network
  • Allows higher learning rates
  • Reduces the strong dependence on initialization
  • Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe


Batch Normalization

[Ioffe and Szegedy, 2015]

Note: at test time the BatchNorm layer functions differently:

The mean/std are not computed from the batch at hand. Instead, a single fixed empirical mean and std of the activations, estimated during training, are used.

(e.g. can be estimated during training with running averages)
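A sketch of the running-average bookkeeping (the momentum value and class structure are assumptions for illustration):

    import numpy as np

    class BatchNorm1D:
        def __init__(self, D, eps=1e-5, momentum=0.9):
            self.eps, self.momentum = eps, momentum
            self.running_mean = np.zeros(D)
            self.running_var = np.ones(D)

        def forward(self, X, train=True):
            if train:
                mu, var = X.mean(axis=0), X.var(axis=0)
                # update exponential running averages for use at test time
                m = self.momentum
                self.running_mean = m * self.running_mean + (1 - m) * mu
                self.running_var = m * self.running_var + (1 - m) * var
            else:
                # test time: use the fixed statistics estimated during training
                mu, var = self.running_mean, self.running_var
            return (X - mu) / np.sqrt(var + self.eps)

    bn = BatchNorm1D(D=100)
    for _ in range(50):
        bn.forward(np.random.randn(64, 100) * 3.0 + 5.0, train=True)     # training steps
    out = bn.forward(np.random.randn(64, 100) * 3.0 + 5.0, train=False)  # test-time use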
