AI@MIT Workshop Series
Presentation based on Nikhil Murthy’s Coursera course “An Introduction to Practical Deep Learning”
Workshop 2:
Optimization & PyTorch Abstractions
About AI@MIT
Reading group (Wednesday 5-6 PM)
Workshops (Mondays 7-9 PM, biweekly)
Generator, Labs
Talks & panels, Compute Cluster and much more!
About AI@MIT
Attendance: tinyurl.com/aim-workshop-2-signin
Workshop Schedule
9/20 | Intro to Deep Learning and PyTorch |
Today | Optimization & PyTorch Abstractions |
10/18 | Convolutional Neural Networks (CNNs) |
11/1 | Recurrent Neural Networks (RNNs) |
11/15 | TBD |
11/29 | TBD |
Today’s Schedule
Types of Networks
MLP (Multilayer Perceptron)
CNN
(Convolutional Neural Networks)
RNN
(Recurrent Neural Networks)
Sources: http://bit.ly/2GHV0uS, http://bit.ly/2G3ynDk, http://bit.ly/2GJG13N
f(x(i)) = y(i)
Data-Driven Learning
Linear Regression is just a 1-Layer Neural Network
For now:
σ(x) = x
θᵢ is the i-th parameter
Tensorflow Playground
Practical Example: MNIST
Sources: http://bit.ly/2IDy8x9
MNIST Dataset (70,000 28 by 28 pixel images)
Classify images into digits 0 - 9
f(x(i)) = y(i)
Practical Example: MNIST
Sources: “An Introduction to Practical Deep Learning” Coursera Course
Practical Example: MNIST
How many parameters?
104,938!
W0: 784 x 128
b0: 128
W1: 128 x 32
b1: 32
W2: 32 x 10
b2: 10
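As a sanity check, the count can be reproduced with a few lines of Python (layer shapes taken from the slide):

```python
# Parameter count for a 784 -> 128 -> 32 -> 10 MLP:
# each layer contributes (inputs x outputs) weights plus one bias per output.
layers = [(784, 128), (128, 32), (32, 10)]
total = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(total)  # 104938
```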
Sources: “An Introduction to Practical Deep Learning” Coursera Course
Practical Example: MNIST
Training Procedure
Initialize weights
Fetch a batch of data
Forward-pass
Cost
Backward-pass
Update weights
Sources: “An Introduction to Practical Deep Learning” Coursera Course
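The six steps above can be sketched in PyTorch. The random batch below is a stand-in for a real MNIST batch, and the layer sizes follow the earlier slide:

```python
import torch
import torch.nn as nn

# 1. Initialize weights (done by the layer constructors)
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),
    nn.Linear(32, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# 2. Fetch a batch of data (random stand-in for real MNIST images/labels)
x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))

# 3. Forward-pass
logits = model(x)
# 4. Cost
loss = loss_fn(logits, y)
# 5. Backward-pass
optimizer.zero_grad()
loss.backward()
# 6. Update weights
optimizer.step()
```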
Practical Example: MNIST
Sources: “An Introduction to Practical Deep Learning” Coursera Course
Inference Procedure
Fetch data
Forward-pass
Unit, Artificial Neuron, Cell
[Diagram: inputs a₁ⁱ, a₂ⁱ, a₃ⁱ are multiplied by weights w₁ⁱ, w₂ⁱ, w₃ⁱ and summed with bias bⁱ to produce zⁱ⁺¹, which becomes the next activation aⁱ⁺¹]
Activations
[Diagram: the same unit with an explicit activation function g: zⁱ⁺¹ = w₁ⁱa₁ⁱ + w₂ⁱa₂ⁱ + w₃ⁱa₃ⁱ + bⁱ, and aⁱ⁺¹ = g(zⁱ⁺¹)]
Activations
Linear: g(x) = x
Binary step: g(x) = 0 (for x < 0), 1 (otherwise)
Logistic: g(x) = 1/(1 + e⁻ˣ)
Sources: http://bit.ly/2fE7id7
Activations
Linear
Binary step
Logistic
Tanh: g(x) = tanh(x)
ReLU: g(x) = max(0, x)
Softmax
Sources: http://bit.ly/2fE7id7
Activations
ReLU: g(x) = max(0, x)
Softmax
Tanh: g(x) = tanh(x)
SELU: g(x) = λx (for x > 0), λα(eˣ − 1) (otherwise)
with λ = 1.0507 and α = 1.67326
Leaky ReLU: g(x) = max(αx, x) for a small slope α (commonly 0.01)
Sources: http://bit.ly/2fE7id7
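A plain-Python sketch of three of these activations (the 0.01 Leaky ReLU slope is a common default rather than fixed by the slide; λ and α are the SELU constants above):

```python
import math

def relu(x):
    # ReLU: pass positives through, clamp negatives to zero
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Leaky ReLU: small non-zero slope for negative inputs
    return x if x > 0 else slope * x

def selu(x, lam=1.0507, alpha=1.67326):
    # SELU: scaled linear for positives, scaled exponential for negatives
    return lam * x if x > 0 else lam * alpha * (math.exp(x) - 1)

print(relu(-2.0), leaky_relu(-2.0), round(selu(1.0), 4))  # 0.0 -0.02 1.0507
```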
Initializations
a1i
a2i
a3i
zi+1→ ai+1
w1i
w2i
w3i
bi
Initializations
Scheme | Distribution | Paired activation |
Gaussian | Gaussian(mean, std) | — |
Glorot/Xavier Uniform | Uniform(-k, k) | Logistic |
Kaiming | Gaussian(0, σ²) | ReLU |
Sources: http://bit.ly/2vTlmaJ
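In PyTorch these schemes are available under `torch.nn.init`; a short sketch (the layer shape is just an example):

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 128)
# Glorot/Xavier uniform: Uniform(-k, k), suited to logistic/tanh activations
nn.init.xavier_uniform_(layer.weight)
# Kaiming: Gaussian(0, sigma^2) with variance scaled for ReLU layers
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
# Biases are often simply zeroed
nn.init.zeros_(layer.bias)
```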
Initializations
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
Costs
Cross Entropy Loss
Misclassification Rate
L2 Loss - Mean Squared Error
L1 Loss - Mean Absolute Error
[Diagram: the network's prediction ŷ is compared with the label y to compute the cost C]
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
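PyTorch ships these costs as loss modules; a small sketch with made-up scores (the specific numbers are illustrative only):

```python
import torch
import torch.nn as nn

pred = torch.tensor([[2.0, 0.5, 0.1]])   # raw scores (logits) for 3 classes
target = torch.tensor([0])               # true class index

# Cross entropy (expects raw logits; applies softmax internally)
ce = nn.CrossEntropyLoss()(pred, target)
# L2 loss / mean squared error against a zero target
mse = nn.MSELoss()(pred, torch.zeros_like(pred))
# L1 loss / mean absolute error against a zero target
mae = nn.L1Loss()(pred, torch.zeros_like(pred))
```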
Cross Entropy Loss
ŷ = [0.0, 0.1, 0.0, 0.2, 0.1, 0.1, 0.0, 0.4, 0.1, 0.0]
y = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
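Plugging the example above into the cross-entropy formula: only the true class (predicted probability 0.1) contributes, so the loss is −log(0.1):

```python
import math

y_hat = [0.0, 0.1, 0.0, 0.2, 0.1, 0.1, 0.0, 0.4, 0.1, 0.0]
y     = [0,   0,   0,   0,   1,   0,   0,   0,   0,   0]

# Cross entropy: -sum over classes of y * log(y_hat);
# only the true class (y = 1) contributes to the sum.
ce = -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)
print(round(ce, 4))  # 2.3026
```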
Misclassification Rate vs. Cross Entropy
Why use one over the other?
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Cross entropy makes updates proportional to the error
i.e., even when you misclassify, the penalty is smaller if your predicted probability was close to the right answer (unlike misclassification rate, which gives no gradient signal)
Optimizers
Gradient descent
Stochastic Gradient Descent (SGD) with Momentum
RMS Propagation
Adagrad
Others: Adadelta, Adam, etc.
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
What is the x-axis? What about the y-axis?
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
J(w(0)) = sum of costs using w(0) for all training examples
w(0)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
Where does the gradient of J(w(0)) with respect to w point to?
w(0)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
Where does the negative gradient of J(w(0)) with respect to w point to?
w(0)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
Let’s take a step in that direction!
w(0)
w(1)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
How big of a step? Let’s add α, the learning rate
w(0)
w(1)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
w(1) = w(0) - α dJ(w(0))/dw
w(0)
w(1)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
w(2) = w(1) - α dJ(w(1))/dw
w(0)
w(1)
w(2)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
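The update rule above can be sketched on a toy cost J(w) = w², whose gradient is dJ/dw = 2w:

```python
# Gradient descent on J(w) = w^2, mirroring w(t+1) = w(t) - alpha * dJ(w(t))/dw
alpha = 0.1   # learning rate
w = 5.0       # initial weight w(0)
for _ in range(100):
    grad = 2 * w          # dJ/dw
    w = w - alpha * grad  # step in the direction of the negative gradient
print(w)  # close to the minimum at w = 0
```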
Gradient Descent Issues
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Each step is computationally expensive (the full training set is used per update)
Saddle points
Sharp minima
Stochastic Gradient Descent
Take a step using a batch of data points
Why might this be better?
Sources: http://bit.ly/2tZrmP7
Stochastic Gradient Descent
Faster: each step uses only a batch of data
Saddle points: noisy updates help escape them
Sharp minima: the noise explores more of the loss surface
Sources: http://bit.ly/2tZrmP7
Stochastic Gradient Descent
Sources: http://bit.ly/2tZrmP7
Momentum
Intuition: Physics (ball rolling down hill)
Sources: http://bit.ly/2tZrmP7
Momentum
Sources: http://bit.ly/2tZrmP7
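The momentum update keeps a velocity that accumulates past gradients; a sketch on a toy cost J(w) = w² (β = 0.9 is a common default, not from the slide):

```python
# SGD with momentum on J(w) = w^2: v accumulates past gradients,
# so the "ball" keeps rolling through flat regions and small bumps.
alpha, beta = 0.1, 0.9
w, v = 5.0, 0.0
for _ in range(300):
    grad = 2 * w
    v = beta * v - alpha * grad  # velocity update
    w = w + v                    # move by the velocity
```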
Adagrad
Normalize the learning rate by the accumulated gradient history (large steps for parameters whose gradients have been small)
Sources: http://bit.ly/2tZrmP7
Adagrad
Sources: http://bit.ly/2tZrmP7
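A sketch of the Adagrad update on a toy cost J(w) = w², accumulating squared gradients in G (hyperparameter values are illustrative):

```python
# Adagrad on J(w) = w^2: the effective step alpha / sqrt(G) shrinks
# as squared gradients accumulate in G.
alpha, eps = 0.5, 1e-8
w, G = 5.0, 0.0
for _ in range(500):
    grad = 2 * w
    G += grad ** 2                        # running sum of squared gradients
    w -= alpha * grad / (G ** 0.5 + eps)  # normalized step
```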
RMS Propagation
Sources: http://bit.ly/2tZrmP7
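RMS Propagation replaces Adagrad's running sum with an exponentially decaying average, so old gradients fade; a sketch on a toy cost J(w) = w² (ρ = 0.9 is a common default):

```python
# RMSProp on J(w) = w^2: G is a decaying average of squared gradients,
# so the step size adapts without shrinking toward zero forever.
alpha, rho, eps = 0.01, 0.9, 1e-8
w, G = 5.0, 0.0
for _ in range(2000):
    grad = 2 * w
    G = rho * G + (1 - rho) * grad ** 2   # decaying average
    w -= alpha * grad / (G ** 0.5 + eps)  # normalized step
```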
Back Propagation
Sources: From “An Introduction to Practical Deep Learning” Coursera course
Looking Back: MNIST
Training Procedure
Initialize weights
Fetch a batch of data
Forward-pass
Cost
Backward-pass
Update weights
Sources: “An Introduction to Practical Deep Learning” Coursera Course
PyTorch!
Let’s go through another exercise on PyTorch
https://tinyurl.com/aim-workshop-2-lab