AI@MIT Workshop Series
Presentation based on Nikhil Murthy’s Coursera course “An Introduction to Practical Deep Learning”
Workshop 2:
Optimization & PyTorch Abstractions
About AI@MIT
Reading group (Wednesday 5-6 PM)
Workshops (Mondays 7-9 PM, biweekly)
Generator, Labs
Talks & panels, Compute Cluster and much more!
About AI@MIT
Attendance: tinyurl.com/aim-workshop-2-signin
Workshop Schedule
9/20 | Intro to Deep Learning and PyTorch |
Today | Optimization & PyTorch Abstractions |
10/18 | Convolutional Neural Networks (CNNs) |
11/1 | Recurrent Neural Networks (RNNs) |
11/15 | TBD |
11/29 | TBD |
Today’s Schedule
Types of Networks
MLP (Multilayer Perceptron)
CNN
(Convolutional Neural Networks)
RNN
(Recurrent Neural Networks)
Sources: http://bit.ly/2GHV0uS, http://bit.ly/2G3ynDk, http://bit.ly/2GJG13N
f(x(i)) = y(i)
Data-Driven Learning
Linear Regression is just a 1-Layer Neural Network
For now:
σ(x) = x
θᵢ is the i-th parameter
Tensorflow Playground
Practical Example: MNIST
Sources: http://bit.ly/2IDy8x9
MNIST Dataset (70,000 28 by 28 pixel images)
Classify images into digits 0 - 9
f(x(i)) = y(i)
Practical Example: MNIST
Sources: “An Introduction to Practical Deep Learning” Coursera Course
Practical Example: MNIST
How many parameters?
104,938!
W0: 784 x 128
b0: 128
W1: 128 x 32
b1: 32
W2: 32 x 10
b2: 10
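As a sanity check, the count can be reproduced with a few lines of Python (layer shapes taken from the slide):

```python
# Parameter count for a 784 -> 128 -> 32 -> 10 MLP:
# each layer contributes (inputs x outputs) weights plus one bias per output.
layers = [(784, 128), (128, 32), (32, 10)]
total = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(total)  # 104938
```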
Sources: “An Introduction to Practical Deep Learning” Coursera Course
Practical Example: MNIST
Training Procedure
Initialize weights
Fetch a batch of data
Forward-pass
Cost
Backward-pass
Update weights
Sources: “An Introduction to Practical Deep Learning” Coursera Course
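The six steps above can be sketched in PyTorch. The random batch below is a stand-in for a real MNIST batch, and the layer sizes follow the earlier slide:

```python
import torch
import torch.nn as nn

# 1. Initialize weights (done by the layer constructors)
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),
    nn.Linear(32, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# 2. Fetch a batch of data (random stand-in for real MNIST images/labels)
x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))

# 3. Forward-pass
logits = model(x)
# 4. Cost
loss = loss_fn(logits, y)
# 5. Backward-pass
optimizer.zero_grad()
loss.backward()
# 6. Update weights
optimizer.step()
```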
Practical Example: MNIST
Sources: “An Introduction to Practical Deep Learning” Coursera Course
Inference Procedure
Fetch data
Forward-pass
Unit, Artificial Neuron, Cell
[Diagram: inputs a₁ⁱ, a₂ⁱ, a₃ⁱ are multiplied by weights w₁ⁱ, w₂ⁱ, w₃ⁱ and summed with bias bⁱ to produce zⁱ⁺¹, which becomes the next activation aⁱ⁺¹]
Activations
[Diagram: the same unit with an explicit activation function g: zⁱ⁺¹ = w₁ⁱa₁ⁱ + w₂ⁱa₂ⁱ + w₃ⁱa₃ⁱ + bⁱ, and aⁱ⁺¹ = g(zⁱ⁺¹)]
Activations
Linear: g(x) = x
Binary step: g(x) = 0 (for x < 0), 1 (otherwise)
Logistic: g(x) = 1/(1 + e⁻ˣ)
Sources: http://bit.ly/2fE7id7
Activations
Linear
Binary step
Logistic
Tanh: g(x) = tanh(x)
ReLU: g(x) = max(0, x)
Softmax
Sources: http://bit.ly/2fE7id7
Activations
ReLU: g(x) = max(0, x)
Softmax
Tanh: g(x) = tanh(x)
SELU: g(x) = λx (for x > 0), λα(eˣ − 1) (otherwise)
with λ = 1.0507 and α = 1.67326
Leaky ReLU: g(x) = max(αx, x) for a small slope α (commonly 0.01)
Sources: http://bit.ly/2fE7id7
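A plain-Python sketch of three of these activations (the 0.01 Leaky ReLU slope is a common default rather than fixed by the slide; λ and α are the SELU constants above):

```python
import math

def relu(x):
    # ReLU: pass positives through, clamp negatives to zero
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Leaky ReLU: small non-zero slope for negative inputs
    return x if x > 0 else slope * x

def selu(x, lam=1.0507, alpha=1.67326):
    # SELU: scaled linear for positives, scaled exponential for negatives
    return lam * x if x > 0 else lam * alpha * (math.exp(x) - 1)

print(relu(-2.0), leaky_relu(-2.0), round(selu(1.0), 4))  # 0.0 -0.02 1.0507
```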
Initializations
a1i
a2i
a3i
zi+1→ ai+1
w1i
w2i
w3i
bi
Initializations
Scheme | Distribution | Paired activation |
Gaussian | Gaussian(mean, std) | — |
Glorot/Xavier Uniform | Uniform(-k, k) | Logistic |
Kaiming | Gaussian(0, σ²) | ReLU |
Sources: http://bit.ly/2vTlmaJ
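In PyTorch these schemes are available under `torch.nn.init`; a short sketch (the layer shape is just an example):

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 128)
# Glorot/Xavier uniform: Uniform(-k, k), suited to logistic/tanh activations
nn.init.xavier_uniform_(layer.weight)
# Kaiming: Gaussian(0, sigma^2) with variance scaled for ReLU layers
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
# Biases are often simply zeroed
nn.init.zeros_(layer.bias)
```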
Initializations
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
Costs
Cross Entropy Loss
Misclassification Rate
L2 Loss - Mean Squared Error
L1 Loss - Mean Absolute Error
[Diagram: the network's prediction ŷ is compared with the label y to compute the cost C]
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
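PyTorch ships these costs as loss modules; a small sketch with made-up scores (the specific numbers are illustrative only):

```python
import torch
import torch.nn as nn

pred = torch.tensor([[2.0, 0.5, 0.1]])   # raw scores (logits) for 3 classes
target = torch.tensor([0])               # true class index

# Cross entropy (expects raw logits; applies softmax internally)
ce = nn.CrossEntropyLoss()(pred, target)
# L2 loss / mean squared error against a zero target
mse = nn.MSELoss()(pred, torch.zeros_like(pred))
# L1 loss / mean absolute error against a zero target
mae = nn.L1Loss()(pred, torch.zeros_like(pred))
```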
Cross Entropy Loss
ŷ = [0.0, 0.1, 0.0, 0.2, 0.1, 0.1, 0.0, 0.4, 0.1, 0.0]
y = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
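Plugging the example above into the cross-entropy formula: only the true class (predicted probability 0.1) contributes, so the loss is −log(0.1):

```python
import math

y_hat = [0.0, 0.1, 0.0, 0.2, 0.1, 0.1, 0.0, 0.4, 0.1, 0.0]
y     = [0,   0,   0,   0,   1,   0,   0,   0,   0,   0]

# Cross entropy: -sum over classes of y * log(y_hat);
# only the true class (y = 1) contributes to the sum.
ce = -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)
print(round(ce, 4))  # 2.3026
```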
Misclassification Rate vs. Cross Entropy
Why use one over the other?
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Cross entropy makes updates proportional to the error
i.e., even when you misclassify, the penalty is smaller if your predicted probability was close to the right answer (unlike misclassification rate, which gives no gradient signal)
Optimizers
Gradient descent
Stochastic Gradient Descent (SGD) with Momentum
RMS Propagation
Adagrad
Others: Adadelta, Adam, etc.
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
What is the x-axis? What about the y-axis?
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
J(w(0)) = sum of costs using w(0) for all training examples
w(0)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
Where does the gradient of J(w(0)) with respect to w point to?
w(0)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
Where does the negative gradient of J(w(0)) with respect to w point to?
w(0)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
Let’s take a step in that direction!
w(0)
w(1)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
How big of a step? Let’s add α, the learning rate
w(0)
w(1)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
w(1) = w(0) - α dJ(w(0))/dw
w(0)
w(1)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Gradient Descent
w(2) = w(1) - α dJ(w(1))/dw
w(0)
w(1)
w(2)
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
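The update rule above can be sketched on a toy cost J(w) = w², whose gradient is dJ/dw = 2w:

```python
# Gradient descent on J(w) = w^2, mirroring w(t+1) = w(t) - alpha * dJ(w(t))/dw
alpha = 0.1   # learning rate
w = 5.0       # initial weight w(0)
for _ in range(100):
    grad = 2 * w          # dJ/dw
    w = w - alpha * grad  # step in the direction of the negative gradient
print(w)  # close to the minimum at w = 0
```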
Gradient Descent Issues
Sources: Adapted from “An Introduction to Practical Deep Learning” Coursera course
Each step is computationally expensive (the full training set is used per update)
Saddle points
Sharp minima
Stochastic Gradient Descent
Take a step using a batch of data points
Why might this be better?
Sources: http://bit.ly/2tZrmP7
Stochastic Gradient Descent
Faster: each step uses only a batch of data
Saddle points: noisy updates help escape them
Sharp minima: the noise explores more of the loss surface
Sources: http://bit.ly/2tZrmP7
Stochastic Gradient Descent
Sources: http://bit.ly/2tZrmP7
Momentum
Intuition: Physics (ball rolling down hill)
Sources: http://bit.ly/2tZrmP7
Momentum
Sources: http://bit.ly/2tZrmP7
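The momentum update keeps a velocity that accumulates past gradients; a sketch on a toy cost J(w) = w² (β = 0.9 is a common default, not from the slide):

```python
# SGD with momentum on J(w) = w^2: v accumulates past gradients,
# so the "ball" keeps rolling through flat regions and small bumps.
alpha, beta = 0.1, 0.9
w, v = 5.0, 0.0
for _ in range(300):
    grad = 2 * w
    v = beta * v - alpha * grad  # velocity update
    w = w + v                    # move by the velocity
```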
Adagrad
Normalize the learning rate by the accumulated gradient history (large steps for parameters whose gradients have been small)
Sources: http://bit.ly/2tZrmP7
Adagrad
Sources: http://bit.ly/2tZrmP7
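A sketch of the Adagrad update on a toy cost J(w) = w², accumulating squared gradients in G (hyperparameter values are illustrative):

```python
# Adagrad on J(w) = w^2: the effective step alpha / sqrt(G) shrinks
# as squared gradients accumulate in G.
alpha, eps = 0.5, 1e-8
w, G = 5.0, 0.0
for _ in range(500):
    grad = 2 * w
    G += grad ** 2                        # running sum of squared gradients
    w -= alpha * grad / (G ** 0.5 + eps)  # normalized step
```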
RMS Propagation
Sources: http://bit.ly/2tZrmP7
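RMS Propagation replaces Adagrad's running sum with an exponentially decaying average, so old gradients fade; a sketch on a toy cost J(w) = w² (ρ = 0.9 is a common default):

```python
# RMSProp on J(w) = w^2: G is a decaying average of squared gradients,
# so the step size adapts without shrinking toward zero forever.
alpha, rho, eps = 0.01, 0.9, 1e-8
w, G = 5.0, 0.0
for _ in range(2000):
    grad = 2 * w
    G = rho * G + (1 - rho) * grad ** 2   # decaying average
    w -= alpha * grad / (G ** 0.5 + eps)  # normalized step
```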
Back Propagation
Sources: From “An Introduction to Practical Deep Learning” Coursera course
Looking Back: MNIST
Training Procedure
Initialize weights
Fetch a batch of data
Forward-pass
Cost
Backward-pass
Update weights
Sources: “An Introduction to Practical Deep Learning” Coursera Course
PyTorch!
Let’s go through another exercise on PyTorch
https://tinyurl.com/aim-workshop-2-lab