Deep Learning and Convolutional Neural Nets
Matt Boutell
boutell@rose-hulman.edu
Background: we are detecting sunsets using the classical image recognition paradigm
Human-engineered feature extraction
Grid-based Color moments
384x256x3
7x7x6 = 294
Classifier (1-3 layers)
Support vector machine
Class [-1, 1]
feature vector
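To make the classical pipeline concrete, here is a minimal MATLAB sketch of grid-based color moments, assuming a 7x7 grid and two moments (mean, standard deviation) per RGB channel, which gives the 7x7x6 = 294 features above; the file name is a placeholder:

img = im2double(imread('sunset.jpg'));        % placeholder file, e.g. a 384x256x3 image
[rows, cols, ~] = size(img);
feat = [];
for r = 1:7
    for c = 1:7
        rIdx = round((r-1)*rows/7)+1 : round(r*rows/7);
        cIdx = round((c-1)*cols/7)+1 : round(c*cols/7);
        block = reshape(img(rIdx, cIdx, :), [], 3);       % pixels in this grid cell
        feat = [feat, mean(block, 1), std(block, 0, 1)];  % 6 moments per cell
    end
end
% feat is the 1x294 feature vector handed to the classifier (SVM or neural net)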
Reminder: Basic neural network architecture
“Shallow” net
Width
Depth
Each neuron pi = f(x)
Or a layer is p = f(x)
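As a small illustrative sketch (made-up numbers), a fully-connected layer with 4 neurons computes p = f(W*x + b) for an input feature vector x:

x = [0.2; 0.7; 0.1];               % feature vector with 3 inputs (made up)
W = randn(4, 3);                   % one row of weights per neuron
b = randn(4, 1);                   % one bias per neuron
f = @(z) 1 ./ (1 + exp(-z));       % sigmoid transfer function
p = f(W*x + b);                    % layer output: one value p_i per neuron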
Background: We could swap out the SVM for a traditional (shallow) neural network
Human-engineered feature extraction
Grid-based Color moments
384x256x3
7x7x6 = 294
Classifier (1-3 layers)
Neural net
Class (0-1)
feature vector
Deep learning is a vague term
“Deep” networks typically have 10+ layers.
For example, 25, 144, or 177 (we’ll use some of these!)
That’s many weights to learn.
And more choices of architectures.
Should layers be fully connected?
How to train them fast enough?
Deep learning is a new paradigm in machine learning
Deep networks learn both which features to use and how to classify them.
There are millions of parameters
Q1
Convolutional neural net layers come in several types
Convolution, ReLU, Pooling, fully-connected, softmax
Image classification network layers come in several types
Convolution of filters with input.
Many pics in this presentation are from AJ Piergiovanni, CSSE463 Guest Lecture https://docs.google.com/presentation/d/15Lm6_LTtWnWp1HRPQ6loI3vN55EKNOUi8hOSUypsFw8/
Q2a
Each filter is a small set of weights (e.g., 3x3) that must be learned, and we usually have 10-100 filters per layer. Note that each filter is applied across the whole image, so the layer is not fully connected: connections are local and sparse, with shared weights.
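To see what "convolution of filters with input" means in code, here is a minimal MATLAB sketch; the hand-picked 3x3 filter is only for illustration, since in a CNN these weights are learned:

img = im2double(imread('cameraman.tif'));   % a standard grayscale test image (assumes the Image Processing Toolbox demo images)
filt = [-1 0 1; -2 0 2; -1 0 1];            % one 3x3 filter (a vertical-edge detector here)
response = conv2(img, filt, 'same');        % slide the filter across the whole image
% A convolutional layer learns 10-100 such filters and stacks their response maps.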
Convolutional layers learn familiar features
The first-layer filters learn edges and opponent colors; higher-level filters learn more complex features.
Example Filters
Kunihiko Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”
Image classification network layers come in several types
Without a nonlinearity between them, consecutive layers collapse into a single linear layer.
ReLU (rectified linear unit, “rectifier” in the figure) is one of the simplest non-linear transfer functions.
This is transfer function #3:
Simple (fast). What is its derivative?
Can you re-write it using max? g(x) = max(______, ______)
Q2b
Image classification network layers come in several types
Because we learn multiple filters at each level, the dimensionality would continue to increase. The solution is to pool data at each layer and downsample.
Types: max-pooling, average-pooling.
Example of max-pooling.
Q2c
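Here is a tiny made-up example of 2x2 max-pooling with stride 2 in MATLAB, which downsamples a 4x4 activation map to 2x2:

A = [1 3 2 4; 5 6 7 8; 3 2 1 0; 1 2 3 4];   % a made-up 4x4 activation map
P = zeros(2, 2);
for r = 1:2
    for c = 1:2
        block = A(2*r-1:2*r, 2*c-1:2*c);    % one non-overlapping 2x2 block
        P(r, c) = max(block(:));            % keep only the largest activation
    end
end
% P = [6 8; 3 4]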
Putting it all together
Convolution, ReLU, Pooling
Q2d,e,Q3
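In MATLAB's Deep Learning Toolbox, such a stack might be written as the layer array below; the input size and filter counts are made-up choices, not the ones used in lab:

layers = [
    imageInputLayer([64 64 3])            % small RGB input (made-up size)
    convolution2dLayer(3, 16)             % 16 learned 3x3 filters
    reluLayer                             % nonlinearity
    maxPooling2dLayer(2, 'Stride', 2)     % downsample by 2
    fullyConnectedLayer(2)                % two classes, e.g. sunset vs. non-sunset
    softmaxLayer
    classificationLayer];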
Deep learning is an old idea that is now practical
In 2012, a deep network was used to win the ImageNet Large Scale Visual Recognition Challenge (14M annotated images), bringing the top-5 error rate down from the previous 26.1% to 15.3%.
Deep networks keep winning and improving each year.
Why?
Faster hardware (GPUs)
Access to more training data
Algorithmic advances
Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
Q4
Gradient Descent
When the goal is to find the min or max of a function
In calculus, you’d solve f’(x) = 0
What if f’(x) can’t be solved for x?
Next slides from AJ Piergiovanni, CSSE463 Guest Lecture https://docs.google.com/presentation/d/15Lm6_LTtWnWp1HRPQ6loI3vN55EKNOUi8hOSUypsFw8/
Q5
Gradient Descent (of error = f(weight vector))
Gradient Descent - Local Min
Depending on where it starts, gradient descent can converge to a local minimum of the error rather than the global minimum.
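A minimal sketch of the idea in MATLAB, minimizing a made-up 1-D function f(w) = (w - 3)^2 whose derivative we can write down:

fprime = @(w) 2*(w - 3);        % derivative of f(w) = (w - 3)^2
w = 0;                          % starting guess (made up)
lr = 0.1;                       % learning rate (step size)
for iter = 1:100
    w = w - lr * fprime(w);     % step downhill along the negative gradient
end
% w has converged toward the minimizer w = 3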
Stochastic Gradient Descent
Gradient descent uses all of the training data to compute each update step.
Stochastic gradient descent divides the training data into mini-batches (much smaller than the whole data set) and updates the weights after each mini-batch.
So 1 epoch (1 pass through the data set) is made of many mini-batch updates (see the sketch below).
Q6
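A hedged sketch of one epoch of mini-batch SGD in MATLAB, using a made-up least-squares problem in place of a real network so the gradient is easy to write:

X = randn(1000, 10);  wTrue = ones(10, 1);
y = X * wTrue + 0.1 * randn(1000, 1);            % made-up data set of 1000 examples
w = zeros(10, 1);  lr = 0.01;  batchSize = 32;
order = randperm(size(X, 1));                    % shuffle the training set for this epoch
for start = 1:batchSize:size(X, 1)
    idx = order(start : min(start + batchSize - 1, size(X, 1)));
    Xb = X(idx, :);  yb = y(idx);                % one mini-batch
    grad = (2 / numel(idx)) * Xb' * (Xb*w - yb); % gradient of mean squared error on the batch
    w = w - lr * grad;                           % one weight update per mini-batch
end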
Training a neural network
Inputs: labeled training images, the network architecture (layers), and the training options:
options = trainingOptions('sgdm',...
'MiniBatchSize',32,...
'MaxEpochs',4,...
'InitialLearnRate',1e-4,...
'VerboseFrequency',1,...
'Plots','training-progress',...
'ValidationData',validateImages,...
'ValidationFrequency',numIterationsPerEpoch);
Output: a trained network (with learned weights)
Q7
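For completeness, the training call itself looks like the sketch below; trainImages, layers, and testImages are placeholders for your image datastore, layer array, and test set:

net = trainNetwork(trainImages, layers, options);   % returns the network with learned weights
predictedLabels = classify(net, testImages);        % apply the trained network to new images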
The most important hyper-parameter is how long to train!
...the options that include hyper-parameters:
options = trainingOptions('sgdm',...
'MiniBatchSize',32,...
'MaxEpochs',4,...
'InitialLearnRate',1e-4,...
'VerboseFrequency',1,...
'Plots','training-progress',...
'ValidationData',validateImages,...
'ValidationFrequency',numIterationsPerEpoch, ...
'ValidationPatience',3);
MATLAB docs
Q8
Which curve is the training set error?
Which is the validation set error?
Can you tell where it starts to overfit?
The curves aren't so smooth in practice, hence the term patience: the number of epochs for which the validation error may exceed the minimum seen so far (not necessarily consecutively) before training stops, returning the network that gave the minimum validation error.
Limitations
Deep learning is a black box - the learned weights are often not intuitive
They require LOTS of training data.
You need many, many (millions of) images to get good accuracy when training from scratch.
Overcoming limitations: transfer learning
Some researchers have released their trained networks.
AlexNet, GoogLeNet, ResNet, or VGG-19.
Why would we use them? Fewer training images needed, faster training, better accuracy.
These options are the basis for the next lab and the sunset detector part 2
Q9-11
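As a preview, here is a hedged sketch of the standard MATLAB transfer-learning pattern: load a pretrained network, keep its early layers, and replace the final layers with a new head for our two classes (trainImages and options are placeholders):

net = alexnet;                           % pretrained network (needs the AlexNet support package)
layersTransfer = net.Layers(1:end-3);    % keep everything except the last three layers
numClasses = 2;                          % assumed: sunset vs. non-sunset
layers = [
    layersTransfer
    fullyConnectedLayer(numClasses)      % new, randomly initialized classification head
    softmaxLayer
    classificationLayer];
netTransfer = trainNetwork(trainImages, layers, options);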
Questions?
Q12-13
Visualizing CNNs
This next section is the senior thesis work of AJ Piergiovanni, RHIT CS/MA (2015), who went on to study deep learning at Indiana University.
Alternative: new results presented in Stanford course.
AJ Piergiovanni, CSSE463 Guest Lecture https://docs.google.com/presentation/d/15Lm6_LTtWnWp1HRPQ6loI3vN55EKNOUi8hOSUypsFw8/
For reference
Deconvolutional neural network
[Figure (Zeiler & Fergus): a deconvolutional network attached to a convolutional network. The convnet path applies convolution (filtering), rectification, bias, and pooling; the deconvnet path reconstructs the layer below from the pooled maps above by removing the bias, unpooling (using recorded switches to invert max-pooling), and deconvolving.]
Zeiler, M.D. & Fergus, R. Visualizing and Understanding Convolutional Networks.
[Figure: in the CNN, the input is convolved with each filter, giving one output map per filter (two here); in the deconvolutional network, those maps are convolved with the transposed filters to give one reconstructed input.]
t-SNE
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE.
3-shape classification
Trained a CNN to classify images of rectangles, circles and triangles.
Reconstructed Inputs:
More deconvolution