1 of 49

Deep Learning and Convolutional Neural Nets

Matt Boutell

boutell@rose-hulman.edu

2 of 49

Background: we are detecting sunsets using the classical image recognition paradigm

  1. Build model (choose kernel, C)
  2. Train (quadratic programming optimization with Lagrange multipliers)
  3. Predict the class of a new vector by taking a weighted sum of functions of its distances to the support vectors

[Pipeline diagram] Input image (384x256x3) → human-engineered feature extraction (grid-based color moments, 7x7 grid x 6 moments = 294-dim feature vector) → classifier (support vector machine, 1-3 layers) → class in [-1, 1]
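
For reference, a minimal MATLAB sketch of the grid-based color-moment features, assuming the 6 moments per block are the mean and standard deviation of each of the 3 color channels (the filename is hypothetical):

img = im2double(imread('sunset.jpg'));          % hypothetical filename
[rows, cols, ~] = size(img);
rEdges = round(linspace(1, rows+1, 8));         % 7 blocks vertically
cEdges = round(linspace(1, cols+1, 8));         % 7 blocks horizontally
feat = [];
for i = 1:7
    for j = 1:7
        block = img(rEdges(i):rEdges(i+1)-1, cEdges(j):cEdges(j+1)-1, :);
        px = reshape(block, [], 3);             % pixels x 3 channels
        feat = [feat, mean(px, 1), std(px, 0, 1)];  % 6 moments per block
    end
end
% feat is 1 x 294 (7 x 7 blocks x 6 moments), ready to feed the SVM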

3 of 49

Reminder: Basic neural network architecture

[Diagram: a "shallow" net, with its width (neurons per layer) and depth (number of layers) labeled. Each neuron computes p_i = f(x); equivalently, a whole layer computes p = f(x).]
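
As a concrete sketch, one fully-connected layer in MATLAB; the weights W, bias b, and input x are assumed to be already defined, and the sigmoid is just one example transfer function:

f = @(z) 1 ./ (1 + exp(-z));   % e.g., a sigmoid transfer function
p = f(W * x + b);              % W: width x numInputs, x and b: column vectors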

4 of 49

Background: We could swap out the SVM for a traditional (shallow) neural network

  • Build model (1-2 fully-connected hidden network layers)
  • Train (backpropagation to minimize loss)
  • Predict the class of a new vector by extracting features and forward-propagating the features through the neural network

[Pipeline diagram] Input image (384x256x3) → human-engineered feature extraction (grid-based color moments, 7x7 grid x 6 moments = 294-dim feature vector) → classifier (neural net, 1-3 layers) → class in (0-1)

5 of 49

Deep learning is a vague term

“Deep” networks typically have 10+ layers.

For example, 25, 144, or 177 (we’ll use some of these!)

That’s many weights to learn.

And more choices of architectures.

Should layers be fully connected?

How to train them fast enough?

6 of 49

Deep learning is a new paradigm in machine learning

Deep networks learn both which features to use and how to classify them.

There are millions of parameters

Q1

7 of 49

Convolutional neural net layers come in several types

Convolution, ReLU, Pooling, fully-connected, softmax

8 of 49

Image classification network layers come in several types

Convolution of filters with input.

Many pics in this presentation are from AJ Piergiovanni, CSSE463 Guest Lecture https://docs.google.com/presentation/d/15Lm6_LTtWnWp1HRPQ6loI3vN55EKNOUi8hOSUypsFw8/

Q2a

9 of 49

Image classification network layers come in several types

Convolution of filters with input

10 of 49

Image classification network layers come in several types

Convolution of filters with input. A set of 3x3 weights must be learned for each filter, and we usually have 10-100 filters per level.

11 of 49

Image classification network layers come in several types

Convolution of filters with input: 3x3 weights x 10-100 filters per level. Note that each filter is applied across the whole image, so the layer is not fully connected; its connections are local and its weights are shared (sparse).
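
As an illustration (a hand-picked 3x3 edge filter standing in for learned weights, and a hypothetical filename), sliding one filter over a whole image in MATLAB:

gray = rgb2gray(im2double(imread('sunset.jpg')));   % hypothetical image
w = [-1 0 1; -2 0 2; -1 0 1];       % one 3x3 set of weights (a vertical-edge filter)
response = conv2(gray, w, 'same');  % the same 9 weights are reused at every pixel
% A conv layer learns many such filters; each produces its own response map.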

12 of 49

Convolutional layers learn familiar features

The first-layer filters learn edges and opponent colors.

Higher-level filters learn more complex features.

Example Filters

Kunihiko Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”

13 of 49

Image classification network layers come in several types

Without a nonlinearity, stacked layers would collapse into a single linear layer.

ReLU (rectified linear unit, “rectifier” in the figure) is one of the simplest non-linear transfer functions.

This is transfer function #3:

Simple (fast). What is its derivative?

Can you re-write it using max? g(x) = max(______, ______)

Q2b

14 of 49

Image classification network layers come in several types

Because we learn multiple filters at each level, the dimensionality would continue to increase. The solution is to pool data at each layer and downsample.

Types:

  1. Max-pooling
  2. Average-pooling
  3. Subsampling only

Example of max-pooling.
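
A toy numeric sketch of 2x2 max-pooling with stride 2 in MATLAB:

A = [1 3 2 1;
     4 6 5 2;
     7 2 1 0;
     3 8 4 9];
pooled = [max(max(A(1:2,1:2))), max(max(A(1:2,3:4)));
          max(max(A(3:4,1:2))), max(max(A(3:4,3:4)))];
% pooled = [6 5; 8 9]: half the width and height, keeping the strongest responses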

Q2c

15 of 49

Putting it all together

Convolution, ReLU, Pooling
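
A sketch chaining the three operations on a toy single-channel input (the filter weights are illustrative, not learned):

x = rand(8, 8);                          % toy single-channel input
w = [-1 0 1; -2 0 2; -1 0 1];            % one 3x3 filter
c = conv2(x, w, 'same');                 % 1) convolution
r = max(0, c);                           % 2) ReLU
p = zeros(4, 4);                         % 3) 2x2 max-pooling with stride 2
for i = 1:4
    for j = 1:4
        p(i, j) = max(max(r(2*i-1:2*i, 2*j-1:2*j)));
    end
end
% Real networks stack many such conv-ReLU-pool blocks, one response map per filter.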

Q2d,e,Q3

16 of 49

Deep learning is an old idea that is now practical

In 2012, a deep network was used to win the ImageNet Large Scale Visual Recognition Challenge (14M annotated images), bringing the top-5 error rate down from the previous 26.1% to 15.3%.

Deep networks keep winning and improving each year.

Why?

Faster hardware (GPUs)

Access to more training data

Algorithmic advances

www.deeplearningbook.org/

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.

Q4

17 of 49

Gradient Descent

When the goal is to find the min or max of a function

In calculus, you’d solve f’(x) = 0

What if f’(x) = 0 can’t be solved for x?

    • Start at some point x where f’(x) exists, and take a small step in the direction of the gradient (to find a max) or in the opposite direction (to find a min)
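
A minimal MATLAB sketch of the idea for a 1-D function (the function, starting point, and learning rate are only illustrative):

f      = @(x) (x - 3).^2 + 1;     % example function to minimize
fprime = @(x) 2*(x - 3);          % its derivative
x = 0;                            % start somewhere the derivative exists
lr = 0.1;                         % learning rate (step size)
for step = 1:100
    x = x - lr * fprime(x);       % step opposite the gradient to minimize
end
% x ends up near 3, the minimizer; for a network, x is the weight vector
% and fprime comes from backpropagation of the error.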

Next slides from AJ Piergiovanni, CSSE463 Guest Lecture https://docs.google.com/presentation/d/15Lm6_LTtWnWp1HRPQ6loI3vN55EKNOUi8hOSUypsFw8/

Q5

18 of 49

Gradient Descent (of error = f(weight vector))

19 of 49

Gradient Descent

20 of 49

Gradient Descent

21 of 49

Gradient Descent

22 of 49

Gradient Descent

23 of 49

Gradient Descent

24 of 49

Gradient Descent

25 of 49

Gradient Descent

26 of 49

Gradient Descent - Local Min

27 of 49

Gradient Descent - Local Min

28 of 49

Gradient Descent - Local Min

29 of 49

Putting it all together

Convolution, ReLU, Pooling

  • Build model (many layers)
  • Train (gradient descent on error)
  • Predict the class of a new vector by forward propagation through the network
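
A minimal MATLAB sketch of such a model for the two-class sunset problem (the layer sizes here are illustrative, not the ones required in the lab):

layers = [
    imageInputLayer([128 128 3])               % build model: many layers
    convolution2dLayer(3, 16, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 32, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(2)                     % 2 classes: sunset / non-sunset
    softmaxLayer
    classificationLayer];
% Train with gradient descent on the error, then classify new images
% by forward propagation through the trained network.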

Q5

30 of 49

Stochastic Gradient Descent

Gradient descent uses all the training data before finding the next location.

Stochastic gradient descent divides the training data into mini-batches, which are smaller than the whole data set

    • Trains faster
    • Often converges much faster
    • But may not reach as good a final location as full gradient descent

So 1 epoch (1 pass through the data set) is made of many mini-batches.
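
In pseudo-MATLAB the loop structure looks roughly like this (Xtrain, y, w, learnRate, and maxEpochs are assumed to exist; gradientOnBatch is a hypothetical stand-in for backpropagation on one mini-batch):

N = size(Xtrain, 1);                       % number of training examples
batchSize = 32;
for epoch = 1:maxEpochs
    order = randperm(N);                   % shuffle once per epoch
    for s = 1:batchSize:N
        idx = order(s : min(s+batchSize-1, N));         % one mini-batch
        g = gradientOnBatch(w, Xtrain(idx,:), y(idx));  % hypothetical backprop step
        w = w - learnRate * g;             % update after every mini-batch
    end
end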

Q6

31 of 49

Training a neural network

Inputs:

  1. the training set (set of images)
  2. the network architecture (an array of layers)
  3. the options that include hyper-parameters:

options = trainingOptions('sgdm', ...
    'MiniBatchSize', 32, ...
    'MaxEpochs', 4, ...
    'InitialLearnRate', 1e-4, ...
    'VerboseFrequency', 1, ...
    'Plots', 'training-progress', ...
    'ValidationData', validateImages, ...
    'ValidationFrequency', numIterationsPerEpoch);

Output: a trained network (with learned weights)
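
Putting the three inputs together, a sketch (trainImages stands for an imageDatastore of the training set):

net = trainNetwork(trainImages, layers, options);   % returns the trained network
% Later: predictedLabels = classify(net, testImages);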

Q7

32 of 49

The most important hyper-parameter is how long to train!

...the options that include hyper-parameters:

options = trainingOptions('sgdm', ...
    'MiniBatchSize', 32, ...
    'MaxEpochs', 4, ...
    'InitialLearnRate', 1e-4, ...
    'VerboseFrequency', 1, ...
    'Plots', 'training-progress', ...
    'ValidationData', validateImages, ...
    'ValidationFrequency', numIterationsPerEpoch, ...
    'ValidationPatience', 3);

MATLAB docs

Q8

Which curve is the training set error?

Which is the validation set error?

Can you tell where it starts to overfit?

The curves aren’t so smooth in practice, hence the term patience: the number of epochs for which the validation error is larger than the min seen so far (even if not in a row) before training stops, returning the network that gave the min error.

33 of 49

Limitations

Deep learning is a black box - the learned weights are often not intuitive

They require LOTS of training data.

Need many, many (millions of) images to get good accuracy when training from scratch

34 of 49

Overcoming limitations: transfer learning

Some researchers have released their trained networks.

AlexNet, GoogleNet, ResNet, or VGG-19.

Why would we use them? # images, speed, accuracy.

  1. Can you use them directly?
  2. Transfer: Can you swap out and re-train the classification layers for your problem?
  3. Can you run the feature extraction part only and save the activations for an SVM to classify?
  4. Or do you need to start from scratch?

These options are the basis for the next lab and the sunset detector part 2
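
Hedged sketches of options 2 and 3 in MATLAB (layer names and indices vary by network, and trainImages, trainLabels, and options are assumed from earlier, so treat this as illustrative):

net = alexnet;                            % a released, pre-trained network
% Option 2: swap out and re-train the classification layers
layersTransfer = net.Layers(1:end-3);     % keep the feature-extraction layers
newLayers = [layersTransfer
    fullyConnectedLayer(2)                % our 2 classes
    softmaxLayer
    classificationLayer];
sunsetNet = trainNetwork(trainImages, newLayers, options);

% Option 3: run feature extraction only and hand the activations to an SVM
feats = activations(net, trainImages, 'fc7', 'OutputAs', 'rows');
svmModel = fitcsvm(feats, trainLabels);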

MATLAB docs

Q9-11

35 of 49

Questions?

Q12-13

36 of 49

Visualizing CNNs

This next section is the senior thesis work of AJ Piergiovanni, RHIT CS/MA (2015), who went on to study deep learning at Indiana University.

Alternative: new results presented in Stanford course.

For reference

37 of 49

Deconvolutional neural network

[Diagram (after Zeiler & Fergus): a deconvolutional network attached to a convolutional network. Convolutional network path (bottom-up): convolution of the layer below, then pooling, rectification and bias produce the pooled maps. Deconvolutional network path (top-down): take the reconstruction from the layer above, remove the bias and unpool (using the switches saved from the pooled maps), then deconvolve to get the reconstruction of the layer below.]

38 of 49

Inverting max-pooling

  • To invert max-pooling, switches are used to store where the max value is from
    • Some data is lost, but this preserves the max values, which are what affected the features at the higher levels

Zeiler, M.D. & Fergus, R. Visualizing and Understanding Convolutional Networks.
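
A toy MATLAB sketch of the idea: record the switch (argmax position) while pooling one 2x2 region, then put the pooled value back at that position when unpooling:

block = [1 3; 4 6];             % one 2x2 pooling region
[v, sw] = max(block(:));        % v = 6; sw = 4, the "switch" remembers where it was
recon = zeros(2, 2);
recon(sw) = v;                  % unpool: the max goes back to its position;
                                % the other entries stay zero (that data is lost)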

39 of 49

Convolution

  • Takes the output from the convolution and convolves it with the transpose of the filters to reconstruct the input

[Diagram: in the CNN, the input is convolved with each filter, giving two output maps; in the deconvolutional network, those two outputs are convolved with the transposed filters to give one reconstructed input.]

40 of 49

t-SNE

  • Unsupervised dimensionality reduction technique
  • Minimizes the difference between probability distributions defined over pairs of points in the high- and low-dimensional spaces
  • Maps high-dimensional data to 2D space while preserving the local density of the points

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE.
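
In MATLAB this looks roughly like the following (feats stands for the high-dimensional activations; tsne requires the Statistics and Machine Learning Toolbox):

Y = tsne(feats);                     % map N x D activations down to N x 2
gscatter(Y(:,1), Y(:,2), labels);    % plot the 2-D embedding, colored by class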

41 of 49

t-SNE

42 of 49

3-shape classification

Trained a CNN to classify images of rectangles, circles and triangles.

43 of 49

44 of 49

45 of 49

46 of 49

Reconstructed Inputs:

47 of 49

More deconvolution

48 of 49

49 of 49
