1 of 179

CS60050: Machine Learning - Neural Networks

Sourangshu Bhattacharya

2 of 179

Neural Network Basics

  • Given several inputs x1, …, xn, several weights w1, …, wn, and a bias value b

  • A neuron produces a single output: y = s(w1 x1 + ⋯ + wn xn + b)

  • This sum is called the activation of the neuron
  • The function s is called the activation function for the neuron
  • The weights and bias values are typically initialized randomly and learned during training
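The sum-and-squash computation above can be sketched in a few lines of Python; the input, weight, and bias values below are made up for illustration, with a sigmoid as the activation function:

```python
# Illustrative sketch of a single neuron's forward pass.
import math

def neuron(x, w, b):
    # Weighted sum of inputs plus bias: the neuron's activation
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Sigmoid activation function s(a) = 1 / (1 + e^(-a))
    return 1.0 / (1.0 + math.exp(-a))

x = [0.5, -1.0, 2.0]   # inputs (made-up values)
w = [0.1, 0.4, -0.2]   # weights (made-up values)
b = 0.05               # bias
y = neuron(x, w, b)    # a single output in (0, 1)
```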

3 of 179

McCulloch–Pitts “unit”

Output is a "squashed" linear function of the inputs:

[Figure: a single unit. Input links carry activations aj, each weighted by Wj,i; a bias weight W0,i multiplies a fixed input a0 = 1. The input function computes ini = Σj Wj,i aj, the activation function g is applied, and the output ai = g(ini) is sent along the output links.]

A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do.

4 of 179

Activation functions

[Figure: two activation functions g(ini) plotted against ini, each with maximum value +1: (a) a step function, (b) a sigmoid.]

  1. (a) is a step function or threshold function
  2. (b) is the sigmoid function 1/(1 + e^(−ini))

Changing the bias weight W0,i moves the threshold location

5 of 179

Feed forward example

6 of 179

Expressiveness of perceptrons

7 of 179

Feed Forward Neural Networks

8 of 179

Hidden-Layer

  • The hidden layer (L2, L3) represents learned non-linear combinations of the input data
  • For solving the XOR problem, we need a hidden layer
    • some neurons in the hidden layer will activate only for certain combinations of input features
    • the output layer can represent combinations of the activations of the hidden neurons
  • A neural network with one hidden layer is a universal approximator
    • Every continuous function can be modeled as a shallow feed-forward network
    • But not every function can be represented efficiently with a single hidden layer ⇒ we still need deep neural networks
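As a sketch of the XOR point above, the weights below are hand-wired illustrative values (not from the slides): one hidden step unit fires for OR, one for AND, and the output unit combines them.

```python
# Hedged sketch: a hand-wired hidden layer that solves XOR with step units.
def step(a):
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)   # hidden unit: fires for x1 OR x2
    h_and = step(x1 + x2 - 1.5)   # hidden unit: fires only for x1 AND x2
    # output combines hidden activations: OR but not AND
    return step(h_or - h_and - 0.5)

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# outputs == [0, 1, 1, 0]
```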

9 of 179

How to learn the weights

  • Initialise the weights, i.e. Wk,j and Wj,i, with random values
  • With input entries we calculate the predicted output
  • We compare the prediction with the true output
  • The error is calculated
  • The error needs to be sent as feedback for updating the weights

10 of 179

Neural Networks

11 of 179

What is a Neuron?

 

12 of 179

McCulloch–Pitts “unit”

Output is a "squashed" linear function of the inputs:

[Figure: a single unit. Input links carry activations aj, each weighted by Wj,i; a bias weight W0,i multiplies a fixed input a0 = 1. The input function computes ini = Σj Wj,i aj, the activation function g is applied, and the output ai = g(ini) is sent along the output links.]

A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do.

13 of 179

Activation functions

[Figure: two activation functions g(ini) plotted against ini, each with maximum value +1: (a) a step function, (b) a sigmoid.]

  1. (a) is a step function or threshold function
  2. (b) is the sigmoid function 1/(1 + e^(−ini))

Changing the bias weight W0,i moves the threshold location

The purpose of the activation function is to add a non-linear transformation and, in some cases, squash the output to a specified range.

14 of 179

Commonly used Activation functions

15 of 179

Perceptron

16 of 179

Perceptron: a 1-layer NN

Task: Classification

17 of 179


Perceptron: a 1-layer NN

18 of 179

Expressiveness of Perceptrons

19 of 179

Multi-layer Perceptrons

20 of 179

Feed Forward Neural Networks

 

21 of 179

Feed forward Neural Network Computation

22 of 179

Hidden-Layer

  • The hidden layer represents learned non-linear combinations of the input data
  • For solving the XOR problem, we need a hidden layer
      • some neurons in the hidden layer will activate only for certain combinations of input features
      • the output layer can represent combinations of the activations of the hidden neurons

23 of 179

Solution to the XOR problem

24 of 179

Composition of Transformations

25 of 179

  • Each neuron in a neural network can be thought of as computing a useful representation of its inputs.
  • A layer – composed of many neurons – can be thought of as a full representation of the input datapoint.
    • This representation is independent of the other datapoints – hence “distributed representations”.

Distributed feature representation

26 of 179

Example: A useful feature transformation

27 of 179

  • Each neuron in a neural network can be thought of as computing a useful representation of its inputs.
  • A layer – composed of many neurons – can be thought of as a full representation of the input datapoint.
    • This representation is independent of the other datapoints – hence “distributed representations”.
  • The representations are formed hierarchically in each layer – progressively becoming more useful for the end task.
  • The neural network can be thought of as a composition of such “feature encoders”.

Distributed feature representation

28 of 179

Feed forward Neural Network Example

A Toy Neural Network

29 of 179

Feed forward Neural Network Example

30 of 179

Hidden-Layer

  • The hidden layer represents learned non-linear combinations of the input data
  • For solving the XOR problem, we need a hidden layer
      • some neurons in the hidden layer will activate only for certain combinations of input features
      • the output layer can represent combinations of the activations of the hidden neurons
  • A neural network with one hidden layer is a universal approximator
    • Every continuous function can be modeled as a shallow feed-forward network
    • But not every function can be represented efficiently with a single hidden layer ⇒ we still need deep neural networks

31 of 179

Deep Neural Networks

  • Neural networks with more than 2 / 3 / 4 layers are considered “deep”.
    • There is no agreement on the threshold.
  • They require a systematic approach to:
    • Train – backpropagation.
    • Design – a modular approach.
    • Debug – issues like the vanishing gradient.
  • Deeper neural networks have been shown to perform better than shallow ones.
    • Hierarchical buildup of useful features.

32 of 179

Deep Neural Networks – hierarchical features

33 of 179

Shallow to Deep Neural Networks

  • Neural networks can have several hidden layers
  • Initializing the weights randomly and training all layers at once hardly works
  • Instead we train layerwise on unannotated data (a.k.a. pre-training):
    • Train the first hidden layer
    • Fix the parameters for the first layer and train the second layer
    • Fix the parameters for the first & second layers and train the third layer

  • After the pre-training, train all layers using your annotated data
  • The pre-training on your unannotated data creates high-level abstractions of the input data
  • The final training with annotated data fine-tunes all parameters in the network

34 of 179

TRAINING A NEURAL NETWORK

35 of 179

How to learn the weights ?

  • Initialise the weights, i.e. Wk,j and Wj,i, with random values
  • With input entries we calculate the predicted output
  • We compare the prediction with the true output
  • The error is calculated
  • The error needs to be sent as feedback for updating the weights

36 of 179

Input

(Feature Vector)

Output

(Label)

  • Put in Training inputs, get the output
  • Compare output to correct answers: Look at loss function J
  • Adjust and repeat!
  • Backpropagation tells us how to make a single adjustment using calculus.

How to Train a Neural Net?

37 of 179

  • Gradient Descent!

          • Make prediction�
          • Calculate Loss�
          • Calculate gradient of the loss function w.r.t. parameters�
          • Update parameters by taking a step in the opposite direction�
          • Iterate

How to Train a Neural Net?
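The five steps above can be sketched as a minimal gradient-descent loop on a 1-D least-squares problem; the data and learning rate below are made-up illustrative values, not from the slides.

```python
# Minimal gradient-descent sketch: fit y = w*x to data by least squares.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # true relation: y = 2x
w, lr = 0.0, 0.05             # initial parameter and learning rate

for _ in range(200):                                              # iterate
    preds = [w * x for x in xs]                                   # make prediction
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys))           # calculate loss
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) # dLoss/dw
    w -= lr * grad                       # step in the opposite direction

# w converges to ~2.0
```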

38 of 179

How to Train a Neural Net?

  • Gradient Descent!

          • Make prediction�
          • Calculate Loss
          • Calculate gradient of the loss function w.r.t. parameters�
          • Update parameters by taking a step in the opposite direction�
          • Iterate

39 of 179


Forward Propagation

40 of 179

Pass in input

Forward Propagation

41 of 179

Calculate each layer

Forward Propagation

42 of 179

Get output

Forward Propagation

43 of 179

Forward Propagation

44 of 179

How to Train a Neural Net?

  • Gradient Descent!

          • Make prediction�
          • Calculate Loss�
          • Calculate gradient of the loss function w.r.t. parameters�
          • Update parameters by taking a step in the opposite direction�
          • Iterate

45 of 179

 

How to Train a Neural Net?

46 of 179

 

How to Train a Neural Net?

47 of 179

 

Feedforward Neural Network

Want: the gradient of the loss with respect to each weight

48 of 179

Want: the gradient of the loss with respect to each weight

Backward Pass

49 of 179


Backward Pass

50 of 179


Backward Pass

51 of 179


Backward Pass

52 of 179


Backpropagation Formula

53 of 179

How to Train a Neural Net?

 

54 of 179

Going from Shallow to Deep Neural Networks

  • Neural networks can have several hidden layers
  • Initializing the weights randomly and training all layers at once hardly works
  • Instead we train layerwise on unannotated data (a.k.a. pre-training):
    • Train the first hidden layer
    • Fix the parameters for the first layer and train the second layer
    • Fix the parameters for the first & second layers and train the third layer

  • After the pre-training, train all layers using your annotated data
  • The pre-training on your unannotated data creates high-level abstractions of the input data
  • The final training with annotated data fine-tunes all parameters in the network

55 of 179

56 of 179

57 of 179

58 of 179

Computational Graph

Definition: a data structure for storing gradients of variables used in computations.

  • Node v represents a variable
    • Stores its value
    • Its gradient
    • The function that created the node
  • A directed edge (u,v) represents the partial derivative of u w.r.t. v
  • To compute the gradient dL/dv, find the unique path from L to v and multiply the edge weights.
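A minimal sketch of this idea (the `Node` class and helper functions are illustrative, not a real autograd API): each node stores its value, its gradient, and its incoming edges, and `backward()` pushes gradients along the edges via the chain rule.

```python
# Tiny scalar computational graph with reverse-mode gradient accumulation.
class Node:
    def __init__(self, value, parents=()):
        self.value = value          # stored value
        self.grad = 0.0             # stored gradient dL/d(this node)
        self.parents = parents      # pairs (parent node, local partial derivative)

    def backward(self, upstream=1.0):
        # accumulate the gradient, then push it along the incoming edges
        self.grad += upstream
        for parent, local_grad in self.parents:
            parent.backward(upstream * local_grad)

def add(u, v):
    return Node(u.value + v.value, [(u, 1.0), (v, 1.0)])

def mul(u, v):
    return Node(u.value * v.value, [(u, v.value), (v, u.value)])

x = Node(3.0)
y = Node(4.0)
L = mul(add(x, y), y)   # L = (x + y) * y = 28
L.backward()
# dL/dx = y = 4, dL/dy = x + 2y = 11
```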

59 of 179

Backpropagation for neural nets

Given softmax activation, L2 loss, and a point (x1, x2, x3, y) = (0.1, 0.15, 0.2, 1), compute the gradient

60 of 179

Backpropagation for neural nets: forward pass

61 of 179

Backpropagation for neural nets: backward pass

62 of 179

Computation Graphs

63 of 179

CONVOLUTIONAL NEURAL NETWORKS

64 of 179

Motivation – Image Data

  • So far, the structure of our neural network treats all inputs interchangeably.
  • No relationships between the individual inputs
  • Just an ordered set of variables

  • We want to incorporate domain knowledge into the architecture of a Neural Network.

65 of 179

Motivation

  • Image data has important structures, such as:

                  • ”Topology” of pixels
                  • Translation invariance
                  • Issues of lighting and contrast
                  • Knowledge of human visual system
                  • Nearby pixels tend to have similar values
                  • Edges and shapes
                  • Scale Invariance – objects may appear at different sizes in the image.

66 of 179

Motivation – Image Data

  • Fully connected would require a vast number of parameters
  • MNIST images are small (28 x 28 pixels) and in grayscale
  • Color images are more typically at least 200 x 200 pixels x 3 color channels (RGB) = 120,000 values.
  • A single fully connected layer would require (200x200x3)^2 ≈ 14.4 billion weights!
  • Variance (in terms of bias-variance) would be too high
  • So we introduce “bias” by structuring the network to look for certain kinds of patterns

67 of 179

Motivation

  • Features need to be “built up”
  • Edges -> shapes -> relations between shapes
  • Textures

  • Cat = two eyes in certain relation to one another + cat fur texture.
  • Eyes = dark circle (pupil) inside another circle.
  • Circle = particular combination of edge detectors.
  • Fur = edges in certain pattern.

68 of 179

Kernels

 

69 of 179

Kernel: 3x3 Example

Input:
3 2 1
1 2 3
1 1 1

Kernel:
-1 0 1
-2 0 2
-1 0 1

Output: ?

70 of 179

Kernel: 3x3 Example

Input:
3 2 1
1 2 3
1 1 1

Kernel:
-1 0 1
-2 0 2
-1 0 1

Output: ?

71 of 179

Kernel: 3x3 Example

Input:
3 2 1
1 2 3
1 1 1

Kernel:
-1 0 1
-2 0 2
-1 0 1

Output: 2  (= 3·(−1) + 2·0 + 1·1 + 1·(−2) + 2·0 + 3·2 + 1·(−1) + 1·0 + 1·1)
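The slide's 3x3 example can be checked directly: the output value is the element-wise product of the kernel with the image patch, summed.

```python
# Reproducing the slide's 3x3 convolution example.
patch = [[3, 2, 1],
         [1, 2, 3],
         [1, 1, 1]]
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

# Element-wise multiply and sum over all 9 positions
out = sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))
# out == 2, matching the slide
```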

72 of 179

Kernel: Example

73 of 179

Kernels as Feature Detectors

Can think of kernels as “local feature detectors”

Vertical Line Detector:
-1  1 -1
-1  1 -1
-1  1 -1

Horizontal Line Detector:
-1 -1 -1
 1  1  1
-1 -1 -1

Corner Detector:
-1 -1 -1
-1  1  1
-1  1  1

74 of 179

Convolutional Neural Nets

Primary Ideas behind Convolutional Neural Networks:

  • Let the Neural Network learn which kernels are most useful
  • Use same set of kernels across entire image (translation invariance)
  • Reduces number of parameters and “variance” (from bias-variance point of view)

75 of 179

Convolutions

76 of 179

Convolution Settings – Grid Size

Grid Size (Height and Width):

  • The number of pixels a kernel “sees” at once
  • Typically use odd numbers so that there is a “center” pixel
  • Kernel does not need to be square

Height: 3, Width: 3

Height: 1, Width: 3

Height: 3, Width: 1

77 of 179

Convolution Settings - Padding

Padding

  • Using Kernels directly, there will be an “edge effect”
  • Pixels near the edge will not be used as “center pixels” since there are not enough surrounding pixels
  • Padding adds extra pixels around the frame
  • So every pixel of the original image will be a center pixel as the kernel moves across the image
  • Added pixels are typically of value zero (zero-padding)

78 of 179

Without Padding

79 of 179

With Padding

80 of 179

Convolution Settings

Stride

  • The ”step size” as the kernel moves across the image
  • Can be different for vertical and horizontal steps (but usually is the same value)
  • When stride is greater than 1, it scales down the output dimension
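The combined effect of kernel size, padding, and stride can be sketched with the usual output-size rule (assuming the convention out = (N + 2P − F)/S + 1 for input size N, kernel size F, padding P, stride S):

```python
# Output-size rule for a convolution along one dimension.
def conv_out_size(n, f, p, s):
    # n: input size, f: kernel size, p: padding, s: stride
    return (n + 2 * p - f) // s + 1

conv_out_size(7, 3, 0, 1)   # 5: 3x3 kernel, no padding, stride 1
conv_out_size(7, 3, 1, 1)   # 7: padding 1 preserves the input size
conv_out_size(7, 3, 0, 2)   # 3: stride 2 scales the output down
```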

81 of 179

Stride 2 Example – No Padding


82 of 179

Stride 2 Example – With Padding


83 of 179

Convolutional Settings - Depth

  • In images, we often have multiple numbers associated with each pixel location.
  • These numbers are referred to as “channels”
  • RGB image – 3 channels
  • CMYK – 4 channels
  • The number of channels is referred to as the “depth”
  • So the kernel itself will have a “depth” the same size as the number of input channels
  • Example: a 5x5 kernel on an RGB image
    • There will be 5x5x3 = 75 weights

84 of 179

Convolutional Settings - Depth

  • The output from the layer will also have a depth
  • The networks typically train many different kernels
  • Each kernel outputs a single number at each pixel location
  • So if there are 10 kernels in a layer, the output of that layer will have depth 10.

85 of 179

Pooling

  • Idea: Reduce the image size by mapping a patch of pixels to a single value.
  • Shrinks the dimensions of the image.
  • Does not have parameters, though there are different types of pooling operations.

86 of 179

Pooling: Max-pool

  • For each distinct patch, represent it by the maximum
  • 2x2 maxpool shown below
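A minimal sketch of 2x2 max-pooling with stride 2 (the input values below are made up for illustration):

```python
# 2x2 max-pooling: each distinct 2x2 patch is replaced by its maximum.
def maxpool2x2(img):
    h, w = len(img), len(img[0])
    return [[max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

img = [[1, 3, 2, 4],
       [5, 6, 1, 0],
       [7, 2, 9, 8],
       [0, 1, 3, 4]]
maxpool2x2(img)   # [[6, 4], [7, 9]]
```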

87 of 179

Pooling: Average-pool

  • For each distinct patch, represent it by the average
  • 2x2 avgpool shown below.

88 of 179

ConvNet: CONV, RELU, POOL and FC Layers

89 of 179

Convolution Layer

90 of 179

Convolution Layer

consider a second, green filter

91 of 179

Convolution Layer

92 of 179

ReLU (Rectified Linear Units) Layer

  • This is a layer of neurons that applies the activation function f(x)=max(0,x).
  • It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer.
  • Other functions are also used to increase nonlinearity, for example the hyperbolic tangent f(x)=tanh(x), and the sigmoid function.
  • This is also known as a ramp function.

93 of 179

A Basic ConvNet

94 of 179

What is convolution of an image with a filter

95 of 179

Details about the convolution layer

96 of 179

Details about the convolution layer

97 of 179

Details about the convolution layer

98 of 179

Convolution layer examples

99 of 179

Pooling Layer

100 of 179

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016


Where ReLU is used as f.

Convolutional Neural Networks

+ ReLu

101 of 179


Kernel:
1 0 1
0 1 0
1 0 1

Convolutional Neural Networks

102 of 179

Applications

103 of 179

Applications

104 of 179

ConvNet: CONV, RELU, POOL and FC Layers

Pytorch Implementation

105 of 179

ConvNet: CONV, RELU, POOL and FC Layers

Pytorch Implementation

106 of 179

EVOLUTION OF MODEL ARCHITECTURES

107 of 179

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

108 of 179

ILSVRC

109 of 179

AlexNet

Architecture

CONV1 → MAX POOL1 → NORM1 → CONV2 → MAX POOL2 → NORM2 → CONV3 → CONV4 → CONV5 → MAX POOL3 → FC6 → FC7 → FC8

  • Input: 227x227x3 images (224x224 before padding)
  • First layer: 96 11x11 filters applied at stride 4
  • Output volume size?

(N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96]

  • Number of parameters in this layer? (11*11*3)*96 = 35K
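The slide's arithmetic can be checked directly (the ~35K figure rounds the exact weight count; biases are excluded):

```python
# AlexNet first layer: output size and parameter count.
N, F, S = 227, 11, 4              # input size, filter size, stride
out = (N - F) // S + 1            # 55 -> output volume 55x55x96
params = (11 * 11 * 3) * 96       # 34,848 weights (~35K), biases excluded
```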

110 of 179

AlexNet

Details/Retrospectives:

  • first use of ReLU
  • used Norm layers (not common anymore)
  • heavy data augmentation
  • dropout 0.5
  • batch size 128
  • 7 CNN ensemble

111 of 179

ILSVRC winners

112 of 179

VGGNet

  • Smaller filters

Only 3x3 CONV filters (stride 1, pad 1) and 2x2 MAX POOL (stride 2)

  • Deeper network

AlexNet: 8 layers

VGGNet: 16 - 19 layers

  • ZFNet: 11.7% top 5 error in ILSVRC’13
  • VGGNet: 7.3% top 5 error in ILSVRC’14

Input

3x3 conv, 64

3x3 conv, 64 Pool 1/2

3x3 conv, 128

3x3 conv, 128 Pool 1/2

3x3 conv, 256

3x3 conv, 256 Pool 1/2

3x3 conv, 512

3x3 conv, 512

3x3 conv, 512 Pool 1/2

3x3 conv, 512

3x3 conv, 512

3x3 conv, 512 Pool 1/2

FC 4096

FC 4096

FC 1000

Softmax

113 of 179

VGGNet

[Simonyan and Zisserman, 2014]

  • Why use smaller filters? (3x3 conv)

Stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer.

  • What is the effective receptive field of three 3x3 conv (stride 1) layers?

7x7

But deeper, more non-linearities

And fewer parameters: 3 × (3²C²) vs. 7²C² for C channels per layer
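The parameter comparison can be checked numerically; C = 64 below is an illustrative channel count, and biases are ignored:

```python
# Three stacked 3x3 conv layers vs. one 7x7 conv layer, C channels in and out.
def stacked_3x3_params(C):
    return 3 * (3 * 3 * C * C)    # 3 layers, each 3x3xCxC weights

def single_7x7_params(C):
    return 7 * 7 * C * C          # one layer, 7x7xCxC weights

stacked_3x3_params(64)   # 110,592
single_7x7_params(64)    # 200,704 -> the stack uses ~45% fewer parameters
```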

114 of 179

ILSVRC winners

115 of 179

GoogLeNet

  • 22 layers
  • Efficient “Inception” module - strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure
  • No FC layers
  • Only 5 million parameters!
  • ILSVRC’14 classification winner (6.7% top 5 error)

116 of 179

GoogLeNet

“Inception module”: design a good local network topology (a network within a network) and then stack these modules on top of each other

[Szegedy et al., 2014]

[Figure: Inception module. The previous layer feeds parallel branches – a 1x1 convolution; a 1x1 convolution followed by a 3x3 convolution; a 1x1 convolution followed by a 5x5 convolution; and 3x3 max pooling followed by a 1x1 convolution – whose outputs are joined by filter concatenation.]

117 of 179

ILSVRC winners

118 of 179

ResNet

  • Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015
  • Extremely deep network – 152 layers
  • Deeper neural networks are more difficult to train.
  • Deep networks suffer from vanishing and exploding gradients.
  • Present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.

[He et al., 2015]

119 of 179

ResNet

  • What happens when we continue stacking deeper layers on a convolutional neural network?
  • 56-layer model performs worse on both training and test error

-> The deeper model performs worse (not caused by overfitting)!

120 of 179

ResNet

  • Hypothesis: the problem is an optimization problem; very deep networks are harder to optimize.
  • Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.

  • We will use skip connections, allowing us to take the activation from one layer and feed it into another layer much deeper in the network.
  • Use layers to fit the residual F(x) = H(x) – x instead of H(x) directly

121 of 179

ResNet

Residual Block

Input x goes through a conv-relu-conv series and gives us F(x). That result is then added to the original input x: H(x) = F(x) + x.

In traditional CNNs, H(x) would just be equal to F(x). So instead of computing the transformation straight from x to H(x), we compute the term F(x) that we have to add to the input x.
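The wiring above can be sketched as follows; `toy_conv` is a made-up stand-in for the real conv layers (element-wise scaling only), just to show the skip connection H(x) = F(x) + x:

```python
# Hedged sketch of a residual block: output = F(x) + x.
def relu(v):
    return [max(0.0, a) for a in v]

def toy_conv(v, w):
    # placeholder "layer": element-wise scaling, just to show the wiring
    return [w * a for a in v]

def residual_block(x, w1, w2):
    f = toy_conv(relu(toy_conv(x, w1)), w2)   # F(x): conv-relu-conv series
    return [fi + xi for fi, xi in zip(f, x)]  # H(x) = F(x) + x

residual_block([1.0, -2.0, 3.0], 0.5, 0.1)
# if F(x) is near zero, the block approximates the identity mapping
```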

122 of 179

ResNet

Full ResNet architecture:

  • Stack residual blocks
  • Every residual block has two 3x3 conv layers
  • Periodically, double # of filters and downsample spatially using stride 2 (in each dimension)
  • Additional conv layer at the beginning
  • No FC layers at the end (only FC 1000 to output classes)

[He et al., 2015]

123 of 179

ResNet

  • Total depths of 34, 50, 101, or 152 layers for ImageNet
  • For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet)

[He et al., 2015]

124 of 179

ResNet

Experimental Results:

  • Able to train very deep networks without degrading
  • Deeper networks now achieve lower training errors as expected

One of the best CNN architectures that we currently have, and a great innovation built on the idea of residual learning.

Even better than human performance!

125 of 179

ILSVRC winners

126 of 179

ARCHITECTURES FOR ADVANCED APPLICATIONS

127 of 179

Computer Vision Tasks

Classification: CAT (no spatial extent)

Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)

Object Detection: DOG, DOG, CAT (multiple objects)

Instance Segmentation: DOG, DOG, CAT

May 20, 2021

128 of 179

Semantic Segmentation


129 of 179

Semantic Segmentation: The Problem

GRASS, CAT, TREE, SKY, ...

Paired training data: for each training image, each pixel is labeled with a semantic category.

At test time, classify each pixel of a new image.


130 of 179

Semantic Segmentation Idea: Sliding Window

Full image


Impossible to classify without context

Q: how do we include context?

131 of 179

Semantic Segmentation Idea: Sliding Window

Full image → extract patch → classify center pixel with CNN (e.g. Cow, Cow, Grass)

Problem: Very inefficient! Not reusing shared features between overlapping patches

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

132 of 179

Semantic Segmentation Idea: Convolution

Full image

An intuitive idea: encode the entire image with conv net, and do semantic segmentation on top.

Problem: classification architectures often reduce feature spatial sizes to go deeper, but semantic segmentation requires the output size to be the same as input size.

133 of 179

Semantic Segmentation Idea: Fully Convolutional

Input: 3 × H × W → [conv layers] → Convolutions: D × H × W → Scores: C × H × W → argmax → Predictions: H × W

Design a network with only convolutional layers without downsampling operators to make predictions for pixels all at once!

134 of 179

Semantic Segmentation Idea: Fully Convolutional

Input: 3 × H × W → [conv layers] → Convolutions: D × H × W → Scores: C × H × W → argmax → Predictions: H × W

Design a network with only convolutional layers without downsampling operators to make predictions for pixels all at once!

Problem: convolutions at original image resolution will be very expensive ...

135 of 179

Semantic Segmentation Idea: Fully Convolutional

Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network:

Input: 3 × H × W → High-res: D1 × H/2 × W/2 → Med-res: D2 × H/4 × W/4 → Low-res: D3 × H/4 × W/4 → Med-res: D2 × H/4 × W/4 → High-res: D1 × H/2 × W/2 → Scores: C × H × W → Predictions: H × W

136 of 179

Semantic Segmentation Idea: Fully Convolutional

Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network:

Input: 3 × H × W → High-res: D1 × H/2 × W/2 → Med-res: D2 × H/4 × W/4 → Low-res: D3 × H/4 × W/4 → Med-res: D2 × H/4 × W/4 → High-res: D1 × H/2 × W/2 → Scores: C × H × W → Predictions: H × W

Downsampling: pooling, strided convolution

Upsampling: ???

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

137 of 179

OBJECT DETECTION

138 of 179

Object Detection

Classification: CAT (no spatial extent)

Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)

Object Detection: DOG, DOG, CAT (multiple objects)

Instance Segmentation: DOG, DOG, CAT

139 of 179

Object Detection: Single Object

(Classification + Localization)

Class Scores

Cat: 0.9

Dog: 0.05

Car: 0.01

...

Vector:

4096

Fully Connected: 4096 to 1000

Box Coordinates (x, y, w, h)

Fully Connected: 4096 to 4

x, y

h

w

140 of 179

Object Detection: Single Object

(Classification + Localization)


Class Scores

Cat: 0.9

Dog: 0.05

Car: 0.01

...

Vector:

4096

Fully Connected: 4096 to 1000

Box Coordinates

Fully Connected: 4096 to 4

Softmax Loss

L2 Loss

Loss

Correct label:

Cat

+

Multitask Loss

x, y

h

w

(x, y, w, h)

Treat localization as a regression problem!

Correct box: (x’, y’, w’, h’)


141 of 179

Each image needs a different number of outputs!


CAT: (x, y, w, h)

DOG: (x, y, w, h)

DOG: (x, y, w, h)

CAT: (x, y, w, h)

DUCK: (x, y, w, h)

DUCK: (x, y, w, h)

….


4 numbers

12 numbers

Many numbers!

Object Detection: Multiple Objects

142 of 179

Object Detection: Multiple Objects

Fei-Fei Li, Ranjay Krishna, Danfei Xu


Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Problem: Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive!

Dog? NO   Cat? YES

Background? NO

143 of 179

Region Proposals: Selective Search


  • Find “blobby” image regions that are likely to contain objects
  • Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU


Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013

Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014

Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014

144 of 179

“Slow” R-CNN


Warped image regions (224x224 pixels)

Regions of Interest (RoI) from a proposal method (~2k)

Forward each region through ConvNet

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.

Classify regions with SVMs

Input image

ConvNet

ConvNet

ConvNet

SVMs

SVMs

SVMs

Bbox reg

Bbox reg

Bbox reg

Predict “corrections” to the RoI: 4 numbers: (dx, dy, dw, dh)

145 of 179

“Slow” R-CNN


Warped image regions (224x224 pixels)

Regions of Interest (RoI) from a proposal method (~2k)

Forward each region through ConvNet

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.

Classify regions with SVMs

Input image

ConvNet

ConvNet

ConvNet

SVMs

SVMs

SVMs

Bbox reg

Bbox reg

Bbox reg

Predict “corrections” to the RoI: 4 numbers: (dx, dy, dw, dh)

Problem: Very slow! Need to do ~2k independent forward passes for each image!

Idea: Pass the image through convnet before cropping! Crop the conv feature instead!

146 of 179

Fast R-CNN


ConvNet

Input image

Run whole image through ConvNet

“conv5” features

Crop + Resize features

Linear + softmax

CNN

Per-Region Network

Object category

Linear

Box offset

Girshick, “Fast R-CNN”, ICCV 2015.

Regions of Interest (RoIs)

from a proposal

method

“Backbone” network: AlexNet, VGG, ResNet, etc

147 of 179

Object Detection: Faster R-CNN


RoI pooling / RoI alignment creates fixed-size feature maps from the convolutional feature maps of each RoI.

148 of 179

Instance Segmentation: Mask R-CNN


Mask Prediction


He et al, “Mask R-CNN”, ICCV 2017

Add a small mask network that operates on each RoI and predicts a 28x28 binary mask

149 of 179

Mask R-CNN: Very Good Results!


He et al, “Mask R-CNN”, ICCV 2017


150 of 179

Summary: Lots of computer vision tasks!


Classification: CAT (no spatial extent)

Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)

Object Detection: DOG, DOG, CAT (multiple objects)

Instance Segmentation: DOG, DOG, CAT

151 of 179

References

152 of 179

AUTOENCODERS

153 of 179

Autoencoders

Autoencoders

  • Unsupervised learning algorithm
  • Given an input x, we learn a compressed representation of the input, which we then try to reconstruct
  • In the simplest form: a feed-forward network with hidden size < input size
  • We then search for parameters such that x̂ ≈ x for all training examples
  • The error function is the reconstruction error, e.g. E = Σ ‖x̂ − x‖²
  • Once we have finished training, we are interested in the compressed representation, i.e. the values of the hidden units

Source: http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity
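A minimal sketch of such an autoencoder (the layer sizes, random weights, and input values below are illustrative, and no training is shown): encode to a smaller hidden vector, decode back, and measure the reconstruction error.

```python
# Toy autoencoder forward pass with hidden size < input size.
import math
import random

random.seed(0)
n_in, n_hid = 4, 2
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
W2 = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_in)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def encode(x):
    # compressed representation: the hidden-unit values
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def decode(z):
    return [sigmoid(sum(w * zi for w, zi in zip(row, z))) for row in W2]

x = [0.2, 0.9, 0.1, 0.7]
x_hat = decode(encode(x))                            # reconstruction
error = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # reconstruction error
# training would adjust W1, W2 to drive this error down
```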

154 of 179

Why would we use autoencoders?

  • What does a randomly generated image look like?

155 of 179

Why would we use autoencoders?

  • What would be the probability to get an image like this from random sampling?

02.09.2014 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Nils Reimers |

156 of 179

Why would we use autoencoders?

  • Produce a compressed representation of a high-dimensional input (for example images)
  • The compression is lossy. Learning drives the encoder to compress well, in particular for the training examples
  • For random input, the reconstruction error will be high
  • The autoencoder learns to abstract properties from the input. What defines a natural image? Color gradients, straight lines, edges etc.
  • The abstract representation of the input can make a further classification task much easier

157 of 179

Dimension reduction can simplify classification tasks – MNIST Task

158 of 179

Dimension reduction can simplify classification tasks – MNIST Task

  • Histogram-plot of test error on the MNIST hand written digit recognition.
  • Comparison of neural network with and without pretraining

Source: Erhan et al, 2010, Why Does Unsupervised Pre-training Help Deep Learning?

159 of 179

Autoencoders vs. PCA

  • Principal component analysis (PCA) converts a set of correlated variables to a set of linearly uncorrelated variables called principal components
  • PCA is a standard method to break down high-dimensional vector spaces, e.g. for information extraction or visualization
  • However, PCA can only capture linear correlations

 

Autoencoders

Encoder: z = s(Wx + b)    Decoder: x̂ = s(W′z + b′)

160 of 179

Autoencoders vs. PCA - Example

LSA

Deep Autoencoder

  • Articles from Reuter corpus were mapped to a 2000 dimensional vector, using the 2000 most common word stems

Source: Hinton et al., Reducing the Dimensionality of Data with Neural Networks

161 of 179

How to ensure the encoder does not learn the identity function?

Identity Function

  • Learning the identity function would not be helpful
  • Different approaches to ensure this:
    • Bottleneck constraint: The hidden layer is (much) smaller than the input layer
    • Sparse coding: Forcing many hidden units to be zero or near zero
    • Denoising encoder: Add randomness to the input and/or the hidden values

Denoising Encoder

  • Create some random noise ε
  • Compute the corrupted input x̃ = x + ε and its reconstruction x̂
  • Reconstruction error: ‖x − x̂‖², measured against the clean input x

  • Alternatively: Set some of the neurons (e.g. 50%) to zero

  • The noise forces the hidden layer to learn more robust features
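The two corruption schemes above can be sketched as follows (the noise scale and masking fraction are arbitrary choices, and the encoder/decoder are left abstract):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=16)           # a clean input vector

# (a) Additive Gaussian noise: x_tilde = x + eps
x_tilde_gauss = x + 0.1 * rng.normal(size=x.shape)

# (b) Masking noise: set roughly 50% of the inputs to zero
mask = rng.random(x.shape) >= 0.5
x_tilde_mask = x * mask

# In both cases the reconstruction error is measured against the *clean*
# input, e.g. loss = mean((decoder(encoder(x_tilde)) - x) ** 2),
# which is what forces the hidden layer to learn robust features.
```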

162 of 179

Stacking Autoencoders

  • We can stack multiple hidden layers to create a deep autoencoder
  • These are especially suitable for highly non-linear tasks
  • The layers are trained layer-wise – one at a time

Step 1: Train single layer autoencoder until convergence

163 of 179

Stacking Autoencoders

Step 2: Add an additional hidden layer and train it by trying to reconstruct the output of the previous hidden layer. The previously trained layers are not changed. The error function is the reconstruction error on the previous hidden layer's activations.

164 of 179

Stacking Autoencoders – Fine-tuning

  • After pretraining all hidden layers, the deep autoencoder is fine-tuned

Unsupervised Fine-Tuning:

  • Apply backpropagation to the complete deep autoencoder
  • Error function: the reconstruction error of the original input

  • For further details, see Hinton et al.
  • (It appears that supervised fine-tuning is more common nowadays)

Supervised Fine-Tuning:

  • Use your classification task to fine-tune your autoencoders
  • A softmax layer is added after the last hidden layer
  • Weights are tuned using backpropagation
  • See next slides for an example, or http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

165 of 179

Stacking Autoencoders - Example

Pretrain first autoencoder

  • Train an autoencoder to get the first weight matrix and first bias vector
  • The second weight matrix, connecting the hidden and the output units, will be disregarded after the first pretraining step
  • Stop after a certain number of iterations

Source: http://ufldl.stanford.edu/wiki/

166 of 179

Stacking Autoencoders - Example

Pretrain second autoencoder

  • Use the values of the previous hidden units as input for the next autoencoder.
  • Train as before

Source: http://ufldl.stanford.edu/wiki/

167 of 179

Stacking Autoencoders - Example

Pretrain softmax layer

  • After second pretraining finishes, add a softmax layer for your classification task
  • Pretrain this layer using back propagation

Source: http://ufldl.stanford.edu/wiki/

168 of 179

Stacking Autoencoders - Example

Fine-tuning

  • Plug all layers together
  • Compute the costs based on the actual input
  • Update all weights using backpropagation

Source: http://ufldl.stanford.edu/wiki/
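The whole greedy layer-wise procedure from the preceding slides can be sketched in plain NumPy (layer sizes, data, and hyperparameters are arbitrary choices; each layer is trained as a one-layer autoencoder on the activations below it, and the decoder half is then discarded):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(A, n_hid, lr=0.5, n_iter=500):
    """Train a one-layer autoencoder on activations A; return encoder weights."""
    n_in = A.shape[1]
    W = 0.1 * rng.normal(size=(n_in, n_hid)); b = np.zeros(n_hid)
    W2 = 0.1 * rng.normal(size=(n_hid, n_in)); b2 = np.zeros(n_in)
    for _ in range(n_iter):
        H = sigmoid(A @ W + b)                # encode
        err = (H @ W2 + b2) - A               # reconstruction error signal
        dH = err @ W2.T * H * (1 - H)
        W2 -= lr * H.T @ err / len(A); b2 -= lr * err.mean(axis=0)
        W  -= lr * A.T @ dH / len(A); b  -= lr * dH.mean(axis=0)
    return W, b                               # the decoder (W2, b2) is discarded

X = rng.uniform(size=(100, 8))
activations, encoder = X, []
for n_hid in (6, 4):                          # two stacked hidden layers
    W, b = pretrain_layer(activations, n_hid)
    encoder.append((W, b))
    activations = sigmoid(activations @ W + b)  # input to the next layer

# `encoder` now initializes the deep net; fine-tuning (unsupervised, or
# supervised with an added softmax layer) then adjusts all layers jointly
# with backpropagation.
```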

169 of 179

Is pre-training really necessary?

  • Xavier Glorot and Yoshua Bengio, 2010, Understanding the difficulty of training deep feedforward neural networks
  • With the right activation function and initialization, the importance of pre-training decreases

170 of 179

Is pre-training really necessary?

  • Pre-training achieves two things:
    • It makes optimization easier
    • It reduces overfitting

  • Pre-training is not required to make optimization work if you have enough data
    • Mainly due to a better understanding of how initialization works

  • Pre-training is still very effective on small datasets

171 of 179

Dropout in Neural Networks


26.10.2015 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Nils Reimers |

Inspired by Hinton https://www.youtube.com/watch?v=vShMxxqtDDs

For details: Srivastava, Hinton et al., 2014, Dropout: A Simple Way to Prevent Neural Networks from Overfitting

172 of 179

Ensemble Learning


  • Create many different models and combine them at test time to make predictions

  • Averaging over different models is very effective against overfitting

  • Random Forest
    • A single decision tree is not very powerful
    • Create hundreds of different trees and combine them

  • Random forests work really well
    • Several Kaggle competitions, e.g. Netflix, were won by random forests

173 of 179

Model Averaging with Neural Nets


  • We would like to do massive model averaging
    • Average over 100, 1,000, 10,000 or 100,000 models

  • Each net takes a long time to train
    • We don’t have enough time to learn so many models

  • At test time, we don’t want to run lots of large neural nets

  • We need something that is more efficient
    • Use dropouts!

174 of 179

Dropout

  • Each time we present a training example, we drop out 50% of the hidden units

  • With this, we randomly sample from 2^H different architectures
    • H: number of hidden units
  • All architectures share the same weights
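The training-time mechanism is just an elementwise binary mask on the hidden activations; a minimal sketch (batch size, layer width, and p = 0.5 are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.5                                  # dropout probability

h = rng.uniform(size=(32, 100))          # hidden activations, batch of 32
mask = rng.random(h.shape) >= p          # keep each unit with probability 1 - p
h_dropped = h * mask                     # a different "thinned" network each time

# Each fresh mask selects one of the 2**H possible architectures;
# all of them share the same underlying weight matrices.
```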


175 of 179

Dropout


  • With H hidden units, we sample from 2^H different models
    • Only a few of the models ever get trained, and each sees only one training example

  • Sharing of weights means that every model is strongly regularized
    • Much better than L1 and L2 regularization, which pull weights towards zero
    • Weight sharing pulls weights towards what other models need
    • Weights are pulled towards sensible values

  • This works extremely well in experiments

176 of 179

Dropout – at test time


  • We could sample many different architectures and average the output
    • This would be way too slow

  • Instead: Use all hidden units and halve their outgoing weights
    • This computes the geometric mean of the predictions of all 2^H models
    • We can use dropout rates other than p = 0.5; at test time, multiply the weights by 1 − p

  • Using this trick, we train and use trillions of “different” models

  • For the input layer:
    • We could also apply dropout to the input layer
    • The dropout probability should then be smaller than 0.5
    • This is known as a denoising autoencoder
    • Currently this cannot be implemented in out-of-the-box Keras
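The weight-scaling trick can be checked numerically: the pre-activation averaged over many random dropout masks converges to the test-time pre-activation computed with all units and weights scaled by 1 − p (sketch with arbitrary sizes and p = 0.5):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.5
h = rng.uniform(size=100)                # hidden activations
W = rng.normal(size=(100, 10))           # outgoing weights

# Average dropped-out pre-activations over many sampled masks.
n_samples = 20000
acc = np.zeros(10)
for _ in range(n_samples):
    mask = rng.random(h.shape) >= p      # keep with probability 1 - p
    acc += (h * mask) @ W
dropout_mean = acc / n_samples

# Test-time shortcut: use all units, scale the weights by 1 - p.
test_time = h @ (W * (1 - p))
# dropout_mean is close to test_time, which is why the shortcut works.
```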

177 of 179

How well does dropout work?

Source: Srivastava et al., 2014, Dropout: A Simple Way to Prevent Neural Networks from Overfitting


Classification error on MNIST dataset

178 of 179

How well does dropout work?

  • If your deep neural network is significantly overfitting, dropout will reduce the number of errors a lot

  • If your deep neural network is not overfitting, you should be using a bigger one
    • Our brain: #parameters >> #experiences
    • Synapses are much cheaper than experiences


179 of 179

Another way to think about Dropout



  • In a fully connected neural network, a hidden unit knows which other hidden units are present
    • The hidden units co-adapt with each other on the training data
    • But big, complex conspiracies are not robust ⇒ they fail at test time

  • In the dropout scenario, each unit has to work with different sets of co-workers
    • It is likely that the hidden unit does something individually useful
    • It still tries to be different from its co-workers