CS60050: Machine Learning
Neural Networks
Sourangshu Bhattacharya
Neural Network Basics
McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:
[Figure: a single unit. Input links carry activations aj, weighted by Wj,i; a fixed input a0 = −1 carries the bias weight W0,i. The input function computes ini = Σj Wj,i aj; the activation function g gives the output ai = g(ini), passed along the output links.]
A gross oversimplification of real neurons, but its purpose is
to develop understanding of what networks of simple units can do
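The unit above can be sketched in a few lines. This is an illustrative example, not code from the slides; the input values and weights are made up.

```python
import numpy as np

def unit_output(a, w, g):
    """One McCulloch-Pitts-style unit: in_i = sum_j W_ji * a_j, a_i = g(in_i)."""
    in_i = float(np.dot(w, a))  # input function: weighted sum of input activations
    return g(in_i)              # activation function "squashes" the sum

def step(x):
    """Hard-threshold activation."""
    return 1.0 if x >= 0 else 0.0

# a[0] = -1 is the fixed bias input; w[0] is the bias weight W_0,i
a = np.array([-1.0, 0.5, 0.8])
w = np.array([0.3, 1.0, 1.0])
out = unit_output(a, w, step)   # in_i = -0.3 + 0.5 + 0.8 = 1.0, so out = 1.0
```

Moving the bias weight w[0] shifts the threshold, as the next slide notes.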
Activation functions
[Figure: two activation functions, (a) a hard threshold (step) and (b) a sigmoid, each plotting g(ini) against ini and saturating at +1.]
Changing the bias weight W0,i moves the threshold location
Feed forward example
Expressiveness of perceptrons
Feed Forward Neural Networks
Hidden-Layer
How to learn the weights
Neural Networks
What is a Neuron?
McCulloch–Pitts “unit”
Activation functions
The purpose of the activation function is to add a non-linear transformation and, in some cases, squash the output to a specified range.
Commonly used Activation functions
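Sketches of the usual candidates; the definitions below are the standard ones (which specific functions the slide lists is not shown in the extracted text).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes output to (0, 1)

def tanh(x):
    return math.tanh(x)                  # squashes output to (-1, 1)

def relu(x):
    return max(0.0, x)                   # non-linear, but unbounded above

def step(x):
    return 1.0 if x >= 0 else 0.0        # hard threshold
```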
Perceptron
Perceptron: a 1-layer NN
Task: Classification
Perceptron: a 1-layer NN
Expressiveness of Perceptrons
Multi-layer Perceptrons
Feed Forward Neural Networks
Feed forward Neural Network Computation
Hidden-Layer
Solution to the Xor problem
Composition of Transformations
Distributed feature representation
Example: A useful feature transformation
Distributed feature representation
Feed forward Neural Network Example
A Toy Neural Network
Feed forward Neural Network Example
Hidden-Layer
Deep Neural Networks
Deep Neural Networks – hierarchical features
Shallow to Deep Neural Networks
TRAINING A NEURAL NETWORK
How to learn the weights ?
Input
(Feature Vector)
Output
(Label)
How to Train a Neural Net?
Forward Propagation:
- Pass in input
- Calculate each layer
- Get output
How to Train a Neural Net?
Feedforward Neural Network
Want:
Backward Pass
Backpropagation Formula
How to Train a Neural Net?
Going from Shallow to Deep Neural Networks
Computational Graph
Definition: a data structure that records the variables and operations used in a computation, so that gradients can be propagated backward through it.
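A minimal sketch of such a graph for y = a*b + c, with gradients propagated backward by the chain rule. The class and function names here are hypothetical, chosen for illustration.

```python
class Node:
    """One graph node: a value, its parent nodes, and local derivatives w.r.t. them."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value, self.parents, self.local_grads = value, parents, local_grads
        self.grad = 0.0

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def backward(node, upstream=1.0):
    """Accumulate upstream * local gradient into each ancestor (chain rule)."""
    node.grad += upstream
    for parent, g in zip(node.parents, node.local_grads):
        backward(parent, upstream * g)

a, b, c = Node(2.0), Node(3.0), Node(4.0)
y = add(mul(a, b), c)   # y = 2*3 + 4 = 10
backward(y)             # dy/da = 3, dy/db = 2, dy/dc = 1
```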
Backpropagation for neural nets
Given softmax activation, L2 loss, and a point (x1, x2, x3, y) = (0.1, 0.15, 0.2, 1), compute the gradient.
Backpropagation for neural nets: forward pass
Backpropagation for neural nets: backward pass
Computation Graphs
CONVOLUTIONAL NEURAL NETWORKS
Motivation – Image Data
Motivation
Motivation – Image Data
Motivation
Kernels
Kernel: 3x3 Example

Input:
 3  2  1
 1  2  3
 1  1  1

Kernel:
-1  0  1
-2  0  2
-1  0  1

Output (element-wise product, then sum):
(3)(-1) + (2)(0) + (1)(1) + (1)(-2) + (2)(0) + (3)(2) + (1)(-1) + (1)(0) + (1)(1) = 2
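The worked example above in code: multiply the 3x3 patch element-wise by the kernel and sum. (Note that deep-learning "convolution" is usually cross-correlation: the kernel is not flipped.)

```python
import numpy as np

patch = np.array([[3, 2, 1],
                  [1, 2, 3],
                  [1, 1, 1]])
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# Element-wise product, then sum over the window
output = int((patch * kernel).sum())   # -> 2
```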
Kernel: Example
Kernels as Feature Detectors
Kernels can be thought of as local feature detectors.
Vertical Line Detector:
-1  1 -1
-1  1 -1
-1  1 -1

Horizontal Line Detector:
-1 -1 -1
 1  1  1
-1 -1 -1

Corner Detector:
-1 -1 -1
-1  1  1
-1  1  1
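To see the "detector" behaviour, compare the vertical-line kernel's response on a patch containing a vertical stroke against a uniform patch. The patches here are made up for illustration.

```python
import numpy as np

vertical = np.array([[-1, 1, -1],
                     [-1, 1, -1],
                     [-1, 1, -1]])

line_patch = np.array([[0, 1, 0],    # a bright vertical stroke
                       [0, 1, 0],
                       [0, 1, 0]])
flat_patch = np.ones((3, 3))         # uniform region, no line

# Response = element-wise product summed over the patch
resp_line = int((vertical * line_patch).sum())   # strong positive response
resp_flat = int((vertical * flat_patch).sum())   # suppressed (negative) response
```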
Convolutional Neural Nets
Primary Ideas behind Convolutional Neural Networks:
Convolutions
Convolution Settings – Grid Size
Grid Size (Height and Width):
Height: 3, Width: 3
Height: 1, Width: 3
Height: 3, Width: 1
Convolution Settings - Padding
Padding
Without Padding
With Padding
Convolution Settings
Stride
Stride 2 Example – No Padding
[Figure: stride-2 convolution without padding; output values shown on the slide]
Stride 2 Example – With Padding
[Figure: stride-2 convolution with zero padding; output values shown on the slide]
Convolutional Settings - Depth
Convolutional Settings - Depth
Pooling
Pooling: Max-pool
Pooling: Average-pool
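Max-pool and average-pool can be sketched with the same helper; only the window operation changes. The 4x4 input values are made up for illustration, and the window is 2x2 with stride 2.

```python
import numpy as np

def pool2x2(x, op):
    """Apply op (e.g. np.max or np.mean) to each non-overlapping 2x2 window."""
    h, w = x.shape
    return np.array([[op(x[i:i+2, j:j+2]) for j in range(0, w, 2)]
                     for i in range(0, h, 2)])

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)

max_pooled = pool2x2(x, np.max)    # keeps the strongest activation per window
avg_pooled = pool2x2(x, np.mean)   # averages each window
```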
ConvNet: CONV, RELU, POOL and FC Layers
Convolution Layer
Convolution Layer
consider a second, green filter
Convolution Layer
ReLU (Rectified Linear Units) Layer
A Basic ConvNet
What is the convolution of an image with a filter?
Details about the convolution layer
Details about the convolution layer
Details about the convolution layer
Convolution layer examples
Pooling Layer
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
where ReLU is used as f.
Convolutional Neural Networks
+ ReLU
Kernel:
1  0  1
0  1  0
1  0  1

Convolutional Neural Networks
Applications
Applications
ConvNet: CONV, RELU, POOL and FC Layers
PyTorch Implementation
ConvNet: CONV, RELU, POOL and FC Layers
PyTorch Implementation
EVOLUTION OF MODEL ARCHITECTURES
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
ILSVRC
AlexNet
Architecture
CONV1 → MAX POOL1 → NORM1 → CONV2 → MAX POOL2 → NORM2 → CONV3 → CONV4 → CONV5 → MAX POOL3 → FC6 → FC7 → FC8
(N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96]
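The output-size formula above as a helper, checked against the CONV1 numbers from the slide:

```python
def conv_out_size(n, f, stride, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / stride + 1."""
    return (n - f + 2 * pad) // stride + 1

# AlexNet CONV1: 227x227 input, 11x11 filters, stride 4, no padding
conv1 = conv_out_size(227, 11, 4)   # 55, i.e. a 55x55x96 output volume
```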
AlexNet
Details/Retrospectives:
ILSVRC winners
VGGNet
Only 3x3 CONV filters (stride 1, pad 1) and 2x2 MAX POOL (stride 2)
AlexNet: 8 layers
VGGNet: 16 - 19 layers
Input
3x3 conv, 64
3x3 conv, 64 Pool 1/2
3x3 conv, 128
3x3 conv, 128 Pool 1/2
3x3 conv, 256
3x3 conv, 256 Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512 Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512 Pool 1/2
FC 4096
FC 4096
FC 1000
Softmax
VGGNet
[Simonyan and Zisserman, 2014]
Stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer.
But deeper, with more non-linearities.
And fewer parameters: 3 * (3^2 C^2) = 27C^2 vs. 7^2 C^2 = 49C^2 for C channels per layer.
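The parameter comparison above, counting weights only (no biases), for C input and C output channels per layer:

```python
def params_stack_of_three_3x3(c):
    """Three 3x3 conv layers, C channels in and out each: 3 * 3*3*C*C = 27 C^2."""
    return 3 * (3 * 3 * c * c)

def params_single_7x7(c):
    """One 7x7 conv layer, C channels in and out: 7*7*C*C = 49 C^2."""
    return 7 * 7 * c * c
```

So for any channel count the stack is cheaper, while also adding two extra non-linearities.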
ILSVRC winners
GoogleNet
GoogleNet
“Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other
[Szegedy et al., 2014]
[Figure: Inception module. From the previous layer, four parallel branches: 1x1 convolution; 1x1 convolution followed by 3x3 convolution; 1x1 convolution followed by 5x5 convolution; and 3x3 max pooling followed by 1x1 convolution. The branch outputs are joined by filter concatenation.]
ILSVRC winners
ResNet
[He et al., 2015]
ResNet
-> The deeper model performs worse (not caused by overfitting)!
ResNet
ResNet
Residual Block
Input x goes through a conv-relu-conv series, giving F(x). That result is then added back to the original input x; call the sum H(x) = F(x) + x.
In a traditional CNN, H(x) would simply be F(x). So instead of learning the full transformation from x to H(x), the block learns only the residual F(x) that must be added to the input x.
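A minimal sketch of the residual idea, with dense layers standing in for the convs (shapes and weights are illustrative). With F ≡ 0 the block is exactly the identity, which is why adding residual blocks cannot make the representable functions worse:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, with F = layer-relu-layer standing in for conv-relu-conv."""
    f = relu(x @ w1) @ w2   # F(x)
    return f + x            # skip connection adds the input back

x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))
y = residual_block(x, zero, zero)   # F == 0, so the block is the identity
```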
ResNet
Full ResNet architecture:
[He et al., 2015]
ResNet
[He et al., 2015]
ResNet
Experimental Results:
Among the best CNN architectures currently available, and a great innovation: the idea of residual learning.
Even better than human performance!
ILSVRC winners
ARCHITECTURES FOR ADVANCED APPLICATIONS
Computer Vision Tasks
- Classification: CAT (single label, no spatial extent)
- Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
- Object Detection: DOG, DOG, CAT (multiple objects)
- Instance Segmentation: DOG, DOG, CAT (multiple objects)
Semantic Segmentation
Fei-Fei Li, Ranjay Krishna, Danfei Xu, Lecture 15, May 20, 2021
Semantic Segmentation: The Problem
GRASS, CAT, TREE, SKY, ...
Paired training data: for each training image, each pixel is labeled with a semantic category.
At test time, classify each pixel of a new image.
Semantic Segmentation Idea: Sliding Window
Full image
Impossible to classify without context
Q: how do we include context?
Semantic Segmentation Idea: Sliding Window
Full image → extract a patch → classify the center pixel with a CNN (e.g., Cow, Grass, Cow).
Problem: Very inefficient! Not reusing shared features between overlapping patches.
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Semantic Segmentation Idea: Convolution
Full image
An intuitive idea: encode the entire image with conv net, and do semantic segmentation on top.
Problem: classification architectures often reduce feature spatial sizes to go deeper, but semantic segmentation requires the output size to be the same as input size.
Semantic Segmentation Idea: Fully Convolutional
Input: 3 x H x W → Conv → Conv → Conv → Conv → Convolutions: D x H x W → Scores: C x H x W → argmax → Predictions: H x W
Design a network with only convolutional layers, without downsampling operators, to make predictions for all pixels at once!
Problem: convolutions at the original image resolution will be very expensive ...
Semantic Segmentation Idea: Fully Convolutional
Input: 3 x H x W → Predictions: H x W
Design the network as a stack of convolutional layers, with downsampling and upsampling inside the network!
High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Scores: C x H x W
Downsampling: pooling, strided convolution. Upsampling: ???
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
OBJECT DETECTION
Object Detection
Object Detection: Single Object (Classification + Localization)
[Figure: image → CNN → 4096-d vector, followed by two fully connected heads]
- Fully Connected: 4096 to 1000 → Class Scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...)
- Fully Connected: 4096 to 4 → Box Coordinates (x, y, w, h)
Object Detection: Single Object (Classification + Localization)
Treat localization as a regression problem!
- Class scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...) vs. correct label (Cat): Softmax loss
- Box coordinates (x, y, w, h) vs. correct box (x’, y’, w’, h’): L2 loss
- Total loss = Softmax loss + L2 loss (Multitask Loss)
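The multitask loss above can be sketched directly; the scores, labels, and boxes below are made-up illustrative numbers, and the equal weighting of the two terms is an assumption (in practice a weighting hyperparameter is common).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def multitask_loss(scores, true_class, box_pred, box_true):
    cls_loss = float(-np.log(softmax(scores)[true_class]))  # softmax (cross-entropy) loss
    box_loss = float(np.sum((box_pred - box_true) ** 2))    # L2 loss on (x, y, w, h)
    return cls_loss + box_loss

scores   = np.array([3.0, 0.5, 0.1])          # cat, dog, car
box_pred = np.array([10.0, 20.0, 50.0, 40.0])
box_true = np.array([12.0, 20.0, 50.0, 40.0])
loss = multitask_loss(scores, 0, box_pred, box_true)
```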
Each image needs a different number of outputs!
- CAT: (x, y, w, h) → 4 numbers
- DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) → 12 numbers
- DUCK: (x, y, w, h), DUCK: (x, y, w, h), ... → many numbers!
Object Detection: Multiple Objects
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Problem: Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive!
Dog? NO. Cat? YES. Background? NO.
Region Proposals: Selective Search
Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
“Slow” R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
- Input image → regions of interest (RoI) from a proposal method (~2k)
- Warp each region to 224x224 pixels
- Forward each region through a ConvNet
- Classify regions with SVMs
- Bbox reg: predict “corrections” to the RoI, 4 numbers: (dx, dy, dw, dh)
“Slow” R-CNN
Problem: Very slow! Need to do ~2k independent forward passes for each image!
Idea: Pass the image through convnet before cropping! Crop the conv feature instead!
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
- Run the whole image through a “backbone” ConvNet (AlexNet, VGG, ResNet, etc.) to get “conv5” features
- Regions of Interest (RoIs) from a proposal method, with features cropped + resized per RoI
- Per-Region Network: Linear + softmax → object category; Linear → box offset
Object Detection: Faster R-CNN
ROI pooling / ROI Alignment creates fixed size feature maps from convolutional maps of the ROI.
Instance Segmentation: Mask R-CNN
Mask Prediction
He et al, “Mask R-CNN”, ICCV 2017
Add a small mask network that operates on each RoI and predicts a 28x28 binary mask
Mask R-CNN: Very Good Results!
He et al, “Mask R-CNN”, ICCV 2017
Summary: Lots of computer vision tasks!
- Classification: CAT (no spatial extent)
- Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
- Object Detection: DOG, DOG, CAT (multiple objects)
- Instance Segmentation: DOG, DOG, CAT (multiple objects)
AUTOENCODERS
Autoencoders
Autoencoders
Source: http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity
Why would we use autoencoders?
Why would we use autoencoders?
02.09.2014 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Nils Reimers |
Why would we use autoencoders?
Dimension reduction can simplify classification tasks – MNIST Task
Source: Erhan et al, 2010, Why Does Unsupervised Pre-training Help Deep Learning?
Autoencoders vs. PCA
Autoencoders
Encoder: h = f(W x + b)
Decoder: x̂ = g(W′ h + b′)
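A minimal training-step sketch for the encoder/decoder pair above, using a tanh encoder, a linear decoder, and squared reconstruction error. The input, initialization, and learning rate are illustrative; biases are omitted for brevity.

```python
import numpy as np

def train_step(x, W_enc, W_dec, lr=0.1):
    h = np.tanh(W_enc @ x)               # encoder: h = f(W x)
    x_hat = W_dec @ h                    # decoder (linear): x_hat = W' h
    err = x_hat - x
    loss = 0.5 * float((err ** 2).sum()) # reconstruction loss: 1/2 ||x_hat - x||^2
    gW_dec = np.outer(err, h)            # gradient w.r.t. decoder weights
    gW_enc = np.outer((W_dec.T @ err) * (1 - h ** 2), x)  # backprop through tanh
    return W_enc - lr * gW_enc, W_dec - lr * gW_dec, loss

x = np.array([1.0, 0.0, 0.0, 1.0])
W_enc = np.full((2, 4), 0.1)             # 4 -> 2 bottleneck forces compression
W_dec = np.full((4, 2), 0.1)
losses = []
for _ in range(50):
    W_enc, W_dec, loss = train_step(x, W_enc, W_dec)
    losses.append(loss)
```

The bottleneck (2 hidden units for 4 inputs) is what prevents a trivial copy of the input.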
Autoencoders vs. PCA - Example
LSA
Deep Autoencoder
Source: Hinton et al., Reducing the Dimensionality of Data with Neural Networks
How to ensure the encoder does not learn the identity function?
Identity Function
Denoising Encoder
Stacking Autoencoders
Step 1: Train single layer autoencoder until convergence
Stacking Autoencoders
Step 2: Add an additional hidden layer and train it to reconstruct the output of the previous hidden layer. Previous layers will not be changed.
Stacking Autoencoders – Fine-tuning
Unsupervised Fine-Tuning:
Supervised Fine-Tuning:
Stacking Autoencoders - Example
Pretrain first autoencoder
Source: http://ufldl.stanford.edu/wiki/
Stacking Autoencoders - Example
Pretrain second autoencoder
Source: http://ufldl.stanford.edu/wiki/
Stacking Autoencoders - Example
Pretrain softmax layer
Source: http://ufldl.stanford.edu/wiki/
Stacking Autoencoders - Example
Fine-tuning
Source: http://ufldl.stanford.edu/wiki/
Is pre-training really necessary?
Is pre-training really necessary?
Dropout in Neural Networks
26.10.2015 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Nils Reimers |
Inspired by Hinton https://www.youtube.com/watch?v=vShMxxqtDDs
For details:
Srivastava, Hinton et al., 2014, Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Ensemble Learning
prediction
Model Averaging with Neural Nets
Dropout
Img source: http://cs231n.github.io/
Dropout
regularized
Dropout – at test time
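A sketch of "inverted" dropout, which scales kept activations by 1/p at training time so that no rescaling is needed at test time. (The slides' exact convention is not shown in the extracted text; scaling the weights by p at test time is the equivalent alternative.)

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_keep, rng):
    """Zero each unit with probability 1 - p_keep; scale survivors by 1/p_keep."""
    mask = (rng.random(h.shape) < p_keep) / p_keep
    return h * mask

def dropout_test(h):
    return h   # all units active; inverted dropout needs no test-time scaling

h = np.ones(100_000)
dropped = dropout_train(h, 0.5, rng)
mean_activation = float(dropped.mean())   # close to 1: expectation is preserved
```

Because the expected activation matches between train and test, the test-time network approximates averaging over the exponentially many "thinned" networks sampled during training.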
How well does dropout work?
Source: Srivastava et al, 2014, Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Classification error on MNIST dataset
How well does dropout work?
Another way to think about Dropout
Dropout discourages co-adaptation: a hidden unit cannot rely on which other hidden units are present.