1 of 83

ME 5990

Image Recognition: Convolution and Pooling

Partial slides by Gang Hua, Wormpex AI

2 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

3 of 83

AlexNet

[Figure: AlexNet maps an input image to class probabilities (e.g., [0.86, 0.14]); a loss function compares the predictions with the training labels (e.g., [0.14, 0.86]) to produce the loss]

4 of 83

Special effects: shape and motion capture

Source: S. Seitz

5 of 83

3D visualization: Microsoft Photosynth

Source: S. Seitz

6 of 83

Optical character recognition (OCR)

Automatic check processing

Source: S. Seitz

7 of 83

Biometrics

Fingerprint scanners on many new laptops and other devices

Face recognition systems now beginning to appear more widely (http://www.sensiblevision.com/)

Source: S. Seitz

8 of 83

Biometrics

Source: S. Seitz

9 of 83

Mobile visual search

Lincoln, Microsoft Research

10 of 83

Face detection

  • Many new digital cameras now detect faces
    • Canon, Sony, Fuji, …

Source: S. Seitz

11 of 83

Smile detection

Source: S. Seitz

12 of 83

Face annotation

Windows Live Photo Gallery

13 of 83

Automotive safety

  • Mobileye: Vision systems in high-end BMW, GM, Volvo models
    • Pedestrian collision warning
    • Forward collision warning
    • Lane departure warning
    • Headway monitoring and warning

Source: A. Shashua, S. Seitz

14 of 83

Vision for robotics, space exploration

Vision systems (JPL) used for several tasks

    • Panorama stitching
    • 3D terrain modeling
    • Obstacle detection, position tracking
    • For more, read “Computer Vision on Mars” by Matthies et al.

NASA's Mars Exploration Rover Spirit captured this westward view from atop a low plateau where Spirit spent the closing months of 2007.

Source: S. Seitz

15 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

16 of 83

Image Format: Very Brief Idea

  • Most color images consist of three channels: R, G, B

  • Most image formats (JPEG, TIFF, etc.) are compressed representations of RGB images

https://pursuit.unimelb.edu.au/articles/it-s-time-to-retire-lena-from-computer-science

https://www.geeksforgeeks.org/matlab-rgb-image-representation/

17 of 83

Image Processing: Convolution

  • Very old motivation: how can we denoise an image?

18 of 83

Convolution: what we learned previously

  • 1D discrete convolution: (f * g)[n] = Σ_m f[m] g[n − m]

19 of 83

Convolution

  • Let f be the image and g be the kernel. The output of convolving f with g is denoted f * g.


Source: F. Durand

20 of 83

Convolution

  • 2D convolution: (f * g)(i, j) = Σ_m Σ_n f(m, n) g(i − m, j − n)
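As a concrete illustration, the 2D case can be sketched in pure Python (the function name `conv2d_valid` is mine, not from the slides). Note that true convolution flips the kernel; deep-learning libraries usually compute the unflipped variant (cross-correlation) instead:

```python
def conv2d_valid(f, g):
    """'Valid' 2D convolution of image f with kernel g (nested lists)."""
    H, W = len(f), len(f[0])
    kH, kW = len(g), len(g[0])
    out = []
    for i in range(H - kH + 1):
        row = []
        for j in range(W - kW + 1):
            # true convolution indexes the kernel flipped in both axes
            s = sum(f[i + m][j + n] * g[kH - 1 - m][kW - 1 - n]
                    for m in range(kH) for n in range(kW))
            row.append(s)
        out.append(row)
    return out
```

With a symmetric 2 x 2 averaging kernel, each output pixel is the mean of a 2 x 2 neighborhood, so the flip makes no difference.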

21 of 83

Demonstration

  •  

22 of 83

Demonstration

  •  

23 of 83

Annoying details

  • What is the size of the output?

[Figure: kernel g slid across image f in three output modes: full, same, valid]

24 of 83

Annoying Details

  • Convolution is simple
    • But it has a lot of details
  • Reference reading:
    • https://www.baeldung.com/cs/convolutional-layer-size

25 of 83

Stride

  • With stride, the kernel can jump several pixels at a time instead of sliding one pixel per step

26 of 83

Padding

  • To maintain the output size, we can "pad" the input border (typically with zeros)

27 of 83

The ultimate equation

  • Output size = ⌊(n + 2p − f) / s⌋ + 1, where n is the input size, f the kernel size, p the padding, and s the stride
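The output-size rule is easy to wrap in a helper (the name `conv_output_size` is mine); the values below come from the worked example later in the deck:

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# conv1 of the later example: 227 input, 11 x 11 kernel, padding 0, stride 4
print(conv_output_size(227, 11, p=0, s=4))
```

With p = 0 and s = 1 this reduces to the familiar "valid" size n − f + 1.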

28 of 83

Convolution Application

  • People design various kernels (g) for various purposes:
  • Shifting:

  • Average blurring:

29 of 83

Convolution Application

  • People design various kernels (g) for various purposes:
  • Sharpening:

30 of 83

Sharpening

  • What does blurring take away?

original − smoothed (5x5) = detail

Let's add it back:

original + detail = sharpened
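The detail-and-add-back recipe can be written as unsharp masking: sharpened = original + (original − blurred) = 2·original − blurred. A minimal sketch, assuming the blurred image has already been computed (the helper name `unsharp` is mine):

```python
def unsharp(f, blurred):
    """Sharpen by adding back the detail (original - blurred) to the original."""
    H, W = len(f), len(f[0])
    return [[2 * f[i][j] - blurred[i][j] for j in range(W)]
            for i in range(H)]
```

A flat region (original equals blurred) is unchanged; wherever blurring removed intensity, the difference is amplified.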

31 of 83

Gaussian Filter

  • We can blur an image using a Gaussian filter

0.003 0.013 0.022 0.013 0.003

0.013 0.059 0.097 0.059 0.013

0.022 0.097 0.159 0.097 0.022

0.013 0.059 0.097 0.059 0.013

0.003 0.013 0.022 0.013 0.003
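A kernel like the table above can be generated by sampling a 2D Gaussian and normalizing so the weights sum to 1 (the function name `gaussian_kernel` is mine; with σ = 1 the values come out close to, but not exactly equal to, the table, which lists unnormalized continuous-Gaussian values):

```python
import math

def gaussian_kernel(size, sigma):
    """size x size Gaussian kernel, normalized to sum to 1."""
    c = size // 2  # center index
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]
```

Normalization keeps the overall image brightness unchanged after blurring.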

32 of 83

Gaussian Blur

  • Gaussian Blur vs. Box Blur

Box Blur

Gaussian Blur

33 of 83

Convolution for Gradient

  • Finite difference filters:

34 of 83

Edge Detection

Original

Canny Edge Detection

35 of 83

About Convolution

  • How do we know which kernel we should use? 
  • Is there any way to let computers find the filters for us? (Train the filters)

  • Can we apply them to color images?

The dress: what color is it?

https://www.youtube.com/watch?v=n9fwiNyDHLI

36 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

37 of 83

Tensor

  • In mathematics, a tensor is an algebraic object that describes a (multilinear) relationship between sets of algebraic objects related to a vector space.
  • You may consider a matrix a 2D tensor
  • You may consider an RGB image a 3D tensor

Cauchy Stress Tensor

Image tensor, with dimensions 255 x 255 x 3

38 of 83

Convolutions Over Volumes

  • 2D Convolution can be applied over volume:

https://youtu.be/KTB_OFoAQcc

39 of 83

2D convolution on volume

  • If we have an RGB image with (255 x 255 x 3), which of the following kernel dimensions is valid for a 2D convolution of the image?

  1. 3 x 3 x 3
  2. 255 X 255 X 1
  3. 3 X 3 X 5
  4. None of above

40 of 83

2D convolution on volume

  • If we have an RGB image with (255 x 255 x 3), which of the following kernel dimensions is valid for a 2D convolution of the image?

  1. 3 x 3 x 3
  2. 255 X 255 X 1
  3. 3 X 3 X 5
  4. None of above

Solution: option 1. The 3rd dimension (depth) of the kernel must match the 3rd dimension of the input.

41 of 83

2D convolution on volume

  • If we have a tensor (255 x 255 x 10),
    • We want to apply "valid" type 2D convolution (i.e., no padding)
    • We apply a 3 x 3 x n filter
  • What is the value of n?
  • What is the dimension of the output?

42 of 83

2D convolution on volume

  • n = 10, since the kernel depth must match the input depth. Output: (255 − 3 + 1) x (255 − 3 + 1) x 1 = 253 x 253 x 1

43 of 83

Multiple Filters

  • We can apply multiple filters of the same dimensions. We usually denote the number of filters as the 4th dimension of the filter tensor

44 of 83

Multiple Filters

  • Another visualization of convolution

https://cs231n.github.io/convolutional-networks/

45 of 83

Multiple Filters

  • If we have a tensor 255 x 255 x 3 and we want to apply a filter 3 x 3 x n x 64, using the "same" type (i.e., the output shall have the same width and height as the input), with stride = 1
  • What is the value of n?
  • What is the size of padding?
  • What is the output dimension?

46 of 83

Multiple Filters

  • n = 3 (matches the input depth); padding = 1, since 255 + 2×1 − 3 + 1 = 255; output: 255 x 255 x 64

47 of 83

Bias

  • Usually when we say we have a "convolution layer", we will have a convolution mask and a bias value (b) for each filter
  • If we have n filters, we will have n biases

48 of 83

No. of parameters

  • If we have an image with dimension 255 x 255 x 3, assume we use “valid” type convolution. The kernel dimension is 3 x 3 x 3 x 64.
  • How many parameters are there in this filter? (Including bias)

  1. 3
  2. 3 x 3 x 3 = 27
  3. 3 x 3 x 3 x 64 = 1728
  4. 3 x 3 x 3 x 64 + 64 = 1792

49 of 83

No. of parameters

  • If we have an image with dimension 255 x 255 x 3, assume we use “valid” type convolution. The kernel dimension is 3 x 3 x 3 x 64.
  • How many parameters are there in this filter? (Including bias)

  1. 3
  2. 3 x 3 x 3 = 27
  3. 3 x 3 x 3 x 64 = 1728
  4. 3 x 3 x 3 x 64 + 64 = 1792

Solution: option 4 (1792)
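The counting rule behind this answer (weights plus one bias per filter) fits in a one-line helper; the function name `conv_params` is mine:

```python
def conv_params(kh, kw, c_in, n_filters):
    """Parameters in a conv layer: kh*kw*c_in weights per filter, plus one bias each."""
    return kh * kw * c_in * n_filters + n_filters

print(conv_params(3, 3, 3, 64))  # the quiz's 3 x 3 x 3 x 64 kernel
```

Note the input's 255 x 255 spatial size never enters the count: convolution weights are shared across all positions.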

50 of 83

Summary

  •  

51 of 83

Summary

  •  

52 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

53 of 83

Pooling Function

  • Reading: CS231n Convolutional Neural Networks for Visual Recognition
  • A common way to reduce the dimension of a tensor
    • Max pooling: take the largest value inside the window, for each channel at each stride
    • Average pooling: take the average value inside the window, for each channel at each stride

54 of 83

Max pooling

  • Pooling is always applied to one slice (one channel) at a time
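A minimal sketch of max pooling on a single channel (pure Python; the name `max_pool2d` is mine):

```python
def max_pool2d(x, size=2, stride=2):
    """Max pooling over one channel (a 2D nested list)."""
    H, W = len(x), len(x[0])
    return [[max(x[i + m][j + n] for m in range(size) for n in range(size))
             for j in range(0, W - size + 1, stride)]
            for i in range(0, H - size + 1, stride)]

# a 4 x 4 slice becomes 2 x 2: each output is the max of one 2 x 2 window
print(max_pool2d([[1, 3, 2, 1], [4, 6, 5, 0], [3, 1, 1, 2], [7, 2, 4, 3]]))
```

Average pooling replaces `max(...)` with the mean over the same window; pooling has no trainable parameters.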

55 of 83

Pooling function

  • Pooling output size follows the same rule as convolution with no padding: ⌊(n − f)/s⌋ + 1, applied per channel

56 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

57 of 83

Typical convolutional neural network

  • Convolution layers, repeated, to "extract features"
  • Pooling layers, repeated, to "downsize the sample"
  • (It is common to alternate convolution and pooling)
  • Fully connected layers, repeated, to "train the weights"

  • Lots of hyperparameters to tune
  • Tons of parameters to be trained (luckily not by you)

58 of 83

Typical network arrangement

 

2D convolution -> activation (repeat)

Pooling -> activation (repeat)

Flatten

Fully connected -> activation (repeat)

Softmax

Output: one-hot
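The final softmax step in the arrangement above converts raw scores into probabilities, e^{z_i} / Σ_j e^{z_j}. A minimal sketch (the name `softmax` is mine):

```python
import math

def softmax(z):
    """Turn a score vector into probabilities that sum to 1."""
    m = max(z)                         # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]
```

The one-hot training label is then compared against this probability vector by the loss (typically cross-entropy).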

59 of 83

Flatten Layer

  • Simply converts a tensor of any dimension into a vector
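In pure Python this is a one-liner over nested lists (the name `flatten` is mine):

```python
def flatten(t):
    """Recursively flatten a nested-list tensor into a flat vector."""
    if isinstance(t, list):
        return [v for sub in t for v in flatten(sub)]
    return [t]

# a 2 x 2 x 2 tensor becomes a length-8 vector
print(flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]))
```

A flatten layer has no parameters; it only reshapes, so a 5 x 5 x 48 tensor becomes a 1200-vector.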

60 of 83

Network Dimension

  • While ChatGPT can help you figure out dimension issues, you should be able to work them out yourself
  • Suppose we have an image with dimensions 227 x 227 x 3
  • Going through the network on the right, figure out the values of all '?' markers
  • How many parameters are in this model?

Input 227 x 227 x 3

h1: ? x ? x ?

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: ? x ? x ?

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

61 of 83

Network Dimension

Input 227 x 227 x 3

h1: ? x ? x ?

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: ? x ? x ?

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

h4: ? x ? x ?

conv3: 3 x 3 x ? x 48, padding: 0, stride: 1, ReLU activation

h5: ? x ?

flatten

h6: ? x ?

FC: ? x 1024, sigmoid

FC: ? x ?, sigmoid

Output before softmax h7: ? x ?

Output in one-hot: ? x ?

Softmax

62 of 83

Network Dimension

  • h1: ⌊(227 + 2×0 − 11)/4⌋ + 1 = 55, and conv1 has 96 filters, so h1 = 55 x 55 x 96

Input 227 x 227 x 3

h1: 55 x 55 x 96

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: ? x ? x ?

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

63 of 83

Network Dimension

  • h2: ⌊(55 + 2×2 − 5)/4⌋ + 1 = ⌊13.5⌋ + 1 = 14, so h2 = 14 x 14 x 96

Input 227 x 227 x 3

h1: 55 x 55 x 96

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: 14 x 14 x 96

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

64 of 83

Network Dimension

  • h3: max pooling 2 x 2 with stride 2 gives ⌊(14 − 2)/2⌋ + 1 = 7, so h3 = 7 x 7 x 96

Input 227 x 227 x 3

h1: 55 x 55 x 96

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: 14 x 14 x 96

conv2: 5 x 5 x 96 x 96, stride: 4, padding: 2, ReLU activation

h3: 7 x 7 x 96

max pooling 2 x 2, stride: 2, ReLU activation

65 of 83

Network Dimension

  • h4: (7 + 2×0 − 3)/1 + 1 = 5, and conv3 has 48 filters, so h4 = 5 x 5 x 48

h3: 7 x 7 x 96

h4: 5 x 5 x 48

conv3: 3 x 3 x 96 x 48, padding: 0, stride: 1, ReLU activation

h5: ? x ?

flatten

h6: ? x ?

FC: ? x 1024, sigmoid

FC: ? x ?, sigmoid

Output before softmax h7: ? x ?

Output in one-hot: ? x ?

Softmax

66 of 83

Network Dimension

  • h5: flattening 5 x 5 x 48 gives 5 × 5 × 48 = 1200, so h5 = 1200 x 1; FC1 then outputs h6 = 1024 x 1

h3: 7 x 7 x 96

h4: 5 x 5 x 48

conv3: 3 x 3 x 96 x 48, padding: 0, stride: 1, ReLU activation

h5: 1200 x 1

flatten

h6: 1024 x 1

FC1: ? x 1024, sigmoid

FC2: ? x ?, sigmoid

Output before softmax h7: ? x ?

Output in one-hot: ? x ?

Softmax

67 of 83

Network Dimension

  • FC1 maps 1200 to 1024 (weight matrix 1200 x 1024); FC2 maps 1024 to the 1000 classes (weight matrix 1024 x 1000)

h3: 7 x 7 x 96

h4: 5 x 5 x 48

conv3: 3 x 3 x 96 x 48, padding: 0, stride: 1, ReLU activation

h5: 1200 x 1

flatten

h6: 1024 x 1

FC1: 1200 x 1024, sigmoid

FC2: 1024 x 1000, sigmoid

Output before softmax h7: 1000 x 1

Output in one-hot: 1000 x 1

Softmax

68 of 83

Network Dimension

  • The total number of parameters to be trained is 2,561,784

  • This is small by modern standards

Function (layer)    # Parameters
conv1               34,944
conv2               230,496
max pooling         0
conv3               41,520
flatten             0
fc1                 1,229,824
fc2                 1,025,000
SoftMax             0
Sum                 2,561,784
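The whole worked example can be checked with a short script (pure Python; the variable names are mine). Each dimension uses the output-size rule ⌊(n + 2p − f)/s⌋ + 1, and each layer's parameter count is weights plus biases:

```python
import math

def out_size(n, f, p, s):
    """Output width/height: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

h1 = out_size(227, 11, 0, 4)   # conv1: 11 x 11 x 3 x 96, stride 4, padding 0
h2 = out_size(h1, 5, 2, 4)     # conv2: 5 x 5 x 96 x 96, stride 4, padding 2
h3 = out_size(h2, 2, 0, 2)     # max pooling 2 x 2, stride 2
h4 = out_size(h3, 3, 0, 1)     # conv3: 3 x 3 x 96 x 48, stride 1, padding 0
flat = h4 * h4 * 48            # flatten 5 x 5 x 48 -> 1200

params = (
    11 * 11 * 3 * 96 + 96      # conv1
    + 5 * 5 * 96 * 96 + 96     # conv2
    + 3 * 3 * 96 * 48 + 48     # conv3
    + flat * 1024 + 1024       # fc1: 1200 -> 1024
    + 1024 * 1000 + 1000       # fc2: 1024 -> 1000
)
print(h1, h2, h3, h4, flat, params)
```

The fully connected layers dominate the count, which is typical of early CNN designs.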

69 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

70 of 83

LeNet

  • One of the very first CNNs, designed for MNIST

https://www.researchgate.net/figure/The-LeNet-5-Architecture-a-convolutional-neural-network_fig4_321586653

71 of 83

AlexNet Discussion

  • The input is a 227 x 227 image with 3 channels
  • The output is 1000 classes
  • 62,378,344 parameters

https://medium.com/analytics-vidhya/concept-of-alexnet-convolutional-neural-network-6e73b4f9ee30

72 of 83

PyTorch Discussion: AlexNet Definition

73 of 83

PyTorch Demonstration

  • https://github.com/pytorch/vision/blob/master/torchvision/models/alexnet.py
  • What has happened?
    • Someone has trained the AlexNet shown before
    • Then we load a dog image, resize it to 256 x 256, and center-crop to 224 x 224
    • We have 1000 classes, which is why we see a 1000-dimensional vector displayed (we actually see two; what is the difference between them?)
    • Then we match the class IDs and output the top-5 most probable classes

74 of 83

VGG

  • conv -> max pooling -> conv -> max pooling ...
  • VGG-16: 138 million parameters

75 of 83

VGG-19

  • A deeper variant that followed VGG-16
  • 144 million parameters
  • 19.6 billion floating-point operations (FLOPs)

Image from GeeksforGeeks.

76 of 83

ResNET

  • Key idea: a residual block learns a residual function F(x) and outputs F(x) + x through a skip (shortcut) connection, which makes very deep networks trainable
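The skip connection can be sketched in plain Python, treating a block as a function and the activations as flat vectors (the helper name `residual` is mine; real ResNets do this on tensors with conv layers as F):

```python
def residual(block, x):
    """Residual connection: output = F(x) + x, elementwise."""
    fx = block(x)                         # F(x): the learned residual
    return [a + b for a, b in zip(fx, x)] # add the identity shortcut
```

Because the identity path is always present, the block only needs to learn the difference from the identity mapping, which is what eases optimization of very deep stacks.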

77 of 83

ResNet

  • A 34-layer ResNet is used
  • Top-5 accuracy: 96.47%
  • 25.6 million parameters
  • 3.6 billion floating-point operations (FLOPs)

78 of 83

ResNet

  • There are many variants

https://arxiv.org/pdf/1512.03385

79 of 83

PyTorch Implementation for Residual block

  • PyTorch implementation


80 of 83

Residual Block Implementation in PyTorch

81 of 83

Depth-wise separable convolution
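Depth-wise separable convolution splits a standard convolution into a per-channel (depthwise) k x k filter followed by a 1 x 1 (pointwise) convolution across channels. Its main appeal is the parameter and FLOP savings, which a quick count shows (helper names are mine; biases omitted for simplicity):

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no biases)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise: one k x k filter per input channel; pointwise: 1 x 1 conv."""
    return k * k * c_in + c_in * c_out

# compare on the earlier conv3 shape: 3 x 3, 96 in, 48 out
print(standard_conv_params(3, 96, 48), depthwise_separable_params(3, 96, 48))
```

For a 3 x 3, 96-to-48 layer the separable version needs roughly 7.5x fewer weights, which is why architectures like MobileNet build on it.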

82 of 83

PNAS Net

  • Progressive Neural Architecture Search (PNASNet)
  • A PNAS cell is defined

https://sh-tsang.medium.com/reading-pnasnet-progressive-neural-architecture-search-image-classification-1beb1de06fe6

83 of 83

Summary

  • ImageNet accuracy

https://paperswithcode.com/sota/image-classification-on-imagenet