1 of 83

ME 5990

Image Recognition: Convolution and Pooling

Partial slides by Gang Hua, Wormpex AI

2 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

3 of 83

AlexNet

[Figure: AlexNet maps an input image to class probabilities (e.g., [0.86, 0.14]); a loss function compares the predictions with the training labels (e.g., [0.14, 0.86]) to produce the loss]

4 of 83

Special effects: shape and motion capture

Source: S. Seitz

5 of 83

3D visualization: Microsoft Photosynth

Source: S. Seitz

6 of 83

Optical character recognition (OCR)

Automatic check processing

Source: S. Seitz

7 of 83

Biometrics

Fingerprint scanners on many new laptops and other devices

Face recognition systems now beginning to appear more widely (http://www.sensiblevision.com/)

Source: S. Seitz

8 of 83

Biometrics

Source: S. Seitz

9 of 83

Mobile visual search

Lincoln, Microsoft Research

10 of 83

Face detection

  • Many new digital cameras now detect faces
    • Canon, Sony, Fuji, …

Source: S. Seitz

11 of 83

Smile detection

Source: S. Seitz

12 of 83

Face annotation

Windows Live Photo Gallery

13 of 83

Automotive safety

  • Mobileye: Vision systems in high-end BMW, GM, Volvo models
    • Pedestrian collision warning
    • Forward collision warning
    • Lane departure warning
    • Headway monitoring and warning

Source: A. Shashua, S. Seitz

14 of 83

Vision for robotics, space exploration

Vision systems (JPL) used for several tasks

    • Panorama stitching
    • 3D terrain modeling
    • Obstacle detection, position tracking
    • For more, read “Computer Vision on Mars” by Matthies et al.

NASA's Mars Exploration Rover Spirit captured this westward view from atop a low plateau where Spirit spent the closing months of 2007.

Source: S. Seitz

15 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

16 of 83

Image Format: Very Brief Idea

  • Most color images consist of three channels: R, G, B

  • Most image formats (JPEG, TIFF, etc.) are compressed representations of RGB images

https://pursuit.unimelb.edu.au/articles/it-s-time-to-retire-lena-from-computer-science

https://www.geeksforgeeks.org/matlab-rgb-image-representation/

17 of 83

Image Processing: Convolution

  • Very old motivation: how can we denoise an image?

18 of 83

Convolution: what we learned previously

  • 1D discrete convolution: (f * g)[n] = Σ_m f[m] g[n − m]

19 of 83

Convolution

  • Let f be the image and g be the kernel. The output of convolving f with g is denoted f * g.


Source: F. Durand

20 of 83

Convolution

  • 2D convolution: (f * g)(i, j) = Σ_m Σ_n f(m, n) g(i − m, j − n)
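As a concrete illustration, the 2D case can be sketched in pure Python (the function name `conv2d_valid` is mine, not from the slides). Note that true convolution flips the kernel; deep-learning libraries usually compute the unflipped variant (cross-correlation) instead:

```python
def conv2d_valid(f, g):
    """'Valid' 2D convolution of image f with kernel g (nested lists)."""
    H, W = len(f), len(f[0])
    kH, kW = len(g), len(g[0])
    out = []
    for i in range(H - kH + 1):
        row = []
        for j in range(W - kW + 1):
            # true convolution indexes the kernel flipped in both axes
            s = sum(f[i + m][j + n] * g[kH - 1 - m][kW - 1 - n]
                    for m in range(kH) for n in range(kW))
            row.append(s)
        out.append(row)
    return out
```

With a symmetric 2 x 2 averaging kernel, each output pixel is the mean of a 2 x 2 neighborhood, so the flip makes no difference.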

21 of 83

Demonstration

  •  

22 of 83

Demonstration

  •  

23 of 83

Annoying details

  • What is the size of the output?

[Figure: kernel g slid across image f in three output modes: full, same, valid]

24 of 83

Annoying Details

  • Convolution is simple
    • But it has a lot of details
  • Reference reading:
    • https://www.baeldung.com/cs/convolutional-layer-size

25 of 83

Stride

  • With stride, the kernel can jump several pixels at a time instead of sliding one pixel per step

26 of 83

Padding

  • To maintain the output size, we can "pad" the input border (typically with zeros)

27 of 83

The ultimate equation

  • Output size = ⌊(n + 2p − f) / s⌋ + 1, where n is the input size, f the kernel size, p the padding, and s the stride
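The output-size rule is easy to wrap in a helper (the name `conv_output_size` is mine); the values below come from the worked example later in the deck:

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# conv1 of the later example: 227 input, 11 x 11 kernel, padding 0, stride 4
print(conv_output_size(227, 11, p=0, s=4))
```

With p = 0 and s = 1 this reduces to the familiar "valid" size n − f + 1.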

28 of 83

Convolution Application

  • People design various kernels (g) for various purposes:
  • Shifting:

  • Average blurring:

29 of 83

Convolution Application

  • People design various kernels (g) for various purposes:
  • Sharpening:

30 of 83

Sharpening

  • What does blurring take away?

original − smoothed (5x5) = detail

Let's add it back:

original + detail = sharpened
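The detail-and-add-back recipe can be written as unsharp masking: sharpened = original + (original − blurred) = 2·original − blurred. A minimal sketch, assuming the blurred image has already been computed (the helper name `unsharp` is mine):

```python
def unsharp(f, blurred):
    """Sharpen by adding back the detail (original - blurred) to the original."""
    H, W = len(f), len(f[0])
    return [[2 * f[i][j] - blurred[i][j] for j in range(W)]
            for i in range(H)]
```

A flat region (original equals blurred) is unchanged; wherever blurring removed intensity, the difference is amplified.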

31 of 83

Gaussian Filter

  • We can blur an image using a Gaussian filter

0.003 0.013 0.022 0.013 0.003

0.013 0.059 0.097 0.059 0.013

0.022 0.097 0.159 0.097 0.022

0.013 0.059 0.097 0.059 0.013

0.003 0.013 0.022 0.013 0.003
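A kernel like the table above can be generated by sampling a 2D Gaussian and normalizing so the weights sum to 1 (the function name `gaussian_kernel` is mine; with σ = 1 the values come out close to, but not exactly equal to, the table, which lists unnormalized continuous-Gaussian values):

```python
import math

def gaussian_kernel(size, sigma):
    """size x size Gaussian kernel, normalized to sum to 1."""
    c = size // 2  # center index
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]
```

Normalization keeps the overall image brightness unchanged after blurring.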

32 of 83

Gaussian Blur

  • Gaussian Blur vs. Box Blur

Box Blur

Gaussian Blur

33 of 83

Convolution for Gradient

  • Finite difference filters:

34 of 83

Edge Detection

Original

Canny Edge Detection

35 of 83

About Convolution

  • How do we know which kernel we should use? 
  • Is there any way to let computers find the filters for us? (Train the filters)

  • Can we apply them to color images?

The dress: what color is it?

https://www.youtube.com/watch?v=n9fwiNyDHLI

36 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

37 of 83

Tensor

  • In mathematics, a tensor is an algebraic object that describes a (multilinear) relationship between sets of algebraic objects related to a vector space.
  • You may consider a matrix a 2D tensor
  • You may consider an RGB image a 3D tensor

Cauchy Stress Tensor

Image tensor, with dimensions 255 x 255 x 3

38 of 83

Convolutions Over Volumes

  • 2D Convolution can be applied over volume:

https://youtu.be/KTB_OFoAQcc

39 of 83

2D convolution on volume

  • If we have an RGB image with (255 x 255 x 3), which of the following kernel dimensions is valid for a 2D convolution of the image?

  1. 3 x 3 x 3
  2. 255 X 255 X 1
  3. 3 X 3 X 5
  4. None of above

40 of 83

2D convolution on volume

  • If we have an RGB image with (255 x 255 x 3), which of the following kernel dimensions is valid for a 2D convolution of the image?

  1. 3 x 3 x 3
  2. 255 X 255 X 1
  3. 3 X 3 X 5
  4. None of above

Solution: option 1. The 3rd dimension (depth) of the kernel must match the 3rd dimension of the input.

41 of 83

2D convolution on volume

  • If we have a tensor (255 x 255 x 10),
    • We want to apply "valid" type 2D convolution (i.e., no padding)
    • We apply a 3 x 3 x n filter
  • What is the value of n?
  • What is the dimension of the output?

42 of 83

2D convolution on volume

  • n = 10, since the kernel depth must match the input depth. Output: (255 − 3 + 1) x (255 − 3 + 1) x 1 = 253 x 253 x 1

43 of 83

Multiple Filters

  • We can apply multiple filters of the same dimensions. We usually denote the number of filters as the 4th dimension of the filter tensor

44 of 83

Multiple Filters

  • Another visualization of convolution

https://cs231n.github.io/convolutional-networks/

45 of 83

Multiple Filters

  • If we have a tensor 255 x 255 x 3 and we want to apply a filter 3 x 3 x n x 64, using the "same" type (i.e., the output shall have the same width and height as the input), with stride = 1
  • What is the value of n?
  • What is the size of padding?
  • What is the output dimension?

46 of 83

Multiple Filters

  • n = 3 (matches the input depth); padding = 1, since 255 + 2×1 − 3 + 1 = 255; output: 255 x 255 x 64

47 of 83

Bias

  • Usually when we say we have a "convolution layer", we will have a convolution mask and a bias value (b) for each filter
  • If we have n filters, we will have n biases

48 of 83

No. of parameters

  • If we have an image with dimension 255 x 255 x 3, assume we use “valid” type convolution. The kernel dimension is 3 x 3 x 3 x 64.
  • How many parameters are there in this filter? (Including bias)

  1. 3
  2. 3 x 3 x 3 = 27
  3. 3 x 3 x 3 x 64 = 1728
  4. 3 x 3 x 3 x 64 + 64 = 1792

49 of 83

No. of parameters

  • If we have an image with dimension 255 x 255 x 3, assume we use “valid” type convolution. The kernel dimension is 3 x 3 x 3 x 64.
  • How many parameters are there in this filter? (Including bias)

  1. 3
  2. 3 x 3 x 3 = 27
  3. 3 x 3 x 3 x 64 = 1728
  4. 3 x 3 x 3 x 64 + 64 = 1792

Solution: option 4 (1792)
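The counting rule behind this answer (weights plus one bias per filter) fits in a one-line helper; the function name `conv_params` is mine:

```python
def conv_params(kh, kw, c_in, n_filters):
    """Parameters in a conv layer: kh*kw*c_in weights per filter, plus one bias each."""
    return kh * kw * c_in * n_filters + n_filters

print(conv_params(3, 3, 3, 64))  # the quiz's 3 x 3 x 3 x 64 kernel
```

Note the input's 255 x 255 spatial size never enters the count: convolution weights are shared across all positions.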

50 of 83

Summary

  •  

51 of 83

Summary

  •  

52 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

53 of 83

Pooling Function

  • Reading: CS231n Convolutional Neural Networks for Visual Recognition
  • A common way to reduce the dimension of a tensor
    • Max pooling: take the largest value inside the window, for each channel at each stride
    • Average pooling: take the average value inside the window, for each channel at each stride

54 of 83

Max pooling

  • Pooling is always applied to one slice (one channel) at a time
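A minimal sketch of max pooling on a single channel (pure Python; the name `max_pool2d` is mine):

```python
def max_pool2d(x, size=2, stride=2):
    """Max pooling over one channel (a 2D nested list)."""
    H, W = len(x), len(x[0])
    return [[max(x[i + m][j + n] for m in range(size) for n in range(size))
             for j in range(0, W - size + 1, stride)]
            for i in range(0, H - size + 1, stride)]

# a 4 x 4 slice becomes 2 x 2: each output is the max of one 2 x 2 window
print(max_pool2d([[1, 3, 2, 1], [4, 6, 5, 0], [3, 1, 1, 2], [7, 2, 4, 3]]))
```

Average pooling replaces `max(...)` with the mean over the same window; pooling has no trainable parameters.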

55 of 83

Pooling function

  • Pooling output size follows the same rule as convolution with no padding: ⌊(n − f)/s⌋ + 1, applied per channel

56 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

57 of 83

Typical convolutional neural network

  • Convolution layers, repeated, to "extract features"
  • Pooling layers, repeated, to "downsize the sample"
  • (It is common to alternate convolution and pooling)
  • Fully connected layers, repeated, to "train the weights"

  • Lots of hyperparameters to tune
  • Tons of parameters to be trained (luckily not by you)

58 of 83

Typical network arrangement

 

2D convolution -> activation (repeat)

Pooling -> activation (repeat)

Flatten

Fully connected -> activation (repeat)

Softmax

Output: one-hot
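The final softmax step in the arrangement above converts raw scores into probabilities, e^{z_i} / Σ_j e^{z_j}. A minimal sketch (the name `softmax` is mine):

```python
import math

def softmax(z):
    """Turn a score vector into probabilities that sum to 1."""
    m = max(z)                         # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]
```

The one-hot training label is then compared against this probability vector by the loss (typically cross-entropy).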

59 of 83

Flatten Layer

  • Simply converts a tensor of any dimension into a vector
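In pure Python this is a one-liner over nested lists (the name `flatten` is mine):

```python
def flatten(t):
    """Recursively flatten a nested-list tensor into a flat vector."""
    if isinstance(t, list):
        return [v for sub in t for v in flatten(sub)]
    return [t]

# a 2 x 2 x 2 tensor becomes a length-8 vector
print(flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]))
```

A flatten layer has no parameters; it only reshapes, so a 5 x 5 x 48 tensor becomes a 1200-vector.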

60 of 83

Network Dimension

  • While ChatGPT can help you figure out dimension issues, you should be able to work them out yourself
  • Suppose we have an image with dimensions 227 x 227 x 3
  • Going through the network on the right, figure out the values of all '?' markers
  • How many parameters are in this model?

Input 227 x 227 x 3

h1: ? x ? x ?

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: ? x ? x ?

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

61 of 83

Network Dimension

Input 227 x 227 x 3

h1: ? x ? x ?

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: ? x ? x ?

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

h4: ? x ? x ?

conv3: 3 x 3 x ? x 48, padding: 0, stride: 1, ReLU activation

h5: ? x ?

flatten

h6: ? x ?

FC: ? x 1024, sigmoid

FC: ? x ?, sigmoid

Output before softmax h7: ? x ?

Output in one-hot: ? x ?

Softmax

62 of 83

Network Dimension

  • h1: ⌊(227 + 2×0 − 11)/4⌋ + 1 = 55, and conv1 has 96 filters, so h1 = 55 x 55 x 96

Input 227 x 227 x 3

h1: 55 x 55 x 96

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: ? x ? x ?

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

63 of 83

Network Dimension

  • h2: ⌊(55 + 2×2 − 5)/4⌋ + 1 = ⌊13.5⌋ + 1 = 14, so h2 = 14 x 14 x 96

Input 227 x 227 x 3

h1: 55 x 55 x 96

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: 14 x 14 x 96

conv2: 5 x 5 x ? x 96, stride: 4, padding: 2, ReLU activation

h3: ? x ? x ?

max pooling 2 x 2, stride: 2, ReLU activation

64 of 83

Network Dimension

  • h3: max pooling 2 x 2 with stride 2 gives ⌊(14 − 2)/2⌋ + 1 = 7, so h3 = 7 x 7 x 96

Input 227 x 227 x 3

h1: 55 x 55 x 96

conv1: 11 x 11 x 3 x 96, stride: 4, padding: 0, ReLU activation

h2: 14 x 14 x 96

conv2: 5 x 5 x 96 x 96, stride: 4, padding: 2, ReLU activation

h3: 7 x 7 x 96

max pooling 2 x 2, stride: 2, ReLU activation

65 of 83

Network Dimension

  • h4: (7 + 2×0 − 3)/1 + 1 = 5, and conv3 has 48 filters, so h4 = 5 x 5 x 48

h3: 7 x 7 x 96

h4: 5 x 5 x 48

conv3: 3 x 3 x 96 x 48, padding: 0, stride: 1, ReLU activation

h5: ? x ?

flatten

h6: ? x ?

FC: ? x 1024, sigmoid

FC: ? x ?, sigmoid

Output before softmax h7: ? x ?

Output in one-hot: ? x ?

Softmax

66 of 83

Network Dimension

  • h5: flattening 5 x 5 x 48 gives 5 × 5 × 48 = 1200, so h5 = 1200 x 1; FC1 then outputs h6 = 1024 x 1

h3: 7 x 7 x 96

h4: 5 x 5 x 48

conv3: 3 x 3 x 96 x 48, padding: 0, stride: 1, ReLU activation

h5: 1200 x 1

flatten

h6: 1024 x 1

FC1: ? x 1024, sigmoid

FC2: ? x ?, sigmoid

Output before softmax h7: ? x ?

Output in one-hot: ? x ?

Softmax

67 of 83

Network Dimension

  • FC1 maps 1200 to 1024 (weight matrix 1200 x 1024); FC2 maps 1024 to the 1000 classes (weight matrix 1024 x 1000)

h3: 7 x 7 x 96

h4: 5 x 5 x 48

conv3: 3 x 3 x 96 x 48, padding: 0, stride: 1, ReLU activation

h5: 1200 x 1

flatten

h6: 1024 x 1

FC1: 1200 x 1024, sigmoid

FC2: 1024 x 1000, sigmoid

Output before softmax h7: 1000 x 1

Output in one-hot: 1000 x 1

Softmax

68 of 83

Network Dimension

  • The total number of parameters to be trained is 2,561,784

  • This is small by modern standards

Function (layer)    # Parameters
conv1               34,944
conv2               230,496
max pooling         0
conv3               41,520
flatten             0
fc1                 1,229,824
fc2                 1,025,000
SoftMax             0
Sum                 2,561,784
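The whole worked example can be checked with a short script (pure Python; the variable names are mine). Each dimension uses the output-size rule ⌊(n + 2p − f)/s⌋ + 1, and each layer's parameter count is weights plus biases:

```python
import math

def out_size(n, f, p, s):
    """Output width/height: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

h1 = out_size(227, 11, 0, 4)   # conv1: 11 x 11 x 3 x 96, stride 4, padding 0
h2 = out_size(h1, 5, 2, 4)     # conv2: 5 x 5 x 96 x 96, stride 4, padding 2
h3 = out_size(h2, 2, 0, 2)     # max pooling 2 x 2, stride 2
h4 = out_size(h3, 3, 0, 1)     # conv3: 3 x 3 x 96 x 48, stride 1, padding 0
flat = h4 * h4 * 48            # flatten 5 x 5 x 48 -> 1200

params = (
    11 * 11 * 3 * 96 + 96      # conv1
    + 5 * 5 * 96 * 96 + 96     # conv2
    + 3 * 3 * 96 * 48 + 48     # conv3
    + flat * 1024 + 1024       # fc1: 1200 -> 1024
    + 1024 * 1000 + 1000       # fc2: 1024 -> 1000
)
print(h1, h2, h3, h4, flat, params)
```

The fully connected layers dominate the count, which is typical of early CNN designs.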

69 of 83

Outline

  • Background of computer vision
  • Image and Convolution of Image
  • Tensor and Convolution over Volume
  • Pooling
  • Typical network structure
  • Prominent Structure

70 of 83

LeNet

  • One of the very first CNNs, designed for MNIST

https://www.researchgate.net/figure/The-LeNet-5-Architecture-a-convolutional-neural-network_fig4_321586653

71 of 83

AlexNet Discussion

  • The input is a 227 x 227 image with 3 channels
  • The output is 1000 classes
  • 62,378,344 parameters

https://medium.com/analytics-vidhya/concept-of-alexnet-convolutional-neural-network-6e73b4f9ee30

72 of 83

PyTorch Discussion: AlexNet Definition

73 of 83

PyTorch Demonstration

  • https://github.com/pytorch/vision/blob/master/torchvision/models/alexnet.py
  • What has happened?
    • Someone has trained the AlexNet shown before
    • Then we load a dog image, resize it to 256 x 256, and center-crop to 224 x 224
    • We have 1000 classes, which is why we see a 1000-dimensional vector displayed (we actually see two; what is the difference between them?)
    • Then we match the class IDs and output the top-5 most probable classes

74 of 83

VGG

  • conv -> max pooling -> conv -> max pooling ...
  • VGG-16: 138 million parameters

75 of 83

VGG-19

  • A deeper variant that followed VGG-16
  • 144 million parameters
  • 19.6 billion floating-point operations (FLOPs)

Image from GeeksforGeeks.

76 of 83

ResNET

  • Key idea: a residual block learns a residual function F(x) and outputs F(x) + x through a skip (shortcut) connection, which makes very deep networks trainable
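The skip connection can be sketched in plain Python, treating a block as a function and the activations as flat vectors (the helper name `residual` is mine; real ResNets do this on tensors with conv layers as F):

```python
def residual(block, x):
    """Residual connection: output = F(x) + x, elementwise."""
    fx = block(x)                         # F(x): the learned residual
    return [a + b for a, b in zip(fx, x)] # add the identity shortcut
```

Because the identity path is always present, the block only needs to learn the difference from the identity mapping, which is what eases optimization of very deep stacks.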

77 of 83

ResNet

  • A 34-layer ResNet is used
  • Top-5 accuracy: 96.47%
  • 25.6 million parameters
  • 3.6 billion floating-point operations (FLOPs)

78 of 83

ResNet

  • There are many variants

https://arxiv.org/pdf/1512.03385

79 of 83

PyTorch Implementation for Residual block

  • PyTorch implementation


80 of 83

Residual Block Implementation in PyTorch

81 of 83

Depth-wise separable convolution
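Depth-wise separable convolution splits a standard convolution into a per-channel (depthwise) k x k filter followed by a 1 x 1 (pointwise) convolution across channels. Its main appeal is the parameter and FLOP savings, which a quick count shows (helper names are mine; biases omitted for simplicity):

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no biases)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise: one k x k filter per input channel; pointwise: 1 x 1 conv."""
    return k * k * c_in + c_in * c_out

# compare on the earlier conv3 shape: 3 x 3, 96 in, 48 out
print(standard_conv_params(3, 96, 48), depthwise_separable_params(3, 96, 48))
```

For a 3 x 3, 96-to-48 layer the separable version needs roughly 7.5x fewer weights, which is why architectures like MobileNet build on it.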

82 of 83

PNAS Net

  • Progressive Neural Architecture Search (PNASNet)
  • A PNAS cell is defined

https://sh-tsang.medium.com/reading-pnasnet-progressive-neural-architecture-search-image-classification-1beb1de06fe6

83 of 83

Summary

  • ImageNet accuracy

https://paperswithcode.com/sota/image-classification-on-imagenet