1 of 157

Image Classification

Victor Boutin

2 of 157

Outline

  • Motivation
  • Framework for classification
    • Overview
    • Challenges
    • Pipeline
  • Receptive field and convolutions
  • Datasets
  • Models of classification
    • Pre-AlexNet
    • Post-AlexNet
  • Trends in deep learning (if time permits)
  • Practical session

3 of 157

Introduction

4 of 157

Overview of recognition tasks

https://arxiv.org/pdf/1405.0312.pdf

5 of 157

What is Computer Vision (CV)?


6 of 157

What is Computer Vision (CV)?


7 of 157

Task: Classification

Potential Categories:

  • goldfish
  • pigeon

  • skateboard
  • bike
  • rugby ball

  • sandwich
  • egg


9 of 157

Task: Object Detection


11 of 157

Task: Image Segmentation


12 of 157

Task: Image Segmentation

Automatic tumor segmentation

13 of 157

Image Classification


15 of 157

Image Classification

Set of categories: dog, car, cat


17 of 157

Problem


19 of 157

Problem

Each pixel is represented by three values, e.g. (168, 145, 117): its RGB ([R]ed [G]reen [B]lue) representation.
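As a minimal sketch of this representation (the file name is a placeholder; any RGB image works), an image can be loaded as a height x width x 3 array of 8-bit intensities:

```python
import numpy as np
from PIL import Image

# Load an image and view it as an array of RGB pixel values.
img = Image.open("dog.jpg").convert("RGB")   # placeholder path
x = np.asarray(img)                          # shape (height, width, 3), dtype uint8

print(x.shape)      # e.g. (32, 32, 3)
print(x[0, 0])      # RGB triplet of the top-left pixel, e.g. [168 145 117]
```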

20 of 157

Problem

RGB Matrices

[Figure: the image written out as three matrices of pixel intensity values, one per colour channel]

21 of 157

Image Classification

Set of categories: dog, car, cat

[Figure: the input image written out as its three RGB matrices of pixel values]

22 of 157

Image Classification

Set of categories: dog, car, cat

[Figure: the three RGB matrices are mapped to a probability per category, e.g. 82 %, 9 %, 0.5 %, 8.5 %]

23 of 157

Challenges

  • Viewpoint variation. A single instance of an object can be oriented in many ways with respect to the camera.
  • Scale variation. Visual classes often exhibit variation in their size (size in the real world, not only in terms of their extent in the image).
  • Deformation. Many objects of interest are not rigid bodies and can be deformed in extreme ways.
  • Occlusion. The objects of interest can be occluded. Sometimes only a small portion of an object (as little as a few pixels) may be visible.
  • Illumination conditions. The effects of illumination are drastic at the pixel level.
  • Background clutter. The objects of interest may blend into their environment, making them hard to identify.
  • Intra-class variation. The classes of interest can often be relatively broad, such as chair. There are many different types of these objects, each with its own appearance.

https://cs231n.github.io/classification/

24 of 157

Challenges

25 of 157

Framework for classification

26 of 157

What matters

Machine learning methods (e.g., linear classification, deep learning)

Representation (e.g., SIFT, HoG, deep learned features)

Data (e.g., PASCAL, ImageNet, COCO)

27 of 157

The statistical learning framework

Apply a prediction function to a feature representation of the image to get the desired output:

dog

car

cat

f

28 of 157

The statistical learning framework

Apply a prediction function to a feature representation of the image to get the desired output:

dog

car

cat

f

Feature representation

x

Output

y

29 of 157

The statistical learning framework

Apply a prediction function to a feature representation of the image to get the desired output:

30 of 157

The statistical learning framework

Apply a prediction function to a feature representation of the image to get the desired output:

X

y

31 of 157

The statistical learning framework

Training: given a training set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the prediction function f by minimizing the prediction error on the training set.

Testing: apply f to a never-before-seen test example x and output the predicted value f(x) = y.

f(x) = y, where f is the prediction function, x is the feature representation, and y is the output.
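A minimal sketch of this training/testing protocol, with a toy nearest-class-mean classifier standing in for f (the data and helper names here are illustrative, not from the lecture):

```python
import numpy as np

def train(X_train, y_train):
    """Estimate f on the training set: here, one mean feature vector per class."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes, means

def predict(params, x):
    """Apply f to a never-before-seen example x."""
    classes, means = params
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

# Toy data: 3072-dimensional feature vectors with labels in {0, 1, 2}.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(60, 3072)), rng.integers(0, 3, size=60)
X_test, y_test = rng.normal(size=(10, 3072)), rng.integers(0, 3, size=10)

params = train(X_train, y_train)
accuracy = np.mean([predict(params, x) == y for x, y in zip(X_test, y_test)])
print(f"test accuracy: {accuracy:.2f}")
```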

32 of 157

Image categorization

33 of 157

An image classifier:

34 of 157

Data-driven approach

• Collect a database of images with labels

• Use ML to train an image classifier

• Evaluate the classifier on test images

35 of 157

Data-driven approach

What we need:

  • A database of images with labels
  • An ML method to train an image classifier
  • Test images on which to evaluate the classifier

36 of 157

Linear Classifier

37 of 157

Scores

[Figure: categories dog, car, cat with predicted probabilities, e.g. 82 %, 9 %, 0.5 %, 8.5 %]

38 of 157

Scores

[Figure: the classifier f outputs one raw score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3, which are turned into probabilities by a softmax]

39 of 157

Probabilities from scores?

Objective:

  • 0 ≤ p ≤ 1
  • p_dog + p_cat + p_car + … = 1

Scores:

  • The scores “y” can have any value
  • They do not sum to 1

40 of 157

Probabilities from scores?

Objective:

  • 0 ≤ p ≤ 1
  • p_dog + p_cat + p_car + … = 1

Scores:

  • The scores “y” can have any value
  • They do not sum to 1

Solution:

  • S = y_dog + y_cat + y_car + …
  • p_dog = y_dog / S
  • p_dog + p_cat + p_car + … = 1

BUT there is a problem: p_dog can be negative and S can be 0!

Solution: use exp, which is strictly increasing and strictly positive, i.e. p_dog = exp(y_dog) / (exp(y_dog) + exp(y_cat) + exp(y_car) + …).

41 of 157

Softmax
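A minimal numpy sketch of the softmax described above (subtracting the maximum score is a standard numerical-stability trick, not something required by the definition):

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # avoid overflow in exp for large scores
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([12.3, 2.2, -5.3])    # raw class scores
probs = softmax(scores)
print(probs)                            # approx. [1.0, 4.1e-05, 2.3e-08]
print(probs.sum())                      # 1.0
```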

42 of 157

Score function

Class score

43 of 157

How to learn f?

[Figure: the image x is mapped by f(x) to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

44 of 157

What is learning?

[Figure: among a set of functions f1, f2, f3, f4, …, find the best function f*]

Goal:

Find the function that optimizes a criterion.

Criterion:

Best classification scores across a hold-out test set.

45 of 157

Parameterization

[Figure: the set of functions f1, f2, f3, f4, … is parameterized as f(x, W), so searching for the best function f* becomes searching for the best set of parameters W* among W1, W2, W3, W4, …]

46 of 157

How to learn f?

[Figure: the image x is mapped by f(x, W) to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

47 of 157

Score function: f

Parametric approach

[Figure: a 32x32 input image is mapped by f to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

48 of 157

How to learn f?

[Figure: the image x is mapped by f(x, W) to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

f: a neural network

W: the weights

49 of 157

MLP

Parametric approach

[Figure: a 32x32 input image is fed to an MLP that outputs one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

50 of 157

MLP

Parametric approach

[Figure: the 32x32 input image (a tensor) and its flattened form (a vector), mapped to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

51 of 157

MLP

Parametric approach

Vectorize the image: the 32x32x3 input tensor becomes a vector of 32 x 32 x 3 = 3072 values.

[Figure: the 3072-dimensional vector is mapped by the MLP to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

52 of 157

Score function: f

Parametric approach

[Figure: f maps the 3072-dimensional image vector to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3]

53 of 157

Score function: f

Parametric approach

[Figure: f maps the 3072-dimensional image vector to one score per category (dog, car, cat), e.g. 12.3, 2.2, −5.3; a bias term is added to the scores]

54 of 157

Linear Classifier

55 of 157

Linear Classifier

56 of 157

Linear Classifier

57 of 157

Interpretation

58 of 157

Template matching

f(xi, W, b) = Wxi + b
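A minimal numpy sketch of this score function (shapes follow the 32x32x3 = 3072 example above; the random values stand in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.integers(0, 256, size=(32, 32, 3)).astype(float).reshape(-1)  # image -> 3072-d vector
W = rng.normal(scale=0.01, size=(3, 3072))   # one row of weights per class (dog, car, cat)
b = np.zeros(3)                              # one bias per class

scores = W @ x + b                           # f(x, W, b) = Wx + b
print(scores.shape)                          # (3,): one score per class
```

Each row of W can be read back as a 32x32x3 image, which is the template-matching interpretation: the score of a class is the dot product between the image and that class's template, plus a bias.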

59 of 157

Geometric Interpretation

60 of 157

Linear classifier: Three viewpoints

61 of 157

Convolution & Receptive field

62 of 157

Convolution Layer

63 of 157

Convolution Layer

Filters always extend the full depth of the input volume

64 of 157

Sliding window

3x3 window sliding over the image

Source: https://www.kaggle.com/code/ryanholbrook/the-sliding-window

[Figure: a kernel slides over the input image to produce the output]

65 of 157

Sliding window: stride

Shift of the kernel: here we shift the window by 2 pixels in each direction:

stride = (2, 2)

Source: https://www.kaggle.com/code/ryanholbrook/the-sliding-window


66 of 157

Sliding window: padding

Padding adds extra values (typically zeros) on the border of the image.

It is useful to keep the output size equal to the input size.

Source: https://www.kaggle.com/code/ryanholbrook/the-sliding-window

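A minimal numpy sketch of the sliding-window operation with stride and padding (single channel; deep-learning libraries actually compute this cross-correlation and call it "convolution"; the function and parameter names are mine):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image`, summing elementwise products at each position."""
    image = np.pad(image, padding)              # extra zeros on the border
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1  # output-size formula
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0                  # 3x3 averaging filter
print(conv2d(image, kernel).shape)              # (4, 4)
print(conv2d(image, kernel, stride=2).shape)    # (2, 2): stride 2 shrinks the output
print(conv2d(image, kernel, padding=1).shape)   # (6, 6): output size = input size
```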

67 of 157

Convolution Layer

https://setosa.io/ev/image-kernels/

68 of 157

Convolution Layer

69 of 157

Convolution Layer

consider a second, green filter

70 of 157

Convolution Layer

71 of 157

https://miro.medium.com/max/4800/1*QgiVWSD6GscHh9nt55EfXg.gif

72 of 157

Pooling layer

  • makes the representations smaller and more manageable
  • operates over each activation map independently

73 of 157

Max Pooling
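A minimal sketch of 2x2 max pooling with stride 2, applied independently to one activation map (the function name is mine):

```python
import numpy as np

def max_pool2d(activation_map, size=2, stride=2):
    """Keep only the maximum value inside each (size x size) window."""
    out_h = (activation_map.shape[0] - size) // stride + 1
    out_w = (activation_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = activation_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

a = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool2d(a))   # [[6. 8.]
                       #  [3. 4.]]
```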

74 of 157

Receptive field

The receptive field is the area of the input image that is seen by a single neuron in a given layer of the network.

80 of 157

Receptive field: Convolutions

The receptive field is the area of the input image that is seen by a single neuron in a given layer of the network.

Example: 2 convolutional layers with 3x3 filters

81 of 157

Receptive field

Example: 2 convolutional layers with 3x3 filters

Each activation map #1 “sees” an area of 3x3 in the input

82 of 157

Receptive field

Example: 2 convolutional layers with 3x3 filters

Each activation map #2 “sees” an area of 3x3 in map #1

83 of 157

Receptive field

• How much of the input does an activation of map #2 see?

84 of 157

Receptive field

How much of the input does an activation of map #2 see?

It sees a 5x5 region! WHY?

94 of 157

Receptive field

95 of 157

Receptive field


96 of 157

Receptive field

Factors that affect the receptive field:

• Number of layers

• Filter size

Two stacked layers of 3x3 filters give the same receptive field as one layer of 5x5 filters.
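A small sketch of how the receptive field grows with depth, using the standard recurrence RF_l = RF_(l-1) + (k_l - 1) * j_(l-1), where k_l is the filter size of layer l and j_(l-1) is the product of all previous strides (assuming no dilation; the function is mine):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1                      # receptive field and cumulative stride
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 convolutions
print(receptive_field([(5, 1)]))                  # 5: one 5x5 convolution
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: a 2x2, stride-2 pool in between grows it faster
```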

97 of 157

Receptive field

Factors that affect the receptive field:

• Number of layers

• Filter size

• Presence of max-pooling: increases the receptive field, but decreases the resolution (spatial information is lost)

Two stacked layers of 3x3 filters give the same receptive field as one layer of 5x5 filters.

98 of 157

Visual Transformers (ViTs)

99 of 157

Transformers

Details in a future lesson…

[Figure: a sequence x1, x2, x3, …, xn is fed to a Transformer]

Input: a sequence

100 of 157

Image as a sequence

?

101 of 157

Image as a sequence

[Figure: the image is split into patches, numbered 1 to 6]

102 of 157

Transformers

[Figure: the sequence of image patches 1 to 6 is fed to a Transformer]
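A minimal numpy sketch of turning an image into the sequence of patches a ViT feeds to its Transformer (patch size and image size are illustrative assumptions):

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (image
            .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
            .transpose(0, 2, 1, 3, 4)                   # group the two patch-grid axes
            .reshape(-1, patch_size * patch_size * c))  # one row per patch

image = np.random.rand(224, 224, 3)
sequence = image_to_patches(image)
print(sequence.shape)   # (196, 768): 14 x 14 patches, each of 16*16*3 values
```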

103 of 157

Datasets

104 of 157

Datasets

105 of 157

Datasets

  • MNIST
  • CIFAR-10
  • CIFAR-100
  • ILSVRC2012

106 of 157

MNIST

Classic dataset of handwritten digit images for benchmarking classification algorithms (1999).

Training set of 60,000 28x28 grayscale images of the 10 digits, plus a test set of 10,000 images.

107 of 157

CIFAR-10

60,000 tiny images of 32x32 pixels in total (50k training + 10k test), each labeled with one of 10 classes (for example airplane, automobile, bird, etc.).

108 of 157

ImageNet

• Over 15M labeled high-resolution images

• Roughly 22K categories

• Collected from the web and labeled via Amazon Mechanical Turk

109 of 157

ILSVRC Task 2: Classification + Localization

110 of 157

ILSVRC Task 3: Detection

111 of 157

Long-tailed Dataset

112 of 157

113 of 157

Models

114 of 157

Ventral visual pathway

  • Mediates object recognition in cortex
  • V1 → V2 → V4 → IT → PFC
  • Cells found in macaque IT play a key role in object recognition
  • Hallmark of these cells: the robustness of their responses to stimulus transformations such as scale and position changes

115 of 157

Key Goals of Object Recognition

  • Invariance: the ability to recognize a pattern under various transformations
  • Specificity: the ability to discriminate between different patterns

116 of 157

Neocognitron Fukushima (1980). Hierarchical multilayered neural network

  • Multilayer neural network inspired by the mammalian visual system, for handwritten digit classification
  • Unsupervised image classification, tolerant to shifts and deformations
    • Input: unlabeled images
    • Output: a vector, with each bit encoding a distinct class of images
  • Generalizes simple and complex cells with alternating S and C layers:
    • S layers: feature detectors (like simple cells) that detect conjunctions of features (AND, specificity)
    • C layers: invariance pooling (like complex cells) (OR, MAX, invariance)

117 of 157

Neocognitron Fukushima (1980). Hierarchical multilayered neural network

Character Recognition

118 of 157

Hierarchical model and X (HMAX)

http://maxlab.neuro.georgetown.edu/hmax.html#c2

119 of 157

Hierarchical model and X (HMAX)

  • A hierarchical build-up of invariances first to position and scale and then to viewpoint and more complex transformations requiring the interpolation between several different object views.
  • In parallel, an increasing size of the receptive fields.
  • An increasing complexity of the optimal stimuli for the neurons.
  • A basic feedforward processing of information (for "immediate" recognition tasks)

120 of 157

121 of 157

AlexNet (2012 Winner, 15.3% error rate)

  • Trained on ImageNet data (1.2M training images, 1k classes); ILSVRC winner in 2012
  • Prior to this, the best error rate was around 25%; AlexNet brought it down to 15.3%
  • 8 layers: 5 convolutional layers and 3 fully-connected layers (see the sketch below)
  • ReLU activation function to add nonlinearity and improve the convergence rate
  • Leveraged multiple GPUs for faster training
  • 60 million parameters, making it susceptible to overfitting → dropout
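A hedged PyTorch sketch of an AlexNet-style stack (layer sizes roughly follow the torchvision variant; this illustrates the 5-convolution + 3-fully-connected structure rather than reproducing the original exactly):

```python
import torch
import torch.nn as nn

alexnet_like = nn.Sequential(
    # 5 convolutional layers with ReLU nonlinearities and occasional max-pooling
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # 3 fully-connected layers, with dropout against overfitting
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                 # 1000 ImageNet classes
)

scores = alexnet_like(torch.randn(8, 3, 224, 224))   # a batch of 8 RGB 224x224 images
print(scores.shape)                                  # (8, 1000)
```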

122 of 157

AlexNet

123 of 157

ZFNet (2013 Winner, 11.2% error rate)

ZFNet used 7x7 filters in its first convolutional layer, in contrast to the 11x11 filters used in AlexNet, to avoid losing fine pixel information.

124 of 157

GoogLeNet (2014 winner, 6.67% error rate)

• 22 layers

• Introduces a new layer architecture: the Inception module

125 of 157

126 of 157

GoogLeNet

127 of 157

VGG

  • 19 layers (the VGG-19 variant)
  • Fixed filter size: 3x3
  • Convolution layers: 3x3 filters with stride 1 (see the sketch below)
  • Max-pooling layers: 2x2 filters with stride 2
  • ~140 million parameters
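A hedged PyTorch sketch of one VGG-style block built from this recipe (channel counts are illustrative; a full VGG just stacks several such blocks before the fully-connected head):

```python
import torch
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    """A stack of 3x3, stride-1 convolutions followed by a 2x2, stride-2 max-pool."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_channels if i == 0 else out_channels, out_channels,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(3, 64, num_convs=2)
x = torch.randn(1, 3, 224, 224)
print(block(x).shape)   # (1, 64, 112, 112): spatial size kept by the convs, halved by the pool
```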

128 of 157

VGG

Results on ILSVRC-2014

129 of 157

130 of 157

131 of 157

ResNet 2015

132 of 157

133 of 157

Practical Session

134 of 157

Colab Tutorial

135 of 157

Colab Tutorials (Advanced)

136 of 157

Appendix

137 of 157

Trends

138 of 157

139 of 157

140 of 157

141 of 157

142 of 157

143 of 157

144 of 157

CIFAR 100

100 classes containing 600 images each.

There are 500 training images and 100 testing images per class.

The 100 classes in the CIFAR-100 are grouped into 20 superclasses.

Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).

145 of 157

CIFAR 100

146 of 157

The PASCAL Visual Object Classes

The PASCAL VOC project:

  • Provides standardised image data sets for object class recognition
  • Provides a common set of tools for accessing the data sets and annotations
  • Enables evaluation and comparison of different methods
  • Ran challenges evaluating performance on object class recognition (from 2005-2012, now finished)

147 of 157

PASCAL Visual Object Challenge

148 of 157

Caltech 101

  • Pictures of objects belonging to 101 categories.
  • About 40 to 800 images per category. Most categories have about 50 images.
  • Collected in September 2003 by Fei-Fei Li, Marco Andreetto, and Marc'Aurelio Ranzato.
  • The size of each image is roughly 300 x 200 pixels.

http://www.vision.caltech.edu/Image_Datasets/Caltech101/#Description

149 of 157

Large Scale Visual Recognition Challenge

150 of 157

ILSVRC

151 of 157

152 of 157

Medical Image Classification Datasets

1. Recursion Cellular Image Classification – This data comes from the Recursion 2019 challenge. The goal of the competition was to use biological microscopy data to develop a model that identifies replicates.

2. TensorFlow patch_camelyon Medical Images – This medical image classification dataset comes from the TensorFlow website. It contains just over 327,000 color images, each 96 x 96 pixels. The images are histopathological lymph node scans which contain metastatic tissue.

153 of 157

Agriculture and Scene Datasets

3. CoastSat Image Classification Dataset – Used for an open-source shoreline mapping tool, this dataset includes aerial images taken from satellites. The dataset also includes metadata pertaining to the labels.

4. Images for Weather Recognition – Used for multi-class weather recognition, this dataset is a collection of 1125 images divided into four categories. The image categories are sunrise, shine, rain, and cloudy.

5. Indoor Scenes Images – From MIT, this dataset contains over 15,000 images of indoor locations. The dataset was originally built to tackle the problem of indoor scene recognition. All images are in JPEG format and have been divided into 67 categories. The number of images per category varies, but there are at least 100 images for each category.

154 of 157

Agriculture and Scene Datasets

6. Intel Image Classification – Created by Intel for an image classification contest, this expansive image dataset contains approximately 25,000 images. Furthermore, the images are divided into the following categories: buildings, forest, glacier, mountain, sea, and street. The dataset has been divided into folders for training, testing, and prediction. The training folder includes around 14,000 images and the testing folder has around 3,000 images. Finally, the prediction folder includes around 7,000 images.

7. TensorFlow Sun397 Image Classification Dataset – Another dataset from TensorFlow, this dataset contains over 108,000 images used in the Scene Understanding (SUN) benchmark. The images have been divided into 397 categories. The exact number of images in each category varies, but there are at least 100 images in each of the various scene and object categories.

155 of 157

Other Image Classification Datasets

8. Architectural Heritage Elements – This dataset was created to train models that could classify architectural images, based on cultural heritage. It contains over 10,000 images divided into 10 categories. The categories are: altar, apse, bell tower, column, dome (inner), dome (outer), flying buttress, gargoyle, stained glass, and vault.

9. Image Classification: People and Food – This dataset comes in CSV format and consists of images of people eating food. Human annotators classified the images by gender and age. The CSV file includes 587 rows of data with URLs linking to each image.

10. Images of Cracks in Concrete for Classification – From Mendeley, this dataset includes 40,000 images of concrete. Each image is 227 x 227 pixels, with half of the images including concrete with cracks and half without.

156 of 157

Other database

157 of 157

Neocognitron Fukushima (1980). Hierarchical multilayered neural network

S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response.

C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output.