Image Classification
Victor Boutin
Outline
Introduction
Overview of recognition tasks
https://arxiv.org/pdf/1405.0312.pdf
What is Computer Vision (CV)?
Task: Classification
Potential Categories: …
Task: Object Detection
Task: Image Segmentation
Example: automatic tumor segmentation
Image Classification
Given an input image and a set of categories (dog, car, cat, …), the task is to assign the image to the correct category (✅ dog, ❌ car, ❌ cat).
Problem
Each pixel is encoded by three numbers, e.g. 168 145 117: its RGB ([R]ed [G]reen [B]lue) representation.
The whole image is therefore stored as three matrices of raw pixel values, one per channel: the RGB matrices.
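For illustration, a minimal Python sketch (using NumPy and Pillow; "dog.jpg" is a placeholder filename) showing that the image the classifier receives is just an array of numbers:

```python
import numpy as np
from PIL import Image

# Load an image as an array of raw pixel values ("dog.jpg" is a placeholder filename)
img = np.array(Image.open("dog.jpg"))
print(img.shape)    # (height, width, 3): one matrix per R, G, B channel
print(img[0, 0])    # the three RGB values of the top-left pixel, each in [0, 255]
```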
Image Classification
The classifier takes the RGB matrices of pixel values as input and must map them to one of the categories of the set (dog, car, cat, …).
Image Classification
In practice, the classifier outputs a score for each category of the set, e.g. dog: 82 %, car: 9 %, cat: 0.5 %, …: 8.5 %.
Challenges
https://cs231n.github.io/classification/
Framework for classification
What matters:
• Machine learning methods (e.g., linear classification, deep learning)
• Representation (e.g., SIFT, HoG, deep learned features)
• Data (e.g., PASCAL, ImageNet, COCO)
The statistical learning framework
Apply a prediction function f to a feature representation x of the image to get the desired output y (dog, car, cat, …).
Feature representation: x
Output: y
The statistical learning framework
• Training: given a training set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the prediction function f by minimizing the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and output the predicted value f(x) = y
f(x) = y, where f is the prediction function, x the feature representation and y the output.
Image categorization
An image classifier is built with a data-driven approach:
• Collect a database of images with labels
• Use ML to train an image classifier
• Evaluate the classifier on test images (a minimal sketch is given below)
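A minimal sketch of this data-driven pipeline, with a 1-nearest-neighbour classifier standing in for the ML step (all names and the random data are illustrative, not from the slides):

```python
import numpy as np

def train(X_train, y_train):
    # "Training" a nearest-neighbour classifier just memorizes the labeled database
    return X_train, y_train

def predict(model, x):
    # Predict the label of the closest training image under the L2 distance
    X_train, y_train = model
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

def evaluate(model, X_test, y_test):
    # Accuracy on held-out test images
    preds = np.array([predict(model, x) for x in X_test])
    return (preds == y_test).mean()

# Random data standing in for a labeled image database (100 train / 20 test images)
X_train, y_train = np.random.rand(100, 3072), np.random.randint(0, 10, 100)
X_test, y_test = np.random.rand(20, 3072), np.random.randint(0, 10, 20)
model = train(X_train, y_train)
print(evaluate(model, X_test, y_test))
```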
What we need:
Linear Classifier
Scores
For each category of the set (dog, car, cat, …), the classifier f first outputs a raw score (e.g., dog: 12.3, car: 2.2, cat: -5.3); a softmax then turns these scores into probabilities (e.g., dog: 82 %, car: 9 %, cat: 0.5 %, …: 8.5 %).
Probabilities from scores?
Objective: turn the raw class scores given by the score function f (sdog, scar, scat, …) into probabilities.
Naive solution: divide each score by the sum of all scores S, e.g. pdog = sdog / S.
BUT problem: pdog can be negative and S can be 0!
Solution: Use exp (strictly increasing and strictly positive).
Softmax: pdog = exp(sdog) / (exp(sdog) + exp(scar) + exp(scat) + …).
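A minimal NumPy sketch of the softmax, applied to the example scores used in these slides:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max is a standard numerical-stability trick; it does not change
    # the result, because softmax is invariant to adding a constant to all scores.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([12.3, 2.2, -5.3])   # dog, car, cat
probs = softmax(scores)
print(probs, probs.sum())               # strictly positive values that sum to 1
```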
How to learn f?
The score function f(x) maps an image to one score per category (e.g., dog: 12.3, car: 2.2, cat: -5.3, …).
What is learning?
From a set of candidate functions {f1, f2, f3, f4, …}, select the best function f*.
Goal: Find the function that optimizes a criterion.
Criterion: Best classification scores across a hold-out test set.
Parameterization
The search over the set of functions {f1, f2, f3, f4, …} for the best function f* is turned into a search over a set of parameters {W1, W2, W3, W4, …} for the best parameters W*: the function is written f(x, W).
How to learn f?
f(x, W) maps the image x to the class scores (e.g., dog: 12.3, car: 2.2, cat: -5.3, …), where
f: a neural network
W: the weights
MLP
Parametric approach
Vectorize the image: the 32x32x3 input image (a tensor) is flattened into a single vector of 32 x 32 x 3 = 3072 values before being fed to the MLP, which outputs one score per class (dog: 12.3, car: 2.2, cat: -5.3, …). A minimal sketch follows.
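A minimal PyTorch sketch of such an MLP (the hidden size of 256 is an arbitrary choice, not taken from the slides):

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),                  # vectorize the 3x32x32 image into 3072 values
    nn.Linear(32 * 32 * 3, 256),   # hidden layer (size chosen arbitrarily here)
    nn.ReLU(),
    nn.Linear(256, 10),            # one raw score per class
)

x = torch.randn(1, 3, 32, 32)      # a dummy CIFAR-10-sized image
scores = mlp(x)
print(scores.shape)                # torch.Size([1, 10])
```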
Score function: f
Parametric approach
The linear score function multiplies the 3072-dimensional image vector by a weight matrix W and adds a bias b to produce the class scores (dog: 12.3, car: 2.2, cat: -5.3, …).
Linear Classifier
f(xi, W, b) = Wxi + b
Interpretation: template matching (each row of W acts as a template for one class). A minimal sketch follows.
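A minimal NumPy sketch of the linear classifier (W and b are random here; in practice they are learned):

```python
import numpy as np

num_classes = 10
x = np.random.rand(32, 32, 3).reshape(-1)        # vectorized image: 3072 values
W = np.random.randn(num_classes, x.size) * 0.01  # one row (one template) per class
b = np.zeros(num_classes)

scores = W @ x + b                               # f(x, W, b) = Wx + b: 10 class scores
print(scores.argmax())                           # index of the predicted class
```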
Geometric Interpretation
Linear classifier: Three viewpoints
Convolution & Receptive field
Convolution Layer
Filters always extend the full depth of the input volume.
Sliding window
A 3x3 window (the kernel) slides over the input image to produce the output.
Source: https://www.kaggle.com/code/ryanholbrook/the-sliding-window
Sliding window: stride
The stride is the shift of the kernel: here the window is shifted by 2 pixels in each direction, i.e. stride = (2, 2).
Source: https://www.kaggle.com/code/ryanholbrook/the-sliding-window
Sliding window: padding
Padding adds extra values on the border of the input image.
Useful to obtain an output size equal to the input size.
Source: https://www.kaggle.com/code/ryanholbrook/the-sliding-window
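A minimal PyTorch sketch of a convolution layer combining kernel size, stride and padding (all values are illustrative):

```python
import torch
import torch.nn as nn

# Filters extend over the full depth of the input (here, 3 channels)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
y = conv(x)
print(y.shape)                  # torch.Size([1, 8, 16, 16]): stride 2 halves H and W
```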
Convolution Layer
https://setosa.io/ev/image-kernels/
Consider a second, green filter: each filter produces its own activation map, and the maps are stacked along the depth dimension.
https://miro.medium.com/max/4800/1*QgiVWSD6GscHh9nt55EfXg.gif
Pooling layer
Max Pooling
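A minimal PyTorch sketch of max pooling (a 2x2 window keeps only the maximum of each region):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 8, 16, 16)
print(pool(x).shape)            # torch.Size([1, 8, 8, 8]): spatial resolution is halved
```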
Receptive field
The receptive field is the area of the input image that is seen by a single neuron in any layer of the network.
Receptive field: Convolutions
Example: 2 convolutional layers with 3x3 filters.
Each activation of map #1 “sees” an area of 3x3 in the input.
Each activation of map #2 “sees” an area of 3x3 in map #1.
How much of the input does an activation of map #2 see? It sees a 5x5 region: the 3x3 neighbourhood it reads in map #1 itself covers a 5x5 neighbourhood of the input.
Receptive field
Factors that affect the receptive field:
• Number of layers
• Filter size
• Presence of max-pooling: increases the receptive field, but decreases the resolution (spatial information is lost)
Two stacked layers of 3x3 filters give the same receptive field as one layer of a 5x5 filter (see the sketch below).
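A small sketch of the usual receptive-field recurrence, assuming each layer is described only by its kernel size and stride:

```python
def receptive_field(layers):
    # Each layer with kernel size k and stride s grows the receptive field by
    # (k - 1) * jump, where jump is the product of the strides of earlier layers.
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 convs
print(receptive_field([(5, 1)]))                  # 5: one 5x5 conv
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: a 2x2 stride-2 max-pool in between enlarges the RF
```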
Vision Transformers (ViTs)
Transformers
Details in a future lesson…
A Transformer takes a sequence (x1, x2, x3, …, xn) as input.
Image as a sequence
How can an image be seen as a sequence? Split it into patches and order them (1, 2, 3, 4, 5, 6).
Transformers
The ordered sequence of patches (1, 2, 3, 4, 5, 6) is fed to the Transformer.
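A minimal PyTorch sketch of turning an image into a sequence of patches (the 224x224 input size and 16x16 patches are assumptions, in the style of ViT):

```python
import torch

img = torch.randn(3, 224, 224)                  # (C, H, W)
P = 16                                          # patch size
C, H, W = img.shape
patches = img.unfold(1, P, P).unfold(2, P, P)   # (C, H/P, W/P, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * P * P)
print(patches.shape)                            # torch.Size([196, 768]): a sequence of 196 patch tokens
```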
Datasets
MNIST
Classic dataset of handwritten digit images for benchmarking classification algorithms (1999).
Training set of 60,000 28x28 grayscale images of the 10 digits; test set of 10,000 images.
CIFAR-10
60,000 tiny 32x32 images in total (50k train + 10k test),
each labeled with one of 10 classes (e.g., airplane, automobile, bird, etc.).
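A minimal torchvision sketch of loading CIFAR-10 (assumes torchvision is installed; "./data" is an arbitrary local path):

```python
import torchvision
from torchvision import transforms

transform = transforms.ToTensor()
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

print(len(train_set), len(test_set))   # 50000 10000
img, label = train_set[0]
print(img.shape)                       # torch.Size([3, 32, 32])
```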
ImageNet
• Over 15M labeled high-resolution images
• Roughly 22K categories
• Collected from the web and labeled via Amazon Mechanical Turk
ILSVRC Task 2: Classification + Localization
ILSVRC Task 3: Detection
Long-tailed dataset
Models
Ventral visual pathway
Key Goals of Object Recognition
● Invariance: the ability to recognize a pattern under various transformations
● Specificity: the ability to discriminate between different patterns
Neocognitron, Fukushima (1980): hierarchical multilayered neural network
Character Recognition
Hierarchical model and X (HMAX)
http://maxlab.neuro.georgetown.edu/hmax.html#c2
AlexNet (2012 Winner, 15.3% error rate)
ZFNet (2013 Winner, 11.2% error rate)
ZFNet used 7x7 filters in its first convolutional layer, in contrast to the 11x11 filters used in AlexNet, to avoid the loss of pixel information.
GoogLeNet (2014 winner, 6.67% error rate)
• 22 layers
• Introduces a new layer architecture: the Inception module
VGG
Results on ILSVRC-2014
ResNet (2015 winner, 3.57% error rate)
Practical Session
Colab Tutorial
Colab Tutorials (Advanced)
Appendix
Trends
CIFAR-100
100 classes containing 600 images each.
There are 500 training images and 100 testing images per class.
The 100 classes in the CIFAR-100 are grouped into 20 superclasses.
Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
The PASCAL Visual Object Classes (VOC) project
PASCAL Visual Object Challenge
Caltech 101
http://www.vision.caltech.edu/Image_Datasets/Caltech101/#Description
Large Scale Visual Recognition Challenge
ILSVRC
Medical Image Classification Datasets
1. Recursion Cellular Image Classification – This data comes from the Recursion 2019 challenge. The goal of the competition was to use biological microscopy data to develop a model that identifies replicates.
2. TensorFlow patch_camelyon Medical Images – This medical image classification dataset comes from the TensorFlow website. It contains just over 327,000 color images, each 96 x 96 pixels. The images are histopathological lymph node scans which contain metastatic tissue.
Agriculture and Scene Datasets
3. CoastSat Image Classification Dataset – Used for an open-source shoreline mapping tool, this dataset includes aerial images taken from satellites. The dataset also includes metadata pertaining to the labels.
4. Images for Weather Recognition – Used for multi-class weather recognition, this dataset is a collection of 1125 images divided into four categories. The image categories are sunrise, shine, rain, and cloudy.
5. Indoor Scenes Images – From MIT, this dataset contains over 15,000 images of indoor locations. The dataset was originally built to tackle the problem of indoor scene recognition. All images are in JPEG format and have been divided into 67 categories. The number of images per category varies; however, there are at least 100 images for each category.
Agriculture and Scene Datasets
6. Intel Image Classification – Created by Intel for an image classification contest, this expansive image dataset contains approximately 25,000 images. Furthermore, the images are divided into the following categories: buildings, forest, glacier, mountain, sea, and street. The dataset has been divided into folders for training, testing, and prediction. The training folder includes around 14,000 images and the testing folder has around 3,000 images. Finally, the prediction folder includes around 7,000 images.
7. TensorFlow Sun397 Image Classification Dataset – Another dataset from TensorFlow, this dataset contains over 108,000 images used in the Scene Understanding (SUN) benchmark. The images have been divided into 397 categories. The exact number of images in each category varies; however, there are at least 100 images in each of the various scene and object categories.
Other Image Classification Datasets
8. Architectural Heritage Elements – This dataset was created to train models that could classify architectural images, based on cultural heritage. It contains over 10,000 images divided into 10 categories. The categories are: altar, apse, bell tower, column, dome (inner), dome (outer), flying buttress, gargoyle, stained glass, and vault.
9. Image Classification: People and Food – This dataset comes in CSV format and consists of images of people eating food. Human annotators classified the images by gender and age. The CSV file includes 587 rows of data with URLs linking to each image.
10. Images of Cracks in Concrete for Classification – From Mendeley, this dataset includes 40,000 images of concrete. Each image is 227 x 227 pixels, with half of the images including concrete with cracks and half without.
Other databases
Neocognitron Fukushima (1980). Hierarchical multilayered neural network
S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response.
C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output.