Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN): Image Processing
Housing Price Prediction
[Figure: price plotted against size of house, fit by a single neuron]
[Figure: a small network mapping size, #bedrooms, zip code, and wealth to the output y, the price]
Supervised Learning

| Input (x) | Output (y) | Application |
| Home features | Price | Real Estate |
| Ad, user info | Click on ad? (0/1) | Online Advertising |
| Image | Object (1,…,1000) | Photo tagging |
| Audio | Text transcript | Speech recognition |
| English | Chinese | Machine translation |
| Image, Radar info | Position of other cars | Autonomous driving |
Neural Network examples
Standard NN
Recurrent NN
Convolutional NN
Structured Data

| Size | #bedrooms | … | Price (1000$s) |
| 2104 | 3 | … | 400 |
| 1600 | 3 | … | 330 |
| 2400 | 3 | … | 369 |
| ⋮ | ⋮ | | ⋮ |
| 3000 | 4 | … | 540 |

| User Age | Ad Id | … | Click |
| 41 | 93242 | … | 1 |
| 80 | 93287 | … | 0 |
| 18 | 87312 | … | 1 |
| ⋮ | ⋮ | | ⋮ |
| 27 | 71244 | … | 1 |
Unstructured Data
Image
Audio
Text: "Four score and seven years ago…"
Neural Networks
How do our brains work?
Artificial Neurons
Dendrites: Input
Cell body: Processor
Axon terminals (synapses): Link
Axon: Output
An artificial neuron is an imitation of a human neuron
Artificial Neurons
• Now, let us have a look at the model of an artificial neuron.
How do ANNs work?
[Diagram: inputs x1, x2, …, xm feed into a processing node ∑, which produces the output y]
∑ = x1 + x2 + … + xm = y
How do ANNs work?
Not all inputs are equal.
[Diagram: each input xi is first multiplied by a weight wi, then summed]
∑ = x1·w1 + x2·w2 + … + xm·wm = y
How do ANNs work?
The signal is not passed down to the next neuron verbatim.
[Diagram: the weighted sum vk = ∑ xi·wi is passed through a transfer function (activation function) f(vk) to produce the output y]
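The weighted-sum-plus-activation neuron above can be sketched in a few lines of Python. This is a minimal illustration: the function name `neuron` and the example numbers are our own stand-ins, not values from the slides.

```python
import numpy as np

def neuron(x, w, b=0.0, activation=np.tanh):
    """One artificial neuron: weighted sum of inputs passed through f."""
    v = np.dot(x, w) + b   # v_k = x1*w1 + x2*w2 + ... + xm*wm + b
    return activation(v)   # y = f(v_k)

# Example: 3 inputs with unequal weights
x = np.array([1.0, 0.5, -1.0])
w = np.array([0.2, 0.8, 0.1])
y = neuron(x, w)
```

With the identity as the activation, this reduces to the plain weighted sum from the previous slide.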
How do ANNs work?
Simple Neural Network
Neural Network with 2 inputs, 2 hidden nodes and one output
Activation Functions
Sigmoid
Tanh
ReLU (Rectified Linear Unit)
Threshold
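Each of these activation functions is a one-liner in NumPy. A minimal sketch; `threshold` here is the simple step function, our assumption of the usual definition.

```python
import numpy as np

def sigmoid(v):
    # Squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    # Passes positive values through, clips negatives to 0
    return np.maximum(0.0, v)

def tanh(v):
    # Squashes any input into (-1, 1)
    return np.tanh(v)

def threshold(v):
    # Step function: 1 if v >= 0, else 0
    return np.where(np.asarray(v) >= 0, 1.0, 0.0)
```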
Example
Goal is to find the function y in terms of x1 and x2.
Simple Neural Network
Neural Network with 2 inputs, 1 hidden node and one output
Training a Neural Network
Backpropagation
How to modify the weights, given the loss, so that the loss is minimized
Optimization – Gradient Descent
How to update the weights?
Gradient Descent intuition
[Figure: the loss function graphed against the weights]
We want to find the values of the weights such that the loss is minimized.
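Written out, the weight update that gradient descent performs at every step is the standard rule (with learning rate $\eta$):

```latex
w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial L}{\partial w}
```

Each weight moves a small step in the direction that decreases the loss $L$.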
Neural Network Training – Put it all together
Step 1: Feed Forward Step
For simplicity, assume initial weights are 0.5 and biases are 0.
Let’s take ReLU as the activation function.
Start with first sample from dataset
Step 2: Calculate Loss and Backpropagation
Let’s take MSE as the loss function.
Step 3: Update weights with gradient descent
Step 4: Repeat the process
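The four steps can be put together for a 2-input, 1-hidden-node network in plain Python. This is a minimal sketch: the sample values `x = [1.0, 2.0]`, `target = 1.0`, and the learning rate are our own stand-ins, not the numbers from the slides; only the initial weights (0.5), zero biases, ReLU, and MSE follow the worked example.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

# Hypothetical training sample (not from the slides)
x = np.array([1.0, 2.0])
target = 1.0

# Initial weights 0.5, biases 0, as in the worked example
w_hidden = np.array([0.5, 0.5])   # 2 inputs -> 1 hidden node
w_out = 0.5                       # hidden node -> output
lr = 0.1
losses = []

for step in range(10):
    # Step 1: feed-forward
    h_in = np.dot(x, w_hidden)
    h = relu(h_in)
    y_pred = w_out * h

    # Step 2: loss (MSE for one sample) and backpropagation (chain rule)
    loss = (y_pred - target) ** 2
    losses.append(loss)
    d_y = 2 * (y_pred - target)                # dL/dy_pred
    d_w_out = d_y * h                          # dL/dw_out
    d_h = d_y * w_out                          # dL/dh
    d_h_in = d_h * (1.0 if h_in > 0 else 0.0)  # through the ReLU
    d_w_hidden = d_h_in * x                    # dL/dw_hidden

    # Step 3: update weights with gradient descent
    w_out -= lr * d_w_out
    w_hidden -= lr * d_w_hidden
    # Step 4: repeat
```

Running the loop, the loss shrinks every iteration, which is exactly what the four steps are designed to achieve.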
Deep neural networks
Anatomy of a deep neural network
Layers
Input data and targets (e.g., images labeled "cat" / "dog")
Loss function
Optimizer
Animation from: https://imgur.com/s25RsOr
Smaller Network: CNN
A CNN arranges its neurons in three dimensions (width, height, depth). Every layer of a CNN transforms the 3D input volume to a 3D output volume. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).
A regular 3-layer Neural Network.
Consider learning from an image: a "beak" detector can represent a small region with fewer parameters.
The same pattern appears in different places: the detectors can be compressed!
What about training a lot of such "small" detectors, where each detector must "move around"?
“upper-left beak” detector
“middle beak” detector
They can be compressed to the same parameters.
What are Convolutional Neural Networks (CNNs)?
A simple CNN structure
CONV: Convolutional kernel layer
RELU: Activation function
POOL: Dimension reduction layer
FC: Fully connected layer
Convolutional Layer
Convolutional Kernels
Kernel
Feature Maps
Feature Activation Map
What is a Convolution?
[Diagram: the input is convolved with a kernel to produce a Feature Activation Map]
Convolution
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
6 x 6 image
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |
Filter 1
-1 | 1 | -1 |
-1 | 1 | -1 |
-1 | 1 | -1 |
Filter 2
……
These are the network parameters to be learned.
Each filter detects a small pattern (3 x 3).
Convolution
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
6 x 6 image
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |
Filter 1
stride = 1: the dot product of Filter 1 with the top-left 3 × 3 patch gives 3; sliding the filter one pixel to the right gives −1.
Convolution
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
6 x 6 image
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |
Filter 1
If stride = 2, the first two outputs are 3 and −3.
Convolution
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
6 x 6 image
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |
Filter 1
stride = 1 produces the full 4 × 4 feature map:

3 | -1 | -3 | -1 |
-3 | 1 | 0 | -3 |
-3 | -3 | 0 | 1 |
3 | -2 | -2 | -1 |
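The sliding-window computation can be reproduced in NumPy. The helper `convolve2d` is our own minimal implementation (technically cross-correlation, which is what CNN layers actually compute); the image and Filter 1 are taken from the slides, and the resulting 4 × 4 map matches the values shown above.

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

filter1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def convolve2d(img, kernel, stride=1):
    """Valid convolution: slide the kernel over the image, dot product at each stop."""
    k = kernel.shape[0]
    out_size = (img.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)  # dot product
    return out

fmap = convolve2d(image, filter1, stride=1)
# fmap[0, 0] == 3 and fmap[0, 1] == -1, matching the slide
```

Setting `stride=2` instead gives a 2 × 2 output whose first row is 3, −3, matching the stride-2 slide.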
Convolution
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
6 x 6 image
Filter 1 output (stride = 1):

3 | -1 | -3 | -1 |
-3 | 1 | 0 | -3 |
-3 | -3 | 0 | 1 |
3 | -2 | -2 | -1 |

Filter 2:
-1 | 1 | -1 |
-1 | 1 | -1 |
-1 | 1 | -1 |

Filter 2 output (stride = 1):

-1 | -1 | -1 | -1 |
-1 | -1 | -2 | 1 |
-1 | -1 | -2 | 1 |
-1 | 0 | -4 | 3 |

Repeat this for each filter. The two 4 × 4 images form a 2 × 4 × 4 Feature Map.
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
image
convolution
-1 | 1 | -1 |
-1 | 1 | -1 |
-1 | 1 | -1 |
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |
……
……
Convolution v.s. Fully Connected
Fully-connected
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
6 x 6 image
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |
Filter 1
[Diagram: the 6 × 6 image is flattened into a 36-dimensional vector; the first convolution output, 3, connects only to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15 of that vector]
Only connect to 9 inputs, not fully connected: fewer parameters!
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 |
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |
Filter 1
[Diagram: the outputs 3 and −1 connect to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15 and 2, 3, 4, 8, 9, 10, 14, 15, 16 of the flattened 6 × 6 image respectively, and the two neurons use the same 9 weights]
Shared weights: even fewer parameters!
A CNN compresses a fully connected network in two ways:
Reducing the number of connections (each output connects only to a small patch of the input)
Sharing weights across positions (every patch is scanned by the same filter)
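Counting parameters for the 6 × 6 example makes the two savings concrete. A back-of-the-envelope sketch:

```python
# Parameter counts for the 6x6-image example above
inputs = 6 * 6            # flattened image
outputs = 4 * 4           # one feature map from a 3x3 filter, stride 1

fully_connected = inputs * outputs   # every output sees every input
locally_connected = outputs * 9      # each output sees only a 3x3 patch
shared_weights = 9                   # one 3x3 filter shared by all positions

print(fully_connected, locally_connected, shared_weights)  # 576 144 9
```

From 576 weights down to 144 via local connectivity, and down to just 9 via weight sharing.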
Color image: RGB 3 channels
[Diagram: the input is now 6 × 6 × 3, one 6 × 6 matrix per R, G, B channel, and Filter 1 and Filter 2 each carry a matching 3 × 3 kernel per channel, i.e. each filter is 3 × 3 × 3]
ReLU Layer
Applying ReLU to Filter 1's feature map zeroes out the negative values:

3 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 |
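For the feature map from Filter 1, applying ReLU is a single NumPy call, and the result matches the values on the slide:

```python
import numpy as np

# Filter 1's feature map from the convolution slides
fmap = np.array([
    [ 3, -1, -3, -1],
    [-3,  1,  0, -3],
    [-3, -3,  0,  1],
    [ 3, -2, -2, -1],
])

relu_map = np.maximum(0, fmap)  # negative activations clipped to 0
```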
Pooling layer
Max Pooling

Filter 1:
1 | -1 | -1 |
-1 | 1 | -1 |
-1 | -1 | 1 |

Feature map:
3 | -1 | -3 | -1 |
-3 | 1 | 0 | -3 |
-3 | -3 | 0 | 1 |
3 | -2 | -2 | -1 |

Filter 2:
-1 | 1 | -1 |
-1 | 1 | -1 |
-1 | 1 | -1 |

Feature map:
-1 | -1 | -1 | -1 |
-1 | -1 | -2 | 1 |
-1 | -1 | -2 | 1 |
-1 | 0 | -4 | 3 |
Why Pooling
Subsampling: we can subsample the pixels to make the image smaller (a subsampled "bird" is still recognizably a bird), so fewer parameters are needed to characterize the image.
Key characteristics and functions of pooling layers
The whole CNN
Fully Connected Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flattened
Can repeat many times
Max Pooling
[The 6 × 6 image] → Conv → Max Pooling
Taking the maximum over each 2 × 2 block turns each 4 × 4 feature map into a new, smaller 2 × 2 image:

Filter 1:
3 | 0 |
3 | 1 |

Filter 2:
-1 | 1 |
0 | 3 |
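Non-overlapping 2 × 2 max pooling can be sketched with a reshape trick in NumPy. The helper `max_pool` is our own; applied to Filter 1's feature map it yields the 2 × 2 result:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    h, w = fmap.shape
    # Split into blocks, then take the max within each block
    return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Filter 1's feature map from the convolution slides
fmap = np.array([
    [ 3, -1, -3, -1],
    [-3,  1,  0, -3],
    [-3, -3,  0,  1],
    [ 3, -2, -2, -1],
])

pooled = max_pool(fmap)  # 4x4 -> 2x2
```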
The whole CNN
Convolution → Max Pooling → Convolution → Max Pooling … (can repeat many times)
A new image: smaller than the original image, and the number of channels is the number of filters.
The whole CNN
Fully Connected Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flattened
A new image
A new image
Flattening
The pooled 2 × 2 feature maps are flattened into a single vector and fed into a Fully Connected Feedforward network.
Fully Connected Layers
Hierarchical Feature Extraction:
Early layers retain most information (edge detectors).
Deeper layers move towards more abstract representations and encode high-level concepts.
Sparser representations: detect fewer (but more abstract) features.
Data Preparation
Basic CNN model definition
Model summary
Training
Evaluation
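The five steps above (data preparation, model definition, summary, training, evaluation) can be sketched end-to-end. This assumes the Keras API (`tensorflow.keras`) and uses small random stand-in data instead of a real dataset, so the reported accuracy is meaningless; the sketch only exercises the pipeline.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Data preparation: stand-in random "images" (28x28 grayscale, 10 classes)
x_train = np.random.rand(64, 28, 28, 1).astype("float32")
y_train = np.random.randint(0, 10, size=64)

# Basic CNN model definition: CONV -> RELU -> POOL repeated, then FC
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Model summary
model.summary()

# Training and evaluation (tiny run just to exercise the pipeline)
model.fit(x_train, y_train, epochs=1, batch_size=32, verbose=0)
loss, acc = model.evaluate(x_train, y_train, verbose=0)
```

With a real dataset (e.g., MNIST) the same structure applies; only the data-preparation step changes.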