2 of 31

Deep Learning for Computer Vision: Review

Source: 6.S191 Intro. to Deep Learning at MIT

3 of 31

Convolutional Autoencoder

4 of 31

Convolutional Autoencoder

Motivation: how can an autoencoder be applied to images?
A convolutional autoencoder extends the basic structure of the simple autoencoder by replacing the fully connected layers with convolutional layers.

The encoder network uses convolutional layers.
The decoder network uses transposed convolutional layers.

A transposed 2-D convolution layer upsamples feature maps.
This layer is sometimes incorrectly known as a "deconvolution" or "deconv" layer.
This layer is the transpose of convolution and does not perform deconvolution.
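
To make the upsampling behaviour concrete, here is a minimal pure-Python sketch of a 1-D transposed convolution. The function name `conv_transpose_1d` and the toy kernel are illustrative assumptions, not part of the original slides; the sketch uses no padding or cropping.

```python
def conv_transpose_1d(x, kernel, stride):
    """Transposed 1-D convolution (illustrative sketch, no padding or cropping).

    Each input value scatters a scaled copy of the kernel into the output,
    shifted by `stride` positions per input element.
    """
    k = len(kernel)
    out_len = (len(x) - 1) * stride + k
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += value * w
    return out


# A length-3 input becomes a length-6 output with stride 2.
upsampled = conv_transpose_1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2)
print(upsampled)  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

The scatter-and-add loop is the transpose of the gather-and-sum loop of an ordinary strided convolution, which is why the layer does not invert a convolution in the deconvolution sense.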

downsample

upsample

5 of 31

tf.keras.layers.Conv2D

Encoder
Padding

padding = 'VALID'

strides = (1, 1)

6 of 31

tf.keras.layers.Conv2D

Encoder
Padding

padding = 'VALID'

strides = (1, 1)

padding = 'SAME'

strides = (1, 1)

7 of 31

tf.keras.layers.Conv2D

Encoder
Stride

padding = 'SAME'

strides = (1, 1)

padding = 'SAME'

strides = (2, 2)
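
The effect of the padding and stride settings on the encoder can be checked with a little arithmetic. The sketch below uses the output-size rules that TensorFlow documents for 'VALID' and 'SAME' padding along one spatial dimension; the helper name `conv_output_size` and the example sizes are assumptions made for illustration.

```python
import math


def conv_output_size(n, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension.

    'VALID': no padding, the kernel must fit entirely inside the input.
    'SAME':  the input is padded so the output length depends only on the stride.
    """
    padding = padding.upper()
    if padding == "VALID":
        return math.ceil((n - kernel + 1) / stride)
    if padding == "SAME":
        return math.ceil(n / stride)
    raise ValueError("padding must be 'VALID' or 'SAME'")


# A 5-pixel-wide input with a 3-wide kernel (hypothetical sizes):
print(conv_output_size(5, 3, 1, "VALID"))  # 3
print(conv_output_size(5, 3, 1, "SAME"))   # 5
print(conv_output_size(5, 3, 2, "SAME"))   # 3
```

With 'SAME' padding, a stride of 2 roughly halves each spatial dimension, which is how the encoder downsamples.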

8 of 31

tf.keras.layers.Conv2DTranspose

Decoder
Stride

padding = 'VALID'

strides = (1, 1)

padding = 'VALID'

strides = (1, 1)

9 of 31

tf.keras.layers.Conv2DTranspose

Decoder
Stride

padding = 'VALID'

strides = (2, 2)

padding = 'VALID'

strides = (2, 2)

10 of 31

tf.keras.layers.Conv2DTranspose

Decoder
Stride

padding = 'SAME'

strides = (2, 2)

padding = 'SAME'

strides = (2, 2)
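
For the decoder, the three padding and stride combinations above give different output sizes. The sketch below follows the size rule that Keras applies to a transposed convolution, to the best of my understanding, when no explicit output padding is given and dilation is 1. The helper name `conv_transpose_output_size` and the 3-pixel example are assumptions for illustration.

```python
def conv_transpose_output_size(n, kernel, stride, padding):
    """Output length of a transposed convolution along one spatial dimension.

    Assumes no explicit output padding and a dilation rate of 1.
    """
    padding = padding.upper()
    if padding == "VALID":
        return n * stride + max(kernel - stride, 0)
    if padding == "SAME":
        return n * stride
    raise ValueError("padding must be 'VALID' or 'SAME'")


# A 3-pixel-wide feature map with a 3-wide kernel (hypothetical sizes):
print(conv_transpose_output_size(3, 3, 1, "VALID"))  # 5
print(conv_transpose_output_size(3, 3, 2, "VALID"))  # 7
print(conv_transpose_output_size(3, 3, 2, "SAME"))   # 6
```

With 'SAME' padding and stride 2 the output is exactly twice the input size, which makes it the natural counterpart of a stride-2 'SAME' convolution in the encoder.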

11 of 31

CAE Implementation

Fully convolutional
Note that no dense layer is used
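
As a rough illustration of a fully convolutional design, the sketch below traces the spatial size through a hypothetical autoencoder with two stride-2 'SAME' convolutions followed by two stride-2 'SAME' transposed convolutions. The layer counts and the function name `trace_cae_sizes` are assumptions and are not taken from the implementation shown on the slides.

```python
import math


def trace_cae_sizes(n, num_down=2, num_up=2, stride=2):
    """Spatial size after each layer of a hypothetical convolutional autoencoder.

    Downsampling layers use 'SAME' padding, so size -> ceil(size / stride).
    Upsampling layers use 'SAME' padding, so size -> size * stride.
    """
    sizes = [n]
    for _ in range(num_down):
        n = math.ceil(n / stride)
        sizes.append(n)
    for _ in range(num_up):
        n = n * stride
        sizes.append(n)
    return sizes


print(trace_cae_sizes(28))  # [28, 14, 7, 14, 28]
print(trace_cae_sizes(30))  # [30, 15, 8, 16, 32]
```

The second trace shows a practical caveat: the reconstruction matches the input size only when the input size is divisible by the total downsampling factor (4 in this sketch).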

15 of 31

Reconstruction Result

16 of 31

Segmentation

A segmentation task differs from a classification task because it requires predicting a class for each pixel of the input image, instead of a single class for the whole input.
Segmentation divides an image into regions that belong to different semantic categories, labeling and predicting objects at the pixel level.

Image from http://d2l.ai/

17 of 31

Segmentation

Classification needs to understand what is in the input (namely, the context).
Segmentation must predict a class for every pixel, so it needs to recover not only what is in the input, but also where.

Image from http://d2l.ai/

18 of 31

Semantic Segmentation: FCNs

A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels into pixel categories.

The network is built entirely from convolutional layers, with down-sampling and up-sampling operations.

At each spatial position, the output along the channel dimension is the category prediction for the pixel at that location.
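
The per-pixel prediction can be sketched in a few lines of plain Python: the network output has one score per class at every spatial position, and the predicted label is the index of the largest score. The function name `predict_label_map` and the toy scores are made up for illustration.

```python
def predict_label_map(scores):
    """Turn per-pixel class scores into a map of class indices.

    `scores[row][col]` is a list with one score per class (the channel
    dimension); the result has the same height and width as `scores`.
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# A 2x2 "image" with 3 classes per pixel (toy values).
scores = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
]
print(predict_label_map(scores))  # [[0, 1], [2, 0]]
```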

Image from http://d2l.ai/

19 of 31

From CAE to FCN

21 of 31

Skip Connection

A skip connection is a connection that bypasses at least one layer.

Here, it is often used to transfer local information by summing feature maps from the downsampling path with feature maps from the upsampling path.

Merging features from different resolution levels helps combine context information with spatial information.
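
A minimal sketch of such a merge, under stated assumptions: the coarse feature map from the upsampling path is enlarged by nearest-neighbour repetition (a stand-in for the learned transposed convolution) and then summed element-wise with the matching feature map from the downsampling path. The function names and values are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2-D feature map by repeating values."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_skip(decoder_fmap, encoder_fmap):
    """Element-wise sum of two feature maps that must have the same shape."""
    if len(decoder_fmap) != len(encoder_fmap) or any(
        len(d) != len(e) for d, e in zip(decoder_fmap, encoder_fmap)
    ):
        raise ValueError("feature maps must have the same shape")
    return [
        [d + e for d, e in zip(d_row, e_row)]
        for d_row, e_row in zip(decoder_fmap, encoder_fmap)
    ]


coarse = [[1.0, 2.0]]                      # 1x2 map from the upsampling path
fine = [[10.0, 20.0, 30.0, 40.0],
        [50.0, 60.0, 70.0, 80.0]]          # 2x4 map from the downsampling path
merged = add_skip(upsample_nearest_2x(coarse), fine)
print(merged)  # [[11.0, 21.0, 32.0, 42.0], [51.0, 61.0, 72.0, 82.0]]
```

The sum only works when both maps have the same height and width, which is why skip connections pair layers at the same resolution level.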

22 of 31

ResNet (Deep Residual Learning)

He, Kaiming, et al. “Deep residual learning for image recognition.” CVPR. 2016.
Plain net

23 of 31

ResNet (Deep Residual Learning)

He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016.
Residual net
Skip connection

- A direct connection between 2 non-consecutive layers

- Mitigates the vanishing gradient problem

24 of 31

ResNet (Deep Residual Learning)

If the identity mapping were optimal, it is easy to drive the residual weights to 0

If the optimal mapping is close to the identity, it is easier to learn the small residual fluctuations
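
The first point can be checked with a toy residual block computing y = F(x) + x. The sketch below is a deliberate simplification: F is a single fully connected layer with ReLU, not the stacked convolutions of an actual ResNet block, and the name `residual_block` is hypothetical.

```python
def residual_block(x, weights, bias):
    """Compute y = F(x) + x, where F(x) = ReLU(W x + b) is a toy residual branch."""
    fx = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * xi for w, xi in zip(row, x)) + b
        fx.append(max(pre_activation, 0.0))
    # Skip connection: add the input back onto the residual branch.
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]
zero_bias = [0.0] * 3

# With all residual weights at zero, the block reduces to the identity.
print(residual_block(x, zero_weights, zero_bias))  # [1.0, -2.0, 3.0]
```

A plain layer without the skip connection would instead have to learn an identity weight matrix to pass its input through unchanged.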

25 of 31

Residual Net

26 of 31

Fully Convolutional Networks (FCNs)

To obtain a segmentation map (output), segmentation networks usually have 2 parts:

Downsampling path: capture semantic/contextual information
Upsampling path: recover spatial information

The downsampling path is used to extract and interpret the context (what), while the upsampling path is used to enable precise localization (where).

Furthermore, to fully recover the fine-grained spatial information lost in the pooling or downsampling layers, we often use skip connections.

The network can work regardless of the original image size, because no stage requires a fixed number of units.
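
One way to see why the image size does not matter is to count parameters. A convolutional layer's weights depend only on the kernel size and channel counts, whereas a dense layer on the flattened image depends on the height and width. The helper names and layer sizes below are illustrative assumptions.

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus biases of a 2-D convolution; independent of image size."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(height, width, in_channels, units):
    """Weights plus biases of a dense layer applied to the flattened image."""
    return height * width * in_channels * units + units


# A 3x3 convolution from 3 to 16 channels has the same size for any image.
print(conv2d_params(3, 3, 16))      # 448

# A dense layer with 16 units must change when the image size changes.
print(dense_params(32, 32, 3, 16))  # 49168
print(dense_params(64, 64, 3, 16))  # 196624
```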

27 of 31

Segmented (Labeled) Images

input

output

28 of 31

FCN Architecture

[Figure: FCN architecture diagram with a fixed part and a trained part; labeled layers: maxp3, maxp4, fcn4, fcn3, fcn2, fcn1]

31 of 31

Segmentation Result

maxp3

maxp4

input

Segmentation output

overlapping