1 of 54

Training Deep Learning Models for Vision

Day 2

2 of 54

Convolutional Neural Networks

4 of 54

Convolutional layers

Disadvantages of fully connected layers:

  • Need a weight for each pixel -> many parameters per layer
  • Network architecture is restricted to a specific input size

Use translation equivariance:

  • Encode the convolution in a network layer
  • Weights only for a pixel's neighborhood (see the sketch below)
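To make the weight sharing concrete, here is a framework-free NumPy sketch (toy sizes, single filter, valid cross-correlation): the same 3x3 weights are reused at every pixel position, so the parameter count does not depend on the image size.

```python
import numpy as np

image = np.random.rand(32, 32)       # toy grayscale input
weights = np.random.rand(3, 3)       # one 3x3 filter: only 9 parameters

# Slide the same filter over every 3x3 neighborhood of the image.
out = np.zeros((30, 30))
for i in range(30):
    for j in range(30):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * weights)

print(out.shape)  # (30, 30): a transform of the image
```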

5 of 54

Convolutional layers

Input: an image. Output: a transform of the image.

[Figure: input image * weights -> 1st hidden layer]

9 of 54

Convolutional layers

#parameters: filter_size * input_channels * output_channels
Independent of the image size

[Figure: input image * weights -> 1st hidden layer]
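A quick check of this count (a sketch assuming PyTorch; channel counts and image size are illustrative). The conv layer additionally has one bias per output channel, and its size does not change with the image resolution, unlike a fully connected layer:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))   # 3*3 * 3 * 16 + 16 = 448

# A fully connected layer needs a weight per input pixel and channel:
dense = nn.Linear(3 * 224 * 224, 16)
print(sum(p.numel() for p in dense.parameters()))  # 2,408,464 for a 224x224 image
```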

13 of 54

Building a network

  • Chain the transforms

[Figure: input image * weights -> hidden layer, * weights -> next layer]
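A minimal sketch of such a chain, assuming PyTorch (channel counts and image size are illustrative):

```python
import torch
import torch.nn as nn

# Two convolutional transforms chained, with a nonlinearity in between.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 3, 64, 64)   # a batch with one RGB image
print(net(x).shape)             # torch.Size([1, 32, 64, 64])
```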

14 of 54

Pooling layers

Reduce the layer size by a “simple” operation applied via a sliding window; usually max pooling.

[Figure: some layer -> downsampling -> next layer]
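For example (a sketch assuming PyTorch), max pooling with a 2x2 window halves the spatial size of a feature map:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)    # keep the maximum of each 2x2 window
feature_map = torch.randn(1, 16, 64, 64)
print(pool(feature_map).shape)        # torch.Size([1, 16, 32, 32])
```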

19 of 54

Architectures for Vision

20 of 54

Architectures for vision

  • Conv layers + pooling are the building blocks
  • Large variety of architectures
  • More layers (usually) improve performance (given enough data)

Adaptations for training deeper models

22 of 54

ImageNet

Image classification dataset:

  • 1000 classes
  • 1 million images

An important benchmark in computer vision.

https://www.researchgate.net/figure/Performance-of-different-approaches-in-ImageNet-2015-competition_fig2_309392322

23 of 54

AlexNet

  • 8 layers (5 conv, 3 dense)
  • Large filters (11x11)
  • Popularized the ReLU activation
  • Won the 2012 ImageNet challenge

24 of 54

VGG

  • Smaller filters (3x3)
  • 19 layers

http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture09.pdf

25 of 54

ResNet

  • Each block learns a residual function (sketched below)
  • Mitigates vanishing gradients
  • Up to 152 layers (ResNet-152)

https://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/
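A minimal sketch of a residual block, assuming PyTorch (the channel count is illustrative; real ResNet blocks also include batch normalization and handle changes in spatial size and channel count):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the block only has to learn the residual F."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + residual)   # skip connection keeps gradients flowing

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```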

27 of 54

More architectures

  • Many more architecture variants:
    • GoogLeNet: parallel filters of different sizes
    • DenseNet: connections to all subsequent layers
  • ResNet is a good default
  • Architectures need to be adapted for other tasks (e.g. segmentation, object detection)
    • Image classification networks often serve as the “backbone”

28 of 54

Advanced training

29 of 54

Learning Rate

  • Controls size of gradient step�
  • Tradeoff: speed of convergence vs stability�

30 of 54

Learning Rate

  • Controls the size of the gradient step
  • Tradeoff: speed of convergence vs. stability

https://www.jeremyjordan.me/nn-learning-rate/

31 of 54

Choosing a Learning Rate

  • Monitor the loss (or another metric) on a validation set

32 of 54

MLP exercise with different learning rates

33 of 54

[Plot: lr=1e-4]

34 of 54

Choosing a Learning Rate

Learning rate schedulers:

  • Learning rate decay: decrease the learning rate over training, e.g. lr = lr * 1. / (1. + decay)
  • ReduceOnPlateau: decrease the learning rate when an observed variable stops improving, e.g. the validation loss
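Both schedulers exist as ready-made components in common frameworks; a minimal sketch assuming PyTorch (the model, factor, and patience values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate whenever the validation loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

for epoch in range(20):
    val_loss = 1.0 / (epoch + 1)              # placeholder for the real validation loss
    scheduler.step(val_loss)                  # scheduler reads the monitored value
```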

35 of 54

Optimizers

  • Mini-batch gradient descent:
    v_t = lr * grad_t
    w_t = w_{t-1} - v_t
  • Problems:
    • Choosing the learning rate can be difficult
    • Can easily get trapped in suboptimal local minima
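The update rule in code, as a framework-free NumPy sketch (the one-parameter objective f(w) = w^2 is made up for illustration):

```python
import numpy as np

def grad(w):
    return 2 * w          # gradient of the toy objective f(w) = w^2

w = np.array([5.0])
lr = 0.1
for t in range(100):
    v = lr * grad(w)      # v_t = lr * grad_t
    w = w - v             # w_t = w_{t-1} - v_t
print(w)                  # close to the minimum at 0
```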

37 of 54

Optimizers

  • Momentum:
    v_t = gamma * v_{t-1} + lr * grad_t
  • Speeds up convergence in “ravines”

[Figure: left: SGD, right: SGD with momentum]

https://ruder.io/optimizing-gradient-descent/
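The same toy NumPy example as above, now with a momentum term (gamma = 0.9 is a common choice):

```python
import numpy as np

def grad(w):
    return 2 * w                        # gradient of the toy objective f(w) = w^2

w, v = np.array([5.0]), np.array([0.0])
lr, gamma = 0.1, 0.9
for t in range(100):
    v = gamma * v + lr * grad(w)        # v_t = gamma * v_{t-1} + lr * grad_t
    w = w - v                           # weight update unchanged
print(w)
```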

38 of 54

Optimizers

  • Nesterov momentum: compute the gradient at the approximate next parameter position (look-ahead)
  • Speeds up convergence in “ravines”

[Figure: left: SGD, right: SGD with momentum]

https://ruder.io/optimizing-gradient-descent/
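The only change from plain momentum, shown on the same toy example: the gradient is evaluated at the look-ahead position w - gamma * v instead of at w:

```python
import numpy as np

def grad(w):
    return 2 * w                            # gradient of the toy objective f(w) = w^2

w, v = np.array([5.0]), np.array([0.0])
lr, gamma = 0.1, 0.9
for t in range(100):
    lookahead = w - gamma * v               # approximate next parameter position
    v = gamma * v + lr * grad(lookahead)    # gradient taken at the look-ahead point
    w = w - v
print(w)
```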

39 of 54

Optimizers

  • Adagrad: adapt the update step for each parameter individually:
    v_t = lr / sqrt(G + eps) * grad_t
  • G: per-parameter sum of squared past gradients
  • Idea: downweight updates for parameters with large past gradients
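A framework-free NumPy sketch of the per-parameter scaling (the two-parameter toy objective is made up; eps only avoids division by zero):

```python
import numpy as np

def grad(w):
    return np.array([20.0, 2.0]) * w       # one steep and one shallow direction

w = np.array([5.0, 5.0])
G = np.zeros(2)                             # per-parameter sum of squared gradients
lr, eps = 0.5, 1e-8
for t in range(200):
    g = grad(w)
    G += g ** 2                             # accumulate squared gradients
    w = w - lr / np.sqrt(G + eps) * g       # large past gradients -> smaller steps
print(w)
```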

40 of 54

Optimizers

  • Adadelta: Adagrad, but uses only a decaying window of past gradients in G
  • Adam: scales parameter updates by past gradients and adds a momentum term
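In practice these optimizers are used off the shelf; a minimal sketch assuming PyTorch (the model and batch are stand-ins; lr=1e-3 is the common Adam default):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()         # compute gradients
optimizer.step()        # Adam step: momentum + per-parameter scaling by past gradients
optimizer.zero_grad()   # reset gradients for the next mini-batch
```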

44 of 54

Data Augmentation

Increasing the amount of labeled data is expensive!

Idea: use transformations to get different but still valid data points.

[Figure: original image; flipped horizontally; color jitter; rotated 90 deg]
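A minimal sketch of an augmentation pipeline, assuming torchvision (the transforms mirror the figure; the parameter values are illustrative):

```python
from torchvision import transforms

# Applied on the fly during training, so every epoch sees slightly different images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=90),   # random rotation within +/- 90 degrees
    transforms.ToTensor(),
])
# Pass as `transform=augment` when constructing the training dataset.
```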

46 of 54

Normalization layers

With proper initialization, activations start out with roughly zero mean and unit variance -> this is lost as the parameters are updated.

Keep the activations normalized during training with normalization layers:

  • BatchNorm
  • InstanceNorm
  • GroupNorm
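A minimal sketch of a conv block with batch normalization, assuming PyTorch (channel counts are illustrative):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # re-normalizes each channel across the batch
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)        # batch of 8 images
print(block(x).shape)                # torch.Size([8, 16, 32, 32])
# nn.InstanceNorm2d / nn.GroupNorm are drop-in alternatives for small batches.
```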

49 of 54

More techniques

  • Labeled data is expensive: Pretraining
    • Use a network trained on a larger dataset similar to your data
    • In computer vision: often ImageNet
  • Networks start overfitting: Early stopping
    • Monitor the loss (or another metric) on a validation set
    • Take the best checkpoint according to this metric
  • Dropout
    • Use a random subset of neurons during training (see the sketch below)
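A minimal sketch of pretraining plus dropout, assuming PyTorch/torchvision (the 10-class head and dropout probability are illustrative; newer torchvision versions use a weights= argument instead of pretrained=True):

```python
import torch.nn as nn
from torchvision import models

# Pretraining: start from ImageNet weights and replace the classification head.
model = models.resnet18(pretrained=True)
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),                       # randomly drop neurons during training
    nn.Linear(model.fc.in_features, 10),     # new head for a 10-class task
)
```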

50 of 54

Best Practices

  • Optimizer: Adam
    • An LR scheduler is not necessary, but can be helpful
  • Use normalization layers
    • BatchNorm for sufficiently large batch sizes, otherwise InstanceNorm or GroupNorm
  • Good data augmentations help
  • Monitor a validation metric and use early stopping (see the sketch below)
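Putting these pieces together, a compact training-loop sketch assuming PyTorch (the model, data loaders, and patience value are placeholders, not a prescribed recipe):

```python
import copy
import torch

def train(model, train_loader, val_loader, epochs=50, patience=5):
    """Adam + validation monitoring + early stopping (hypothetical helper)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    best_loss, best_state, bad_epochs = float("inf"), copy.deepcopy(model.state_dict()), 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()

        model.eval()                      # validation pass
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                val_loss += criterion(model(x), y).item() * len(y)
                n += len(y)
        val_loss /= n

        if val_loss < best_loss:          # keep the best checkpoint
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:                             # early stopping counter
            bad_epochs += 1
            if bad_epochs >= patience:
                break

    model.load_state_dict(best_state)     # restore the best checkpoint
    return model
```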

51 of 54

Exercises

52 of 54

Some critique of yesterday's exercises

  • Logistic regression is not expressive enough for good results (even with filters)
  • The MLP works quite a bit better, but needs a lower learning rate
  • The filter task should be moved to the MLP exercise (after tuning the LR)
  • Expressing the filters via a CNN should be the very last exercise

53 of 54

Some critique of yesterday's exercises

The gist of the exercises:

“Classical” vision pipeline: fixed (convolutional) filters + classifier

Convolutional filters can be expressed as convolutional layers -> learnable

Today: learn it end to end with a CNN!

54 of 54

DL architectures on CIFAR

Send the link to your notebook on Gitter (or to adlcourse2020@gmail.com)