1 of 54

Training Deep Learning Models for Vision

Day 2

2 of 54

Convolutional Neural Networks

4 of 54

Convolutional layers

Disadvantages of fully connected layers:

  • Need a weight for each pixel -> many parameters per layer
  • Network architecture is restricted to a specific input size

Use translation equivariance:

  • Encode the convolution in a network layer
  • Weights only for a pixel's neighborhood (see the sketch below)
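To make the weight sharing concrete, here is a framework-free NumPy sketch (toy sizes, single filter, valid cross-correlation): the same 3x3 weights are reused at every pixel position, so the parameter count does not depend on the image size.

```python
import numpy as np

image = np.random.rand(32, 32)       # toy grayscale input
weights = np.random.rand(3, 3)       # one 3x3 filter: only 9 parameters

# Slide the same filter over every 3x3 neighborhood of the image.
out = np.zeros((30, 30))
for i in range(30):
    for j in range(30):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * weights)

print(out.shape)  # (30, 30): a transform of the image
```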

5 of 54

Convolutional layers

Input: an image. Output: a transform of the image.

[Figure: input image * weights -> 1st hidden layer]

9 of 54

Convolutional layers

#parameters: filter_size * input_channels * output_channels
Independent of the image size

[Figure: input image * weights -> 1st hidden layer]
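A quick check of this count (a sketch assuming PyTorch; channel counts and image size are illustrative). The conv layer additionally has one bias per output channel, and its size does not change with the image resolution, unlike a fully connected layer:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))   # 3*3 * 3 * 16 + 16 = 448

# A fully connected layer needs a weight per input pixel and channel:
dense = nn.Linear(3 * 224 * 224, 16)
print(sum(p.numel() for p in dense.parameters()))  # 2,408,464 for a 224x224 image
```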

13 of 54

Building a network

  • Chain the transforms

[Figure: input image * weights -> hidden layer, * weights -> next layer]
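A minimal sketch of such a chain, assuming PyTorch (channel counts and image size are illustrative):

```python
import torch
import torch.nn as nn

# Two convolutional transforms chained, with a nonlinearity in between.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 3, 64, 64)   # a batch with one RGB image
print(net(x).shape)             # torch.Size([1, 32, 64, 64])
```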

14 of 54

Pooling layers

Reduce the layer size by a “simple” operation applied via a sliding window; usually max pooling.

[Figure: some layer -> downsampling -> next layer]
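For example (a sketch assuming PyTorch), max pooling with a 2x2 window halves the spatial size of a feature map:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)    # keep the maximum of each 2x2 window
feature_map = torch.randn(1, 16, 64, 64)
print(pool(feature_map).shape)        # torch.Size([1, 16, 32, 32])
```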

19 of 54

Architectures for Vision

20 of 54

Architectures for vision

  • Conv layers + pooling are the building blocks
  • Large variety of architectures
  • More layers (usually) improve performance (given enough data)

Adaptations for training deeper models

22 of 54

ImageNet

Image classification dataset:

  • 1000 classes
  • 1 million images

An important benchmark in computer vision.

https://www.researchgate.net/figure/Performance-of-different-approaches-in-ImageNet-2015-competition_fig2_309392322

23 of 54

AlexNet

  • 8 layers (5 conv, 3 dense)
  • Large filters (11x11)
  • Popularized the ReLU activation
  • Won the 2012 ImageNet challenge

24 of 54

VGG

  • Smaller filters (3x3)
  • 19 layers

http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture09.pdf

25 of 54

ResNet

  • Each block learns a residual function (sketched below)
  • Mitigates vanishing gradients
  • Up to 152 layers (ResNet-152)

https://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/
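A minimal sketch of a residual block, assuming PyTorch (the channel count is illustrative; real ResNet blocks also include batch normalization and handle changes in spatial size and channel count):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the block only has to learn the residual F."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + residual)   # skip connection keeps gradients flowing

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```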

27 of 54

More architectures

  • Many more architecture variants:
    • GoogLeNet: parallel filters of different sizes
    • DenseNet: connections to all subsequent layers
  • ResNet is a good default
  • Architectures need to be adapted for other tasks (e.g. segmentation, object detection)
    • Image classification networks often serve as the “backbone”

28 of 54

Advanced training

29 of 54

Learning Rate

  • Controls size of gradient step�
  • Tradeoff: speed of convergence vs stability�

30 of 54

Learning Rate

  • Controls the size of the gradient step
  • Tradeoff: speed of convergence vs. stability

https://www.jeremyjordan.me/nn-learning-rate/

31 of 54

Choosing a Learning Rate

  • Monitor the loss (or another metric) on a validation set

32 of 54

MLP exercise with different learning rates

33 of 54

[Plot: lr=1e-4]

34 of 54

Choosing a Learning Rate

Learning rate schedulers:

  • Learning rate decay: decrease the learning rate over training, e.g. lr = lr * 1. / (1. + decay)
  • ReduceOnPlateau: decrease the learning rate when an observed variable stops improving, e.g. the validation loss
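Both schedulers exist as ready-made components in common frameworks; a minimal sketch assuming PyTorch (the model, factor, and patience values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate whenever the validation loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

for epoch in range(20):
    val_loss = 1.0 / (epoch + 1)              # placeholder for the real validation loss
    scheduler.step(val_loss)                  # scheduler reads the monitored value
```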

35 of 54

Optimizers

  • Mini-batch gradient descent:
    v_t = lr * grad_t
    w_t = w_{t-1} - v_t
  • Problems:
    • Choosing the learning rate can be difficult
    • Can easily get trapped in suboptimal local minima
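The update rule in code, as a framework-free NumPy sketch (the one-parameter objective f(w) = w^2 is made up for illustration):

```python
import numpy as np

def grad(w):
    return 2 * w          # gradient of the toy objective f(w) = w^2

w = np.array([5.0])
lr = 0.1
for t in range(100):
    v = lr * grad(w)      # v_t = lr * grad_t
    w = w - v             # w_t = w_{t-1} - v_t
print(w)                  # close to the minimum at 0
```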

37 of 54

Optimizers

  • Momentum:
    v_t = gamma * v_{t-1} + lr * grad_t
  • Speeds up convergence in “ravines”

[Figure: left: SGD, right: SGD with momentum]

https://ruder.io/optimizing-gradient-descent/
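The same toy NumPy example as above, now with a momentum term (gamma = 0.9 is a common choice):

```python
import numpy as np

def grad(w):
    return 2 * w                        # gradient of the toy objective f(w) = w^2

w, v = np.array([5.0]), np.array([0.0])
lr, gamma = 0.1, 0.9
for t in range(100):
    v = gamma * v + lr * grad(w)        # v_t = gamma * v_{t-1} + lr * grad_t
    w = w - v                           # weight update unchanged
print(w)
```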

38 of 54

Optimizers

  • Nesterov momentum: compute the gradient at the approximate next parameter position (look-ahead)
  • Speeds up convergence in “ravines”

[Figure: left: SGD, right: SGD with momentum]

https://ruder.io/optimizing-gradient-descent/
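The only change from plain momentum, shown on the same toy example: the gradient is evaluated at the look-ahead position w - gamma * v instead of at w:

```python
import numpy as np

def grad(w):
    return 2 * w                            # gradient of the toy objective f(w) = w^2

w, v = np.array([5.0]), np.array([0.0])
lr, gamma = 0.1, 0.9
for t in range(100):
    lookahead = w - gamma * v               # approximate next parameter position
    v = gamma * v + lr * grad(lookahead)    # gradient taken at the look-ahead point
    w = w - v
print(w)
```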

39 of 54

Optimizers

  • Adagrad: adapt the update step for each parameter individually:
    v_t = lr / sqrt(G + eps) * grad_t
  • G: per-parameter sum of squared past gradients
  • Idea: downweight updates for parameters with large past gradients
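A framework-free NumPy sketch of the per-parameter scaling (the two-parameter toy objective is made up; eps only avoids division by zero):

```python
import numpy as np

def grad(w):
    return np.array([20.0, 2.0]) * w       # one steep and one shallow direction

w = np.array([5.0, 5.0])
G = np.zeros(2)                             # per-parameter sum of squared gradients
lr, eps = 0.5, 1e-8
for t in range(200):
    g = grad(w)
    G += g ** 2                             # accumulate squared gradients
    w = w - lr / np.sqrt(G + eps) * g       # large past gradients -> smaller steps
print(w)
```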

40 of 54

Optimizers

  • Adadelta: Adagrad, but uses only a decaying window of past gradients in G
  • Adam: scales parameter updates by past gradients and adds a momentum term
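In practice these optimizers are used off the shelf; a minimal sketch assuming PyTorch (the model and batch are stand-ins; lr=1e-3 is the common Adam default):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()         # compute gradients
optimizer.step()        # Adam step: momentum + per-parameter scaling by past gradients
optimizer.zero_grad()   # reset gradients for the next mini-batch
```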

44 of 54

Data Augmentation

Increasing the amount of labeled data is expensive!

Idea: use transformations to get different but still valid data points.

[Figure: original image; flipped horizontally; color jitter; rotated 90 deg]
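A minimal sketch of an augmentation pipeline, assuming torchvision (the transforms mirror the figure; the parameter values are illustrative):

```python
from torchvision import transforms

# Applied on the fly during training, so every epoch sees slightly different images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=90),   # random rotation within +/- 90 degrees
    transforms.ToTensor(),
])
# Pass as `transform=augment` when constructing the training dataset.
```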

46 of 54

Normalization layers

With proper initialization, activations start out with roughly zero mean and unit variance -> this is lost as the parameters are updated.

Keep the activations normalized during training with normalization layers:

  • BatchNorm
  • InstanceNorm
  • GroupNorm
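A minimal sketch of a conv block with batch normalization, assuming PyTorch (channel counts are illustrative):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # re-normalizes each channel across the batch
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)        # batch of 8 images
print(block(x).shape)                # torch.Size([8, 16, 32, 32])
# nn.InstanceNorm2d / nn.GroupNorm are drop-in alternatives for small batches.
```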

49 of 54

More techniques

  • Labeled data is expensive: Pretraining
    • Use a network trained on a larger dataset similar to your data
    • In computer vision: often ImageNet
  • Networks start overfitting: Early stopping
    • Monitor the loss (or another metric) on a validation set
    • Take the best checkpoint according to this metric
  • Dropout
    • Use a random subset of neurons during training (see the sketch below)
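A minimal sketch of pretraining plus dropout, assuming PyTorch/torchvision (the 10-class head and dropout probability are illustrative; newer torchvision versions use a weights= argument instead of pretrained=True):

```python
import torch.nn as nn
from torchvision import models

# Pretraining: start from ImageNet weights and replace the classification head.
model = models.resnet18(pretrained=True)
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),                       # randomly drop neurons during training
    nn.Linear(model.fc.in_features, 10),     # new head for a 10-class task
)
```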

50 of 54

Best Practices

  • Optimizer: Adam
    • An LR scheduler is not necessary, but can be helpful
  • Use normalization layers
    • BatchNorm for sufficiently large batch sizes, otherwise InstanceNorm or GroupNorm
  • Good data augmentations help
  • Monitor a validation metric and use early stopping (see the sketch below)
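Putting these pieces together, a compact training-loop sketch assuming PyTorch (the model, data loaders, and patience value are placeholders, not a prescribed recipe):

```python
import copy
import torch

def train(model, train_loader, val_loader, epochs=50, patience=5):
    """Adam + validation monitoring + early stopping (hypothetical helper)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    best_loss, best_state, bad_epochs = float("inf"), copy.deepcopy(model.state_dict()), 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()

        model.eval()                      # validation pass
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                val_loss += criterion(model(x), y).item() * len(y)
                n += len(y)
        val_loss /= n

        if val_loss < best_loss:          # keep the best checkpoint
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:                             # early stopping counter
            bad_epochs += 1
            if bad_epochs >= patience:
                break

    model.load_state_dict(best_state)     # restore the best checkpoint
    return model
```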

51 of 54

Exercises

52 of 54

Some critique of yesterday's exercises

  • Logistic regression is not expressive enough for good results (even with filters)
  • The MLP works quite a bit better, but needs a lower learning rate
  • The filter task should be moved to the MLP exercise (after tuning the LR)
  • Expressing the filters via a CNN should be the very last exercise

53 of 54

Some critique of yesterday's exercises

The gist of the exercises:

“Classical” vision pipeline: fixed (convolutional) filters + classifier

Convolutional filters can be expressed as convolutional layers -> learnable

Today: learn it end to end with a CNN!

54 of 54

DL architectures on CIFAR

Send the link to your notebook on Gitter (or to adlcourse2020@gmail.com)