Training Deep Learning Models for Vision

Day 2

Convolutional Neural Networks

Convolutional layers

Disadvantages of fully connected layers:

  • Need a weight for each pixel -> many parameters per layer
  • Network architecture restricted to a specific input size

Convolutional layers

Use translation equivariance:

  • Encode convolution in network layer
  • Weights only for pixel neighborhood
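
To make the two bullets above concrete, here is a minimal pure-Python sketch of a convolution over a pixel neighborhood, followed by a check of translation equivariance. The toy image, the kernel values and the helper names `conv2d` and `shift_right` are assumptions for illustration only. Like most deep learning libraries, the sketch computes a cross-correlation without padding.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of a 2D list with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Each output pixel only looks at a kh x kw neighborhood of the input.
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(image, fill=0):
    """Shift every row one pixel to the right, padding with `fill` on the left."""
    return [[fill] + row[:-1] for row in image]


# Toy 5x6 image with a single bright pixel and a 3x3 kernel (hypothetical values).
image = [[0] * 6 for _ in range(5)]
image[2][2] = 1
kernel = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

out = conv2d(image, kernel)
out_of_shifted = conv2d(shift_right(image), kernel)

# Equivariance: convolving the shifted image gives the shifted convolution result
# (compared away from the left border, where the shift introduces padding).
equivariant = all(row_s[1:] == row[:-1] for row_s, row in zip(out_of_shifted, out))
```

The same small set of kernel weights is reused at every image position, which is why the layer does not need one weight per pixel.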

Convolutional layers

Input: Image
Output: Transform of the image


Input image


1st hidden layer

Convolutional layers


#parameters: filter_size * input_channels * output_channels
Independent of image size

1st hidden layer

Input image
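
The parameter formula can be compared with a fully connected layer in a few lines of arithmetic. This is a sketch under stated assumptions: biases are ignored, filter_size is taken as kernel height times kernel width, and the image sizes and channel counts are made-up examples.

```python
def dense_params(in_h, in_w, in_channels, out_units):
    """Weights of a fully connected layer: one per input value and output unit."""
    return in_h * in_w * in_channels * out_units


def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights of a conv layer: filter_size * input_channels * output_channels."""
    return kernel_h * kernel_w * in_channels * out_channels


# Hypothetical example: RGB input, 16 output units / output channels.
dense_small = dense_params(32, 32, 3, 16)     # grows with the image size
dense_large = dense_params(224, 224, 3, 16)
conv_any_size = conv_params(3, 3, 3, 16)      # same for every image size
```

In this example the dense layer needs 49152 weights for a 32x32 image and 2408448 for a 224x224 image, while the 3x3 conv layer needs 432 in both cases.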


Building a network

  • Chain the transforms



Input image



Pooling layers

Reduce layer size by applying a “simple” operation over a sliding window
Usually max pooling


Some layer


Next layer
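
A minimal sketch of max pooling as a sliding-window operation. The window size of 2, the stride of 2 and the example values are assumptions, not taken from the slides.

```python
def max_pool2d(layer, size=2, stride=2):
    """Slide a size x size window over a 2D list and keep the maximum of each window."""
    out_h = (len(layer) - size) // stride + 1
    out_w = (len(layer[0]) - size) // stride + 1
    return [
        [
            max(layer[i * stride + di][j * stride + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


layer = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]
pooled = max_pool2d(layer)  # 4x4 input -> 2x2 output: [[4, 2], [2, 7]]
```

Pooling has no learnable weights; it only reduces the spatial size of the layer.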

Building a network



Input image
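
When conv and pooling layers are chained, it helps to track how the spatial size changes from layer to layer. The sketch below assumes convolutions without padding and stride 1, plus non-overlapping pooling windows; the layer list and the input size of 32 are hypothetical.

```python
def conv_out(size, kernel):
    """Output size of a convolution without padding and with stride 1."""
    return size - kernel + 1


def pool_out(size, window=2, stride=2):
    """Output size of a pooling layer."""
    return (size - window) // stride + 1


# Hypothetical chain: conv 3x3 -> pool 2x2 -> conv 3x3 -> pool 2x2
layers = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2)]

size = 32
sizes = [size]
for kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k, k)
    sizes.append(size)
# sizes == [32, 30, 15, 13, 6]
```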



Architectures for Vision

Architectures for vision

  • Conv layers + pooling are the building blocks
  • Large variety of architectures
  • More layers (usually) improve performance (given enough data)

Adaptations for training deeper models

ImageNet: image classification dataset

  • 1000 classes
  • 1 million images

Important benchmark in computer vision


AlexNet

  • 8 layers (5 conv, 3 dense)
  • Large filters (11x11 in the first layer)
  • Popularized ReLU
  • Won the 2012 ImageNet Challenge

VGG

  • Smaller filters (3x3)
  • 19 layers



ResNet

  • Block learns a residual function
  • Mitigates vanishing gradients
  • Up to 152 layers
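
A minimal sketch of a residual block: the output is the input plus a learned residual, y = x + F(x). The two branch functions are hypothetical stand-ins for the conv layers inside a real block.

```python
def residual_block(x, f):
    """Return x + f(x): the block only has to learn the residual f."""
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    """Stand-in for a branch that has learned nothing yet."""
    return [0.0 for _ in x]


def small_branch(x):
    """Stand-in for a branch that has learned a small correction."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_branch)    # the block passes x through unchanged
corrected_out = residual_block(x, small_branch)  # x plus a small correction
```

Because of the skip connection, the derivative of the output with respect to x contains an identity term, so gradients have a direct path through the block even when the branch contributes little.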


More architectures

  • Many more architecture variants:
    • GoogLeNet: parallel filters of different sizes
    • DenseNet: connections to all subsequent layers
  • ResNet is a good default

  • Architectures need to be adapted for other tasks (e.g. segmentation, object detection)
    • Image classification networks often serve as the “backbone”

Advanced training

Learning Rate

  • Controls the size of the gradient step
  • Tradeoff: speed of convergence vs. stability
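
The tradeoff can be seen on a toy problem. The sketch below runs plain gradient descent on the loss w^2 (gradient 2w) with three learning rates; the loss function and the specific rates are assumptions chosen for illustration.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize the toy loss w**2 with a fixed learning rate; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return w


too_small = abs(gradient_descent(lr=0.01))  # stable, but still far from 0 after 20 steps
good = abs(gradient_descent(lr=0.4))        # converges quickly towards 0
too_large = abs(gradient_descent(lr=1.1))   # overshoots each step and diverges
```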

Choosing a Learning Rate

  • Monitor the loss (or another metric) on the validation set

MLP exercise with different learning rates

Choosing a Learning Rate

Learning rate schedulers:

  • Learning rate decay: decrease over training
      lr = lr * 1 / (1 + decay)
  • ReduceOnPlateau: decrease the learning rate when an observed variable stops improving, e.g. the validation loss
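
Both schedulers can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular library: the class name, the `factor` and `patience` parameters and the loss values are assumptions.

```python
def decayed_lr(lr, decay):
    """One application of the decay rule: lr = lr * 1 / (1 + decay)."""
    return lr * 1.0 / (1.0 + decay)


class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = ReduceOnPlateau(lr=0.1, factor=0.5, patience=2)
val_losses = [1.0, 0.8, 0.81, 0.82, 0.7]  # hypothetical validation losses per epoch
lrs = [scheduler.step(loss) for loss in val_losses]
```

In this run the learning rate is halved after the second epoch in a row without improvement and then stays at the lower value.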

  • Mini-batch gradient descent:
      v_t = lr * grad_t
      w_t = w_{t-1} - v_t
  • Problems
    • Choosing the learning rate can be difficult
    • Can easily get trapped in suboptimal local minima

  • Momentum:
      v_t = gamma * v_{t-1} + lr * grad_t
  • Speeds up convergence in “ravines”

Figure: left SGD, right SGD with momentum
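
A small sketch comparing plain gradient descent with momentum on a “ravine”, here a quadratic loss that is 50 times steeper in one direction than in the other. The loss, the start point and the hyperparameters are assumptions. Setting gamma to 0 recovers the update without momentum.

```python
def grad(w):
    """Gradient of the ravine loss 0.5 * (w[0]**2 + 50 * w[1]**2)."""
    return [w[0], 50.0 * w[1]]


def loss(w):
    return 0.5 * (w[0] ** 2 + 50.0 * w[1] ** 2)


def run(lr, gamma, steps=100):
    """Apply v_t = gamma * v_{t-1} + lr * grad_t and w_t = w_{t-1} - v_t."""
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        w = [wi - vi for wi, vi in zip(w, v)]
    return loss(w)


sgd_loss = run(lr=0.01, gamma=0.0)       # slow progress along the flat direction
momentum_loss = run(lr=0.01, gamma=0.9)  # accumulated velocity speeds this up
```

With the same learning rate and number of steps, the run with momentum ends at a clearly lower loss in this toy setting.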


  • Adagrad: adapt the update step for each parameter individually:
      v_t = lr / sqrt(G + eps) * grad_t
  • G: sum of squared past gradients (per parameter)
  • Idea: downweight updates for parameters with large past gradients
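
A minimal sketch of one Adagrad step for a list of parameters, following the update rule above. The function name, the default values and the example gradients are assumptions.

```python
import math


def adagrad_step(w, grad, G, lr=0.1, eps=1e-8):
    """One Adagrad update; G holds the running sum of squared gradients per parameter."""
    G = [Gi + gi * gi for Gi, gi in zip(G, grad)]
    w = [wi - lr / math.sqrt(Gi + eps) * gi for wi, Gi, gi in zip(w, G, grad)]
    return w, G


w, G = [1.0, 1.0], [0.0, 0.0]
grad = [10.0, 0.1]  # one parameter with a large and one with a small gradient
w, G = adagrad_step(w, grad, G)
```

Although the two gradients differ by a factor of 100, both parameters move by roughly lr in this first step, because each update is divided by the size of that parameter's own past gradients.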

  • Adadelta: like Adagrad, but G only takes a limited number of past gradients into account
  • Adam: scales parameter updates by past gradients and adds a momentum term
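
A minimal sketch of one Adam step in its commonly used form with bias correction. The slides give no formula, so the exact form and the default hyperparameters here are assumptions based on the standard algorithm.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    # Momentum term: running average of the gradients.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # Scaling term: running average of the squared gradients.
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias correction, because m and v start at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps) for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = adam_step(w, [10.0, 0.1], m, v, t=1)
```

As with Adagrad, both parameters move by about lr in the first step, independent of the scale of their gradients.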

Data Augmentation

Increasing the amount of labeled data is expensive!

Idea: apply a transformation to get a different but still valid data point

Examples (figure panels): Original image, Flipped horizontally, Color jitter, Rotate 90 deg
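
Two of these transformations can be sketched on an image stored as a nested list. The helper names and the tiny example image are hypothetical, and color jitter is left out to keep the sketch short. The label of the image stays the same after the transformation.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image, rng):
    """Randomly flip the image; each call may return a different valid sample."""
    if rng.random() < 0.5:
        image = flip_horizontal(image)
    return image


image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = flip_horizontal(image)  # [[3, 2, 1], [6, 5, 4]]
rotated = rotate_90(image)        # [[4, 1], [5, 2], [6, 3]]
sample = augment(image, random.Random(0))
```

Whether a transformation yields a valid data point depends on the task: a horizontal flip is fine for most natural images, but not for images of text, for example.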

Normalization layers

Parameters are initialized with zero mean and unit variance
-> this is lost with parameter updates

Normalization layers keep the layer activations normalized during training

  • BatchNorm
  • InstanceNorm
  • GroupNorm
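
The layers differ in which values are normalized together. The sketch below shows this for BatchNorm and InstanceNorm on a tiny batch. It is a simplification under stated assumptions: activations are stored as activations[sample][channel] with a flat list of spatial positions, and the learnable scale and shift as well as the running statistics of real layers are omitted. GroupNorm would normalize over groups of channels within each sample.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of values to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(acts):
    """Normalize each channel over all samples in the batch and all positions."""
    n_channels = len(acts[0])
    out = [[None] * n_channels for _ in acts]
    for c in range(n_channels):
        pooled = [v for sample in acts for v in sample[c]]
        normed = normalize(pooled)
        size = len(acts[0][c])
        for n in range(len(acts)):
            out[n][c] = normed[n * size:(n + 1) * size]
    return out


def instance_norm(acts):
    """Normalize each channel of each sample separately."""
    return [[normalize(channel) for channel in sample] for sample in acts]


# Hypothetical batch: 2 samples, 2 channels, 4 spatial positions each.
activations = [
    [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]],
    [[5.0, 6.0, 7.0, 8.0], [30.0, 30.0, 40.0, 40.0]],
]
bn = batch_norm(activations)
inst = instance_norm(activations)
```

BatchNorm statistics depend on the other samples in the batch, which is why small batches make them noisy; InstanceNorm and GroupNorm do not have this dependence.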

More techniques

  • Labeled data is expensive: Pretraining
    • Use network trained on larger dataset similar to your data
    • In computer vision: often ImageNet

More techniques

  • Networks start overfitting: Early stopping
    • Monitor the loss (or other metric) on validation set
    • Take the best checkpoint according to this metric
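
A minimal sketch of early stopping on a list of validation losses. The `patience` parameter (how many epochs to wait for an improvement) and the loss values are assumptions; in a real training loop the model weights would be saved whenever a new best value is found.

```python
def best_checkpoint(val_losses, patience=2):
    """Return (best epoch, best loss, epochs actually run) with early stopping."""
    best_epoch, best_loss = None, float("inf")
    epochs_run = 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # New best value: this is where a checkpoint would be saved.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch, best_loss, epochs_run


val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.5]  # hypothetical values per epoch
result = best_checkpoint(val_losses)  # (2, 0.6, 5)
```

Note the cost of stopping early in this example: training ends after five epochs, so the lower loss at the last epoch is never reached. A larger patience trades training time against this risk.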

More techniques

  • Dropout
    • Use a random subset of neurons in each training step
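
A minimal sketch of dropout on a list of activations. It assumes the common “inverted” variant, in which the kept activations are rescaled during training so that nothing has to change at test time; the slides do not specify the variant.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training and rescale the rest."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 1000
train_out = dropout(activations, p=0.5, rng=rng)                 # values are 0.0 or 2.0
test_out = dropout(activations, p=0.5, rng=rng, training=False)  # unchanged
```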

Best Practices

  • Optimizer: Adam
    • LR scheduler not necessary, but can be helpful
  • Use normalization layers
    • BatchNorm for sufficiently large batch sizes, otherwise InstanceNorm or GroupNorm
  • Good data augmentations help
  • Monitor a validation metric, use early stopping

Some critique of yesterday's exercises

  • Logistic regression is not expressive enough for good results (even with filters)
  • The MLP works quite a bit better, but needs a lower learning rate
  • Should move the filter task to the MLP (after tuning the LR)
  • Expressing filters via a CNN should be the very last exercise

Some critique of yesterday's exercises

The gist of the exercises:
“Classical” vision pipeline: fixed (convolutional) filters + classifier

Convolutional filters can be expressed as convolutional layers
-> learnable

Today: learn it end to end via a CNN!

DL architectures on CIFAR

