1 of 52

Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation

May 2019, Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, Kaisheng Ma

Presented by:

Marc-André Ruel and Yan Côté

2 of 52

AGENDA: Be Your Own Teacher

  1. Introduction
  2. Related Work
  3. Self-Distillation Framework
  4. Experiments
  5. Discussion and Future Works
  6. Conclusion
  7. Questions

3 of 52

Introduction: Motivation

(Figure) Accuracy vs. inference time trade-off.

A paradox in some applications:

  1. intolerance to errors (medical, automated driving)
  2. need for fast inference responses
  3. ability to meet hardware limitations (edge devices)

4 of 52

Introduction: Compression Approaches (May 2019)

Knowledge Distillation (2014, Geoffrey Hinton et al.)¹

Lightweight network design (SqueezeNet / MobileNet CNNs)

Pruning: Removing small-weight Connections

Quantization (INT8, INT16, FP32)

¹ Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean

(Left) SqueezeNet Architecture

(Right) Fire modules content

5 of 52

Introduction: Knowledge Distillation Setback

Inspired by the teacher/student paradigm, knowledge distillation aims to replace an over-parameterized teacher with a compact student, resulting in compression and acceleration.

Challenges:

  1. Low efficiency in knowledge transfer.
  2. Designing a proper teacher model and finding the best architecture.

6 of 52

Introduction: Proposition

One-step self-distillation framework

  • Requires less training time (4.6× faster)
  • Achieves higher accuracy than both the baseline and traditional distillation.

Comparison of training complexity, training time, and accuracy between traditional distillation and proposed self distillation (reported on CIFAR100).

7 of 52

Introduction: Contributions

  • Self distillation improves the performance of CNNs by a large margin at no cost in response time: a 2.65% accuracy boost on average.
  • Self distillation provides a single neural network executable at different depths, permitting adaptive accuracy-efficiency trade-offs on resource-limited edge devices.
  • Experiments with five kinds of CNNs on two datasets demonstrate the generality of the technique.

8 of 52

Self Distillation Framework

9 of 52

Self-Distillation Framework: Presentation

  • Divide the Target CNN in several shallow sections
  • Add a BottleNeck and a FC layer on each section for training

Figure: a ResNet equipped with self distillation. (i) The ResNet is divided into four sections; (ii) an additional bottleneck and FC layer are set after each section; (iii) all of the classifiers can be used independently, with different accuracy and response time; (iv) each classifier is trained under three kinds of supervision, as depicted; (v) parts under the dashed line can be removed at inference.
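As a sketch of the idea (not the authors' code; the module choices and channel sizes below are illustrative stand-ins), attaching an extra bottleneck-plus-FC classifier head after each backbone section might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class MultiExitNet(nn.Module):
    """Toy backbone split into four sections, each followed by a classifier head."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Stand-ins for the four ResNet sections.
        self.sections = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU())
            for c_in, c_out in [(3, 16), (16, 32), (32, 64), (64, 128)]
        ])
        # One bottleneck (1x1 conv) + FC head per section.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, 128, 1), nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(), nn.Linear(128, num_classes))
            for c in [16, 32, 64, 128]
        ])

    def forward(self, x):
        logits = []
        for section, head in zip(self.sections, self.heads):
            x = section(x)
            logits.append(head(x))
        return logits  # one prediction per depth; the last one acts as the teacher
```

At inference, any prefix of the sections plus its head can be kept and the rest dropped, which is what gives the depth/accuracy trade-off.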

10 of 52

Self-Distillation Framework: BottleNeck

The bottleneck layers are responsible for reducing and then restoring dimensions.

In practice, bottlenecks are mainly used to decrease model size, at a slight cost in accuracy.

Here they are also used by the hint loss, to reconcile the different feature-map sizes.

A deeper residual function F for ImageNet. Left: a standard ResNet building block. Right: a “bottleneck” building block for ResNet-50/101/152.

11 of 52

Separable Convolution

  • Reduce the model size dramatically.

Images from Kunlun Bai

(Figures) A standard 2D convolution vs. a depthwise separable convolution, which proceeds in two steps: a depthwise convolution followed by a pointwise (1×1) convolution.

  • In the final version of the code, they don’t use bottlenecks; they use SepConv layers, upsampling the number of channels to match the output of the 4/4 classifier.
  • Depthwise separable convolutions need fewer operations than standard 2D convolutions.
  • Used in MobileNet.
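A back-of-the-envelope comparison of the operation counts, following the usual MobileNet-style accounting (the sizes in the usage example are illustrative):

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a k x k standard convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

def separable_conv_macs(h, w, c_in, c_out, k):
    """Depthwise step (one k x k filter per input channel) + pointwise (1x1) channel mixing."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise
```

For a 32×32 feature map with 64 input channels, 128 output channels, and 3×3 kernels, the separable version needs roughly 8× fewer multiply-accumulates; the ratio tends to 1/c_out + 1/k².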

12 of 52

Model performance - 4/4

  • The deepest configuration: all four ResNet sections (4/4) serve as the teacher network.
  • Equivalent to a typical ResNet model.
  • Best accuracy but slowest inference; not what you would typically run on edge devices.

13 of 52

Model performance - 1/4

  • Uses one of the four (1/4) ResNet sections.
  • Lower accuracy but faster execution, since it has fewer parameters.

14 of 52

Model performance - 3/4

  • Contains three of the four ResNet sections (3/4).
  • A compromise between accuracy and speed.

15 of 52

Self-Distillation Framework: Loss Sources

Loss Source 1: cross-entropy from labels, applied to all classifiers (C classifiers / M classes) → supervised learning on every classifier

Loss Source 2: KL divergence under teacher guidance → shallow classifier ≈ deepest classifier

Loss Source 3: L2 loss from hints → shallow representation ≈ deepest representation
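A minimal numpy sketch of how the three sources might be combined for one shallow classifier; the weights `alpha`, `lam` and temperature `T` are illustrative placeholders, not the paper's exact values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(student_logits, teacher_logits,
                           student_feat, teacher_feat,
                           label, alpha=0.3, lam=0.03, T=3.0):
    """Loss for ONE shallow classifier, combining the three sources."""
    p_s = softmax(student_logits)
    ce = -np.log(p_s[label] + 1e-12)                                 # source 1: labels
    q_t = softmax(teacher_logits, T)
    q_s = softmax(student_logits, T)
    kl = np.sum(q_t * (np.log(q_t + 1e-12) - np.log(q_s + 1e-12)))  # source 2: teacher
    hint = np.sum((student_feat - teacher_feat) ** 2)                # source 3: L2 hint
    return (1 - alpha) * ce + alpha * kl + lam * hint
```

The deepest classifier only receives the cross-entropy term; the total training loss sums this quantity over all shallow classifiers.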

16 of 52

Loss Source 1 - Cross entropy from labels

Computed against the ground-truth labels, so the knowledge is sent directly to every classifier. This is the only loss that applies to all classifiers.

Loss Source 1: cross-entropy from labels to all classifiers (C classifiers / M classes)

Supervised Learning on all Classifiers

17 of 52

Loss Source 2 - KL divergence under teacher

Computed between the teacher (the deepest classifier, FC layer 4) and the students (all other classifiers). The aim is to bring each shallow classifier very close to the deepest one.

The temperature hyperparameter T softens the output probability distribution; in KD it surfaces more information from an otherwise skewed distribution.

Loss Source 2: KL Divergence under Teacher guidance

Shallow Classifier approx. Deeper Classifier
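A quick numpy illustration of what the temperature does (the logit values are arbitrary):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T: larger T flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

hard = softmax([5.0, 1.0, 0.0], T=1.0)  # skewed: the winning class dominates
soft = softmax([5.0, 1.0, 0.0], T=4.0)  # softened: relative ranking of the losers is visible
```

At T = 1 the top class absorbs almost all the probability mass; at T = 4 the non-winning classes keep enough mass for the student to learn how the teacher ranks them (the "dark knowledge").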

19 of 52

Loss source 3 - L2 loss from hints

We use the teacher’s hidden representation as a hint to train the students, aiming to minimize the distance between feature maps. The bottleneck layers show positive results in adapting each layer’s representation depth.

Loss Source 3: L2 Loss from hints

Shallow Representation approx. Deeper Representation

20 of 52

Experiments

21 of 52

Experiments - Models/Data

  • ResNet
  • WideResNet
  • Pyramid ResNet
  • ResNeXt(Wider/Efficient)
  • VGG

Additive (a) and multiplicative (b) Pyramid ResNet

ResNet-50 Architecture

ResNext

  • CIFAR 100
  • ImageNet

22 of 52

Experiment: Compare with baseline

  1. All configurations benefit from self-distillation.
  2. Deeper neural networks improve more with self-distillation.
  3. Ensembling improves performance on CIFAR100 but gives mixed results on ImageNet (it depends on consensus among classifiers).

*Numbers in red don’t beat the baseline.

Equivalent Models

23 of 52

Experiment: Compared with distillation

Self-distillation outperforms all the other methods, even though it needs no additional teacher model.

Knowledge Distillation [2014]

FitNets [2015]

Attention Transfer [2017]

Deep Mutual Learning [2017]

24 of 52

Comparison Model: Deeply Supervised Nets

A previous paper that uses SVMs on the intermediate feature maps, backpropagating errors through these intermediary linear classifiers. Very similar to this article; the main differences are that the deeper layers are trained only from labels (and the use of bottleneck/CNN layers for the intermediate classifiers?)

25 of 52

Experiment: Compared with Deeply Supervised Net

  1. Self-Distillation outperforms DSN in all configurations
  2. Shallow classifiers benefit more from Self-Distillation. (i.e. classifier 1/4, 2/4 and 3/4 )

Average improvement over DSN: +0.88 (classifier 1/4), +1.18 (2/4), +1.17 (3/4), +0.26 (4/4)

26 of 52

Experiment: Scalable depth for Adapting inference

  • The approach provides a scalable solution, allowing accuracy and response time to be traded off dynamically.
  • The equivalent models always outperform the baseline.
  • The ensemble method adds little cost for the extra accuracy.
  • 3 out of 4 “Classifier 3/4” configurations outperform the baseline (highlighted in the table).
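One way to exploit the scalable depth at inference is a confidence-based early exit; the threshold policy below is our illustration, not part of the paper:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_predict(logits_per_exit, threshold=0.9):
    """Return (prediction, exit index): take the first classifier whose
    softmax confidence clears the threshold, else fall back to the deepest."""
    for i, logits in enumerate(logits_per_exit):
        p = softmax(logits)
        if p.max() >= threshold:
            return int(p.argmax()), i
    p = softmax(logits_per_exit[-1])
    return int(p.argmax()), len(logits_per_exit) - 1
```

Easy examples exit at a shallow classifier (cheap), hard ones fall through to the deepest one (accurate); the threshold is the accuracy/latency knob.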

Equivalent Models

27 of 52

Discussion and Future Works

28 of 52

Discussion: Flat minimum

Self-distillation helps the model converge to a flat minimum, which aids generalization.

Although shallow networks perform well during training, they fall far behind over-parameterized models on the test set.

Explanation: over-parameterized models converge to flat minima more easily, while shallow networks can be trapped in sharp minima that are sensitive to biases in the data.

(top) Intuitive illustration of the difference between flat and sharp minima.

(bottom) Train/test accuracy with Gaussian noise injected into an 18-layer ResNet on CIFAR100.

Models trained with self-distillation sit in flatter minima and are more robust to perturbations of their parameters.
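The perturbation test in the bottom figure can be mimicked on a toy loss landscape; `loss_flat` and `loss_sharp` below are stand-ins for the two trained models, not real networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_flat(w):
    return 0.1 * np.sum(w ** 2)   # wide basin around w = 0

def loss_sharp(w):
    return 10.0 * np.sum(w ** 2)  # narrow basin, same minimum location

def robustness(loss_fn, w_star, sigma=0.5, trials=200):
    """Average loss increase when the weights are perturbed by Gaussian noise --
    the probe behind the flat vs. sharp minimum comparison."""
    base = loss_fn(w_star)
    bumps = [loss_fn(w_star + sigma * rng.standard_normal(w_star.shape)) - base
             for _ in range(trials)]
    return float(np.mean(bumps))

w = np.zeros(10)  # both toy losses share this minimum
```

The same amount of parameter noise hurts the sharp minimum far more than the flat one, which is the intuition for why flat minima generalize better.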

29 of 52

Discussion: Preventing vanishing gradient

Inspecting the mean gradient magnitude in two identical architectures, the self-distilled network shows larger gradients, especially in the first two blocks.

It inherits, to some extent, the ability of Deeply Supervised Networks (DSN) to address the vanishing gradient problem: each section receives direct supervision from the ground-truth labels, which keeps gradients from vanishing.

Mean gradient magnitude in each layer of two 18-layer ResNets trained with and without self-distillation.
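The measurement itself can be sketched with a toy manual backprop; a small sigmoid chain stands in for the ResNet, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_grad_magnitudes(depth=6, width=16):
    """Mean |dL/dW| per layer of a sigmoid chain, for loss L = sum(outputs).
    Without intermediate supervision, early layers get vanishingly small gradients."""
    Ws = [rng.normal(0, 0.2, (width, width)) for _ in range(depth)]
    x = rng.normal(size=(width,))
    acts = [x]
    for W in Ws:
        acts.append(sigmoid(W @ acts[-1]))
    delta = np.ones(width)  # dL/d(output)
    mags = []
    for W, a_in, a_out in zip(reversed(Ws), reversed(acts[:-1]), reversed(acts[1:])):
        delta = delta * a_out * (1 - a_out)   # back through the sigmoid
        grad_W = np.outer(delta, a_in)        # dL/dW for this layer
        mags.append(float(np.mean(np.abs(grad_W))))
        delta = W.T @ delta                   # back through the linear map
    return list(reversed(mags))               # index 0 = first (shallowest) layer
```

Plotting these magnitudes per layer reproduces the vanishing-gradient picture; the deep-supervision losses of self-distillation inject extra gradient exactly where this curve is smallest.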

30 of 52

Discussion: More discriminative features

More discriminative features are extracted by the higher-capacity classifiers (3/4 and 4/4) under self-distillation.

Points are more scattered after the first two ResBlocks (classifiers 1/4 and 2/4) and more tightly clustered for the deeper classifiers (3/4 and 4/4).

SSE: sum of squares due to error within groups

  • Smaller SSE means denser clusters

SSB: sum of squares between groups (distance)

  • Larger SSB means more discriminative features
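The two statistics are straightforward to compute from feature vectors and class labels; a small numpy sketch (data in the test is synthetic):

```python
import numpy as np

def sse_ssb(points, labels):
    """SSE: within-cluster sum of squared distances to each cluster mean.
    SSB: size-weighted squared distances of the cluster means to the grand mean."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    grand = points.mean(axis=0)
    sse = 0.0
    ssb = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        mu = cluster.mean(axis=0)
        sse += np.sum((cluster - mu) ** 2)
        ssb += len(cluster) * np.sum((mu - grand) ** 2)
    return sse, ssb
```

Tight, well-separated clusters (discriminative features) give small SSE and large SSB, which is the pattern reported for the deeper classifiers.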

31 of 52

Conclusion / Takeaways

A lot of hyperparameter tuning, since there is more than one loss function; it can be complex to optimize properly.

Self-distillation is beneficial in all configurations without having to design an efficient teacher.

The architecture provides flexibility to balance accuracy and inference time.

32 of 52

Self-Distillation Amplifies Regularization in Hilbert Space

2020, Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett

Presented by:

Marc-André Ruel and Yan Côté

33 of 52

AGENDA: Self-Distillation Amplifies Regularization in Hilbert Space

  • Introduction
  • Setup
  • Self-Distillation
  • Experiments
  • Conclusion

34 of 52

Introduction

Self-Distillation:

“It has been empirically observed that the self-distilled model often achieves higher accuracy on held out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training”

35 of 52

Setup: Making it Mathematically Easier to Solve

Fit a nonlinear function to K training points, subject to an L2-type constraint: total loss below ε.

Add a regularizer encoding a bias toward preferred functions (input: a function; output: a smoothness score). Under Mercer’s condition, u(x, x*) is a kernel function.

This relates to the Neural Tangent Kernel (NTK): an infinitely wide network trained with L2 loss is equivalent to kernel regression, so in this regime the network simplifies to a linear model with the neural tangent kernel.
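A reconstruction of the setup's optimization problem in the slide's notation (K points, tolerance ε, kernel u); this follows the paper's form, though exact normalizations may differ:

```latex
f^{*} \;=\; \arg\min_{f \in \mathcal{F}} \;
\int\!\!\int u(x, x^{*})\, f(x)\, f(x^{*})\, dx\, dx^{*}
\quad \text{s.t.} \quad
\frac{1}{K} \sum_{k=1}^{K} \bigl( f(x_k) - y_k \bigr)^{2} \;\le\; \epsilon .
```

The objective is the smoothness score of f under the kernel u; the constraint caps the average training error at ε.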

36 of 52

Setup : Closed Form Solution

Using the Karush-Kuhn-Tucker (KKT) conditions (which allow the constrained nonlinear system to be solved), a closed form for f* is obtained, with c weighting the regularization.

Let the kernel g(x, t) be a Green’s function of the linear operator; the problem then has a unique closed-form solution.

G is positive definite, with eigendecomposition G = VᵀDV (V orthogonal, D diagonal). The multiplier c > 1 is the root of the constraint equation, and (cI + G) is invertible.
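The closed form referenced above can be written as follows (reconstructed; symbols as on the slide, normalization details may differ from the paper):

```latex
f^{*}(x) \;=\; \mathbf{g}_{x}^{\top} \, (c I + G)^{-1} \, \mathbf{y},
\qquad
G_{jk} \;\triangleq\; g(x_j, x_k), \quad
(\mathbf{g}_{x})_{k} \;\triangleq\; g(x, x_k),
```

where y stacks the K training labels and c is the multiplier that makes the ε-constraint tight; since G is positive definite, cI + G is always invertible.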

37 of 52

Self-Distillation

After (t) rounds of distillation, the labels are effectively rotated: since V is orthogonal, the solution decomposes in the eigenbasis of G.

The recursion is nonlinear because of c_t, so there is no closed-form solution; the paper develops bounds instead.

38 of 52

Self-Distillation: Evolution of basis

Starting from the solution above and assuming c is constant, the only evolving part is the diagonal matrix B.

Looking at pairs of diagonal entries, distillation progressively sparsifies B, which limits the expressiveness of the regression solution: it reduces the number of basis functions used in the representation. As distillation goes on, fewer columns of the rotated labels remain usable. This is how self-distillation imposes capacity control.
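With G = VᵀDV and per-round multipliers c_i, the diagonal matrix B evolves as follows (our reconstruction of the slide's notation):

```latex
B_t \;=\; \prod_{i=0}^{t} D\,(c_i I + D)^{-1},
\qquad
(B_t)_{jj} \;=\; \prod_{i=0}^{t} \frac{d_j}{c_i + d_j} \;<\; 1 ,
```

so each round shrinks eigen-direction j by the factor d_j / (c_i + d_j); directions with small eigenvalues d_j decay fastest, which is exactly the progressive sparsification of B.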

39 of 52

Self-Distillation: Obvious Solutions

The number of rounds (t) before the solution collapses can be bounded; the authors show that a smaller ε yields a sparser solution.

K = number of training points

ε = training error allowed after each round

κ = condition number of the matrix G

(Collapse mode) the trivial solution f*(x) = 0

40 of 52

Self-Distillation: differences with Early Stopping

Here early stopping means increasing the ε threshold rather than limiting epochs; it affects only c₀.

Early stopping does the opposite of sparsification: at best it maintains the current sparsity, otherwise it makes the solution even denser.

41 of 52

Experiments: Example with Sin Function

42 of 52

Experiments: Train/Val accuracy on CIFAR100

ResNet and VGG on CIFAR-10/100

10 runs with random initial weights at each distillation step.

Training loss decreases steadily, while test accuracy peaks and then falls.

L2 loss and cross-entropy show similar curves.

43 of 52

Experiments: Early stopping and cross-entropy

Early stopping is not self-distillation.

They use the self-distillation training loss at each round to perform early stopping.

44 of 52

Conclusions / Takeaways

  • You can’t self-distill indefinitely: the solution eventually collapses.
  • The basis functions initially used to represent the solution gradually shift to a sparser representation.
    • Models operating in the near-interpolation regime (small ε) achieve a higher sparsity level.
  • Early stopping also has a regularization effect, but a different one.
    • It makes the space of possible functions denser (the opposite of sparsification).

Open Questions:

  • Can we achieve the same kind of regularization more efficiently?
  • Study the differences with cross-entropy.

45 of 52

QUESTIONS

46 of 52

Question: Generalization

In [11] (On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Nitish Shirish Keskar et al., Intel, 2016), they observed with numerical evidence that sharpness is closely related to the batch size used in training: the noise in small-batch gradients pushes the iterates out of sharp basins of attraction. The authors raise several open questions, however.

That’s a good question for Aaron !

47 of 52

Question: Features Maps

Referring to the figure below, each block seems to keep a representation that is intermediate between low-level and high-level features. Since the loss combines three terms, each ResBlock is trained on the labels but also on its distance to the last layer’s representation and classifier. We suspect this balance of the three factors keeps the ResBlocks from becoming too similar.

An additional factor: the FC/bottleneck elements in each branch of the architecture may keep some features specific to each classifier, while the main backbone keeps more generic representations to accommodate all the loss functions.

48 of 52

Question: adding Layers in inference

All layers or blocks below the dashed line are optional. If you keep them at inference, you can switch dynamically between classifiers when response time becomes critical, effectively turning them on and off. The effect on inference is simply lower or higher accuracy, as reported in Table 1.

49 of 52

Loss ablation

One ablation study was run against DSN. But according to comments on the GitHub issue list, a fuller ablation does seem like a very relevant experiment to run, since it is not clear what can be attributed to the loss and what to the new architecture (the bottleneck and SepConv heads).

50 of 52

They do seem equivalent, since both would help with vanishing-gradient issues; but the goal here is also to generate compressed classifiers for edge devices.

51 of 52

REFERENCE

[1] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2014

[2] Junpeng Wang et al, DeepVID: Deep Visual Interpretation and Diagnosis for Image Classifiers via Knowledge Distillation,

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2016

[4] Dongyoon Han, Jiwhan Kim, Junmo Kim, Deep Pyramidal Residual Networks,2016

[5] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio, FitNets: Hints for Thin Deep Nets, 2014

[6] Sergey Zagoruyko, Nikos Komodakis, Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer, 2017

[7] Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu, Deep Mutual Learning, 2017

[8] Nitish Shirish Keskar, Dheevatsa Mudigere, et al, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2017

[9] Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett, Self-Distillation Amplifies Regularization in Hilbert Space, 2020

[10] https://twitter.com/thegradient/status/1228132843630387201?lang=en

[11] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Nitish Shirish Keskar et al, 2016.

52 of 52

Optional Questions

  • Dropping ResNet blocks (BlockDrop) does seem to have a similar goal in a way: removing blocks before backpropagation helps the vanishing-gradient issue by forcing the loss signal to pass through fewer layers, much like this paper does:
    • BlockDrop: Dynamic Inference Paths in Residual Networks; https://arxiv.org/pdf/1711.08393.pdf