Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation
May 2019, Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, Kaisheng Ma
Presented by:
Marc-André Ruel and Yan Côté
AGENDA: Be Your Own Teacher
Introduction: Motivation
Accuracy
Inference Time
Paradox in some applications: higher accuracy demands deeper models, but deeper models increase inference time.
Introduction: Compression Approaches (as of May 2019)
Knowledge Distillation (Hinton et al., 2014)¹
Lightweight network design (SqueezeNet / MobileNet CNNs)
Pruning: removing small-weight connections
Quantization (e.g., FP32 weights reduced to INT16 or INT8)
¹ Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2014
(Left) SqueezeNet Architecture
(Right) Contents of the Fire modules
Introduction: Knowledge Distillation Setback
Inspired by the teacher/student paradigm: an over-parameterized teacher is replaced by a compact student, yielding compression and acceleration. The setback is the low efficiency of knowledge transfer between the two models.
Challenges: (1) low efficiency of knowledge transfer, as the student can rarely exploit all of the teacher's knowledge; (2) designing and pre-training a suitable teacher model adds cost and complexity.
Introduction: Proposition
One-Step self-distillation Framework
Comparison of training complexity, training time, and accuracy between traditional distillation and proposed self distillation (reported on CIFAR100).
Introduction: Contributions
Self Distillation Framework
Self-Distillation Framework: Presentation
Figure showing the details of a ResNet equipped with self-distillation: (i) the ResNet is divided into four sections according to depth; (ii) an additional bottleneck and FC layer are set after each section; (iii) all of the classifiers can be utilized independently, with different accuracy and response time; (iv) each classifier is trained under three kinds of supervision as depicted; (v) parts under the dashed line can be removed at inference.
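To make the structure concrete, here is a hypothetical PyTorch sketch of such a multi-exit backbone; the section/bottleneck modules, `feat_dim`, and `num_classes` are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SelfDistillResNet(nn.Module):
    """Hypothetical sketch: backbone split into four sections,
    each with its own bottleneck + FC exit."""
    def __init__(self, sections, bottlenecks, num_classes=100, feat_dim=512):
        super().__init__()
        self.sections = nn.ModuleList(sections)        # the four ResNet stages
        self.bottlenecks = nn.ModuleList(bottlenecks)  # align hint feature sizes
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in sections)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        logits, feats = [], []
        for section, neck, head in zip(self.sections, self.bottlenecks, self.heads):
            x = section(x)                    # backbone feature at this depth
            f = neck(x)                       # hint features, matched in size
            feats.append(f)
            logits.append(head(self.pool(f).flatten(1)))
        return logits, feats                  # the teacher is logits[-1]
```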
Self-Distillation Framework: BottleNeck
The bottleneck layers are responsible for reducing and then restoring dimensions.
The practical reason to use bottlenecks is mainly to decrease model size, at the cost of a slight decrease in accuracy.
Here they are also used for the hint loss, to reconcile the different feature map sizes of shallow and deep classifiers.
A deeper residual function F for ImageNet. Left: a standard ResNet building block. Right: a "bottleneck" building block for ResNet-50/101/152.
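As a rough illustration, here is a minimal PyTorch sketch of a ResNet-style bottleneck block; the reduction factor of 4 is the usual ResNet choice, and the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Reduce channels with a 1x1 conv, process with a 3x3, restore with a 1x1."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction  # e.g. 256 -> 64
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),  # reduce
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),  # restore
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))  # residual connection
```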
Separable Convolution
Images from Kunlun Bai
A standard 2D convolution filters and combines all input channels in a single step.
Depthwise separable convolutions factor it into two steps:
1) a depthwise convolution: one spatial filter applied per input channel;
2) a pointwise (1x1) convolution that mixes information across channels.
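A minimal PyTorch sketch of the two steps; `groups=in_ch` is what makes the first convolution depthwise (one spatial filter per channel):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Step 1: depthwise -- one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch, bias=False)
        # Step 2: pointwise -- 1x1 conv mixing information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
y = DepthwiseSeparableConv(32, 64)(x)  # -> shape (1, 64, 56, 56)
```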
Model performance - 1/4
Model performance - 3/4
Model performance - 4/4
Self-Distillation Framework: Loss Sources
Loss Source 1: Cross-Entropy from labels to all classifiers (C classifiers / M classes)
Loss Source 2: KL Divergence under Teacher guidance
Loss Source 3: L2 Loss from hints
Supervised Learning on all Classifiers
Shallow Classifier approx. Deeper Classifier
Shallow Representation approx. Deeper Representation
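Combining the three sources, the per-classifier training objective from the paper takes roughly this form (α balances labels against the teacher's soft targets and λ weights the hints; $q^i$ is classifier $i$'s output, $q^C$ the deepest classifier's, $F_i$ and $F_C$ the corresponding feature maps; the KL and hint terms vanish for the deepest classifier itself):

$$ \mathcal{L} = \sum_{i=1}^{C} \Big[ (1-\alpha)\,\mathrm{CE}\!\left(q^{i}, y\right) + \alpha\,\mathrm{KL}\!\left(q^{C} \,\|\, q^{i}\right) + \lambda \left\lVert F_i - F_C \right\rVert_2^2 \Big] $$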
Loss Source 1 - Cross entropy from labels
Computed against the ground-truth labels, this knowledge is sent directly to all classifiers. It is the only loss that applies to all classifiers, including the deepest one.
Loss Source 1: Cross-Entropy from labels to all classifiers (C classifiers / M classes)
Supervised Learning on all Classifiers
Loss Source 2 - KL divergence under teacher
Computed between the teacher (the deepest classifier, FC layer 4) and the students (all other classifiers). The aim is to make each shallow classifier's output distribution very close to the deepest classifier's.
The temperature hyperparameter T softens the output probability distribution; in KD it exposes more information from an otherwise highly skewed distribution.
Loss Source 2: KL Divergence under Teacher guidance
Shallow Classifier approx. Deeper Classifier
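For reference, the temperature-softened softmax and the resulting KL term, written in the usual KD notation (logits $z^i_m$ for class $m$ of classifier $i$):

$$ q^{i}_{m} = \frac{\exp\!\left(z^{i}_{m}/T\right)}{\sum_{j=1}^{M} \exp\!\left(z^{i}_{j}/T\right)}, \qquad \mathrm{KL}\!\left(q^{C} \,\|\, q^{i}\right) = \sum_{m=1}^{M} q^{C}_{m} \log \frac{q^{C}_{m}}{q^{i}_{m}} $$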
Loss source 3 - L2 loss from hints
The teacher's hidden representation is used as a hint to train the students, minimizing the L2 distance between feature maps. The bottlenecks show positive results in adapting the shallow layers' representations to the deepest layer's depth.
Loss Source 3: L2 Loss from hints
Shallow Representation approx. Deeper Representation
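A minimal PyTorch sketch of how the three sources could be combined for one shallow classifier; `alpha`, `lam`, and `T` are the balancing hyperparameters, and the teacher tensors are detached on the assumption that the deepest classifier is not trained through these terms:

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(logits_s, logits_t, feat_s, feat_t, labels,
                           alpha=0.3, lam=0.03, T=3.0):
    # Source 1: cross-entropy from the ground-truth labels
    ce = F.cross_entropy(logits_s, labels)
    # Source 2: KL divergence against the deepest classifier's softened output
    kl = F.kl_div(F.log_softmax(logits_s / T, dim=1),
                  F.softmax(logits_t.detach() / T, dim=1),
                  reduction="batchmean") * T * T
    # Source 3: L2 hint loss between (bottleneck-aligned) feature maps
    hint = F.mse_loss(feat_s, feat_t.detach())
    return (1 - alpha) * ce + alpha * kl + lam * hint
```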
Experiments
Experiments - Models/Data
Additive (a) and multiplicative (b) Pyramid ResNet
ResNet-50 Architecture
ResNeXt
Experiment: Compare with baseline
*Numbers in red don’t beat the baseline.
Equivalent Models
Experiment: Compared with distillation
Self-distillation always outperforms the other distillation methods, even though it requires no additional teacher model.
Knowledge Distillation [2014]
FitNets [2015]
Attention Transfer [2017]
Deep Mutual Learning [2017]
Comparison Model: Deeply Supervised Nets
An earlier paper that attaches SVM classifiers to the different feature maps and backpropagates errors through these intermediate linear classifiers. Very similar to our article, but the main differences are that the intermediate classifiers are trained only from labels (and perhaps the use of bottleneck/CNN layers for the intermediate classifiers?).
Experiment: Compared with Deeply Supervised Net
Average improvement over DSN: +0.88 (classifier 1/4), +1.18 (2/4), +1.17 (3/4), +0.26 (4/4)
Experiment: Scalable depth for adaptive inference
Equivalent Models
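As a sketch of how such dynamic switching could look, here is a hypothetical confidence-threshold early exit; the paper itself reports fixed per-classifier accuracy/latency trade-offs, so the thresholding policy here is an assumption, and `model` is the multi-exit sketch from earlier:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def adaptive_inference(model, x, threshold=0.9):
    """Return the first classifier's prediction whose softmax confidence
    clears `threshold`, falling back to the deepest classifier.
    Assumes batch size 1 and model(x) -> (logits_list, feats_list)."""
    logits_list, _ = model(x)
    for logits in logits_list:
        probs = F.softmax(logits, dim=1)
        if probs.max().item() >= threshold:
            return probs.argmax(dim=1)   # early exit: cheaper, faster
    return logits_list[-1].argmax(dim=1)  # deepest, most accurate classifier
```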
Discussion and Future Works
Discussion: Flat minimum
Self-distillation helps the model converge to a flat minimum, which helps generalization.
Although shallow networks perform well during training, they fall far behind over-parameterized models on the test set.
Explanation: over-parameterized models converge to flat minima more easily, while shallow networks can get caught in sharp minima that are sensitive to biases in the data.
(top) Intuitive explanation of the difference between flat and sharp minima.
(bottom) Train/test accuracy comparison with Gaussian noise injected into an 18-layer ResNet on CIFAR100.
Models trained with self-distillation sit in flatter minima and are more robust to perturbations of the parameters.
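A sketch of such a flatness probe, assuming a trained `model` and a hypothetical `evaluate(model) -> accuracy` helper: Gaussian noise of standard deviation `sigma` is added to every parameter of a copy, and the size of the accuracy drop indicates sharpness:

```python
import copy
import torch

@torch.no_grad()
def perturbed_accuracy(model, evaluate, sigma=0.05):
    """Inject Gaussian noise into a copy of the parameters and re-evaluate.

    A flat minimum shows a small accuracy drop; a sharp one a large drop.
    """
    noisy = copy.deepcopy(model)
    for p in noisy.parameters():
        p.add_(torch.randn_like(p) * sigma)
    return evaluate(noisy)
```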
Discussion: Preventing vanishing gradient
Inspecting the mean gradients of two identical architectures, the self-distillation gradients are larger, especially in the first two blocks.
Self-distillation inherits, to some extent, the ability of Deeply Supervised Networks (DSN) to address the vanishing gradient problem: each section receives direct supervision from the ground-truth labels, which prevents the gradient from vanishing.
Mean magnitude of gradients in each layer of two 18-layer ResNets trained with and without self-distillation.
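This kind of statistic can be collected after a backward pass; a small sketch, assuming `loss.backward()` has already been called on the model:

```python
def mean_gradient_magnitudes(model):
    """Call after loss.backward(); returns {layer name: mean |gradient|}."""
    return {name: param.grad.abs().mean().item()
            for name, param in model.named_parameters()
            if param.grad is not None}
```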
Discussion: More discriminative features
More discriminative features are extracted by the classifiers with more capacity (3/4 and 4/4) in self-distillation.
Points are more scattered in the first two ResBlocks (classifiers 1/4 and 2/4) and more tightly clustered in the deeper ones (classifiers 3/4 and 4/4).
SSE: Sum of squares due to error within groups
SSB: Sum of squares between groups (between-class distance)
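These are the standard within-group and between-group scatter measures; with class sets $\mathcal{C}_k$ of size $n_k$, class means $\mu_k$, and global mean $\mu$:

$$ \mathrm{SSE} = \sum_{k} \sum_{x_i \in \mathcal{C}_k} \lVert x_i - \mu_k \rVert^2, \qquad \mathrm{SSB} = \sum_{k} n_k \lVert \mu_k - \mu \rVert^2 $$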
Conclusion / Takeaways
A lot of hyperparameter tuning, since more than one loss function is involved; it can be complex to optimize properly.
Self-distillation is beneficial in all configurations without having to design an efficient teacher.
The architecture provides flexibility to balance accuracy and inference time.
Self-Distillation Amplifies Regularization in Hilbert Space
2020, Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett
Presented by:
Marc-André Ruel and Yan Côté
AGENDA: Self-Distillation Amplifies Regularization in Hilbert Space
Introduction
Self-Distillation:
“It has been empirically observed that the self-distilled model often achieves higher accuracy on held out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training”
Setup: Making it mathematically easier to solve
Fit a nonlinear function to the training data subject to an L2 loss: K points, total loss ≤ ε.
Add a regularizer (a bias toward the functions we prefer): input = function, output = smoothness. By Mercer's condition, u(x, x*) is a kernel function.
This relates to the Neural Tangent Kernel (NTK): an infinite-width network trained with L2 loss is equivalent to kernel regression, so in this regime the network can be simplified to a linear model with the neural tangent kernel.
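Written out (reconstructed from the paper's definitions), the setup is the variational problem:

$$ f^{*} = \arg\min_{f} R(f) \quad \text{s.t.} \quad \frac{1}{K} \sum_{k=1}^{K} \big( f(x_k) - y_k \big)^2 \le \varepsilon, \qquad R(f) = \iint u(x, x^{*}) \, f(x) \, f(x^{*}) \, dx \, dx^{*} $$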
Setup: Closed Form Solution
Using the Karush-Kuhn-Tucker (KKT) conditions, which allow solving a nonlinear system under constraints, yields a closed form for f* (c weights the regularization here).
Let the kernel g(x, t) be the Green's function of the linear operator.
The problem has a single closed-form solution.
G is positive definite
Eigendecomposition: G = VᵀDV, with V orthogonal and D diagonal
c > 1, the root of a scalar equation enforcing the loss constraint
(cI + G) is invertible
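Putting this together (reconstructed from the paper's notation, up to its normalization constants), the closed form reads:

$$ f^{*}(x) = \mathbf{g}_x^{\top} (cI + G)^{-1} \mathbf{y}, \qquad [\mathbf{g}_x]_k = g(x, x_k), \quad G_{jk} = g(x_j, x_k) $$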
Self-Distillation
Performing t rounds of distillation
Rotated labels: z_t = V y_t
V is orthogonal
Decomposition
The recurrence is nonlinear because of c_t
No closed-form solution
The paper develops bounds instead
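A sketch of the recurrence, reconstructed from the decomposition above: each round feeds the current predictions back as labels, which in the rotated basis is a diagonal contraction:

$$ \mathbf{y}_{t+1} = G (c_t I + G)^{-1} \mathbf{y}_t \;\Longleftrightarrow\; \mathbf{z}_{t+1} = D (c_t I + D)^{-1} \mathbf{z}_t, \qquad \mathbf{z}_t = V \mathbf{y}_t $$

Each diagonal entry multiplies $z$ by $d_j / (c_t + d_j) < 1$, so directions with small eigenvalues shrink fastest; this is the sparsification mechanism.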
Self-Distillation: Evolution of basis
Starting from this decomposition and assuming c is constant, the only t-dependent part is the diagonal matrix B.
Looking at pairs of diagonal entries, distillation progressively sparsifies B, which affects the expressiveness of the regression solution: it reduces the number of basis functions used in the representation. As the distillation process goes on, fewer directions of the original labels remain available. This is how self-distillation imposes capacity control.
Self-Distillation: Obvious Solutions
We can bound the number of rounds t before the solution collapses; the paper shows that a smaller epsilon yields a sparser solution (and an earlier collapse).
K = number of training points
epsilon = training error tolerance after each round
Kappa = condition number of the matrix G
(Collapse mode) Trivial solution with f*(x) = 0
Self-Distillation: differences with Early Stopping
Early stopping is modeled by increasing the epsilon threshold rather than limiting epochs; it affects only c_0.
Early stopping does the opposite of sparsification: at best it maintains the current sparsity, otherwise it makes the solution even denser.
Experiments: Example with Sin Function
Experiments: Train/Val accuracy on CIFAR100
ResNet and VGG on CIFAR-10/100
10 runs with random weight initialization at each distillation step.
Training accuracy constantly decreases across rounds, while test accuracy peaks and then falls.
L2 Loss and cross-entropy show similar curves
Experiments: Early stopping and cross-entropy
Early stopping is not Self-distillation
Using the self-distillation training loss at each round as the early-stopping threshold.
Conclusions / Takeaways
Open Questions:
QUESTIONS
Question: Generalization
In [11] (On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Nitish Shirish Keskar et al., Intel, 2016), they observed, using numerical evidence, that sharpness is closely related to the batch size used in training: the noise of small-batch gradients pushes the iterates out of sharp basins of attraction. They raised several questions that remain open.
That's a good question for Aaron!
Question: Features Maps
Referring to the figure below, each block seems to keep a representation that grades from low-level to high-level features. Since the loss combines three terms, each ResBlock is trained on the labels but also on the distance between its representation/classifier and those of the last layer. We suspect that the balance of the three factors keeps the ResBlocks from becoming too similar.
An additional factor: the FC/bottleneck elements in each branch of the architecture might keep features specific to their classifier, while the main backbone keeps more generic representations that accommodate all the loss functions.
Question: Adding layers at inference
All the layers or blocks below the dashed line are optional. If you keep them at inference, you can switch dynamically between classifiers when response time becomes critical, as if they could be turned ON/OFF. The effect is lower or higher inference accuracy, as reported in Table 1.
Loss ablation
One ablation study was tested against DSN. According to some comments on the GitHub issue list, it does seem like a very relevant experiment to run, since it is not clear what can be attributed to the loss and what should be attributed to the new architecture (the bottleneck and SepConv heads).
The two do seem related, since both help with vanishing-gradient issues, but self-distillation also aims to produce compressed classifiers for edge devices.
REFERENCES
[1] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, 2014.
[2] Junpeng Wang et al. DeepVID: Deep Visual Interpretation and Diagnosis for Image Classifiers via Knowledge Distillation.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2016.
[4] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep Pyramidal Residual Networks, 2016.
[5] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for Thin Deep Nets, 2014.
[6] Sergey Zagoruyko and Nikos Komodakis. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer, 2017.
[7] Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep Mutual Learning, 2017.
[8] Nitish Shirish Keskar, Dheevatsa Mudigere, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2017.
[9] Hossein Mobahi, Mehrdad Farajtabar, and Peter L. Bartlett. Self-Distillation Amplifies Regularization in Hilbert Space, 2020.
[10] https://twitter.com/thegradient/status/1228132843630387201?lang=en
[11] Nitish Shirish Keskar et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016.
Optional Questions