Variational Dropout Sparsifies Deep Neural Networks
Dmitry Molchanov* Arsenii Ashukha* Dmitry Vetrov
August 30th 2017
Outline
Gaussian Dropout
[Diagram: a single layer (input, weights, nonlinearity, activations) with multiplicative noise injected]
Gaussian dropout adds multiplicative noise ξ ~ N(1, α); since E[ξ] = 1, it does not change the mean value of the layer's output.
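As a minimal sketch (mine, not the slides'; PyTorch is an assumption), Gaussian dropout can be written as a module that multiplies its input by N(1, α) noise during training:

import torch
import torch.nn as nn

class GaussianDropout(nn.Module):
    """Multiplies the input by noise xi ~ N(1, alpha) during training only."""
    def __init__(self, alpha=1.0):
        super().__init__()
        self.alpha = alpha

    def forward(self, x):
        if not self.training or self.alpha == 0:
            return x  # E[xi] = 1, so no rescaling is needed at test time
        noise = 1.0 + (self.alpha ** 0.5) * torch.randn_like(x)
        return x * noise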
Variational Dropout
Diederik Kingma, Tim Salimans, and Max Welling, Variational dropout and the local reparameterization trick
Gaussian Dropout with noise ξ ~ N(1, α) on the weights is equivalent to a Gaussian approximate posterior q(w | θ, α) = N(θ, α θ²).
Prior distribution and the KL divergence term: with the log-uniform prior p(|w|) ∝ 1/|w|, the KL term does not depend on θ; it is a function of the dropout rate α only.
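For reference, the objective is the standard variational lower bound; a compact statement in LaTeX (the exact notation on the slide is assumed):

\mathcal{L}(\theta, \alpha) = \mathbb{E}_{q(W \mid \theta, \alpha)}\left[\log p(\mathcal{D} \mid W)\right] - D_{\mathrm{KL}}\left(q(W \mid \theta, \alpha) \,\|\, p(W)\right) \to \max_{\theta, \alpha},
\qquad q(w_{ij}) = \mathcal{N}\left(\theta_{ij},\ \alpha_{ij} \theta_{ij}^{2}\right).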
Variational Dropout with the LRT
Local Reparameterization: sample the pre-activations directly instead of sampling the weights.
Variational Dropout + Local Reparameterization: b_mj ~ N(γ_mj, δ_mj), where γ_mj = Σ_i a_mi θ_ij and δ_mj = Σ_i a_mi² α_ij θ_ij².
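A hedged sketch of the trick for one dense layer (the function name and the use of PyTorch are assumptions): sample the pre-activations from their implied Gaussian instead of sampling every weight.

import torch

def lrt_dense_forward(a, theta, alpha):
    """Local reparameterization: b ~ N(A @ theta, (A*A) @ (alpha * theta^2))."""
    mean = a @ theta                         # gamma_mj
    var = (a * a) @ (alpha * theta * theta)  # delta_mj
    return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)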
Variance Reduction
The variance of the gradients goes out of control when α is large. Very noisy!
Solution: restrict 0 < α < 1 (Kingma et al.)
But this prohibits using large values of α!
Why do we need large alphas?
The KL term favors large dropout rates α.
Large α means: the noise overwhelms the weight (the signal-to-noise ratio is 1/α), so the weight carries no information and can be safely pruned, i.e. set to zero.
Variance Reduction
Before: the multiplicative noise parameterization. The variance of the gradients goes out of control when α is large. Very noisy!
Solution: restrict 0 < α < 1 (Kingma et al.) or …
… use Additive Noise Parameterization!
Optimize the VLB w.r.t. (θ, σ) instead of (θ, α): σ is treated as a new independent variable.
After: no noise!
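In symbols (my reconstruction following the paper, not a verbatim copy of the slide):

\text{Before: } w = \theta (1 + \sqrt{\alpha}\, \varepsilon), \ \varepsilon \sim \mathcal{N}(0, 1)
\ \Rightarrow\ \frac{\partial w}{\partial \theta} = 1 + \sqrt{\alpha}\, \varepsilon \quad \text{(variance grows with } \alpha\text{)}

\text{After: } w = \theta + \sigma \varepsilon, \quad \sigma^{2} = \alpha \theta^{2}
\ \Rightarrow\ \frac{\partial w}{\partial \theta} = 1 \quad \text{(no noise from the parameterization)}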
Sparse Variational Dropout: implementation
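The implementation details are not preserved in this transcript; below is a hedged sketch of a Sparse VD dense layer (PyTorch and all names are my assumptions), combining the additive parameterization, the LRT forward pass, and the KL approximation from the paper:

import torch
import torch.nn as nn

class SparseVDLinear(nn.Module):
    """Dense layer with Sparse Variational Dropout (illustrative sketch, not the authors' code)."""
    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(0.01 * torch.randn(in_features, out_features))
        self.log_sigma2 = nn.Parameter(torch.full((in_features, out_features), -10.0))
        self.threshold = threshold  # prune weights with log alpha above this value

    @property
    def log_alpha(self):
        # alpha = sigma^2 / theta^2 (additive noise parameterization)
        return self.log_sigma2 - 2.0 * torch.log(torch.abs(self.theta) + 1e-8)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample pre-activations directly
            mean = x @ self.theta
            var = (x * x) @ self.log_sigma2.exp()
            return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)
        # At test time, drop weights with low signal-to-noise ratio (large alpha)
        mask = (self.log_alpha < self.threshold).float()
        return x @ (self.theta * mask)

    def kl(self):
        # Approximation of -KL(q || log-uniform prior) from Molchanov et al. (2017)
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * torch.log1p(torch.exp(-la)) - k1
        return -neg_kl.sum()

The total loss would then be the data term plus the summed kl() over all such layers, scaled by the warm-up coefficient discussed on the next slide.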
A problem: warm-up needed
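The slide's details are missing from this transcript; a common remedy (assumed here) is to ramp the weight of the KL term from 0 to 1 over the first epochs, and/or to start from a pre-trained network, so that weights are not pruned away before the network has learned anything. A minimal sketch:

def kl_weight(epoch, warmup_epochs=10):
    """Linearly anneal the KL coefficient from 0 to 1 (illustrative values)."""
    return min(1.0, epoch / warmup_epochs)

# loss = data_term + kl_weight(epoch) * total_kl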
Other technical details
… more technical details
Visualization
[Figures: learned weights of LeNet-5: a convolutional layer and a 100 × 100 patch of a fully-connected layer]
LeNet-5-Caffe and LeNet-300-100 on MNIST
Fully-connected network: LeNet-300-100; convolutional network: LeNet-5-Caffe
VGG-like on CIFAR-10
Official Torch Blog: Zagoruyko, http://torch.ch/blog/2015/07/30/cifar.html
VGG-like on CIFAR-10
[Plots: compression rate and error (%) vs. model width scale factor k]
The number of filters / neurons is linearly scaled by k (the width of the network).
Random Labeling
Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization."
No dependency between data and labels → Sparse VD yields an empty model, whereas conventional models easily overfit.
Future Extensions for Compression
Easy to incorporate into the Deep Compression framework: just replace the pruning stage with Sparse VD!
In theory, it can also be combined with Soft Weight Sharing, which provides better quantization.
Structured Bayesian Pruning via LogN Multiplicative Noise
Bayesian Sparsification of Recurrent Neural Networks
Bayesian Sparsification of RNNs: experiments
Sparse Variational Dropout
GitHub: https://goo.gl/2D4tFW
Group sparsity
Noise sparsity
[Diagrams: two layer schematics (input, weights, nonlinearity, activations) showing where the multiplicative noise is injected in each case]
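The diagrams themselves are not preserved; as an illustration of the distinction (my sketch, not the slides'), the multiplicative noise can be sampled per weight (unstructured sparsity) or shared across a group such as an input neuron (group sparsity, which removes whole neurons or filters):

import torch

x = torch.randn(32, 256)  # batch of activations
W = torch.randn(256, 128)

# Per-weight noise: each weight has its own alpha -> unstructured sparsity
alpha_w = torch.rand(256, 128)
noise_w = 1 + alpha_w.sqrt() * torch.randn(256, 128)
y_unstructured = x @ (W * noise_w)

# Per-neuron (group) noise: one alpha per input neuron -> whole rows of W can be pruned
alpha_g = torch.rand(256, 1)
noise_g = 1 + alpha_g.sqrt() * torch.randn(256, 1)
y_group = (x * noise_g.t()) @ W  # equivalent to scaling each input neuron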
Distributions
Variational Dropout: multiplicative Gaussian noise ξ ~ N(1, α).
Log-Normal Dropout: multiplicative log-normal noise ξ ~ LogN(μ, σ²) (always non-negative).
Results on MNIST