1 of 27

Variational Dropout Sparsifies Deep Neural Networks

Dmitry Molchanov*, Arsenii Ashukha*, Dmitry Vetrov

August 30th 2017

2 of 27

Outline

  • Gaussian Dropout
  • Variational Dropout
  • Sparse Variational Dropout
  • Technical details
  • Experiments


3 of 27

Gaussian Dropout


[Figure: a fully-connected layer (input, weights, nonlinearity, activations) with multiplicative noise.]

Gaussian Dropout does not change the mean value of the activations; it only adds multiplicative noise.
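For reference, the Gaussian Dropout layer can be written as follows (the formula below is reconstructed from the paper rather than recovered from the slide):

$$B = (A \circ \Xi)\,W, \qquad \xi_{mi} \sim \mathcal{N}(1, \alpha), \qquad \alpha = \frac{p}{1-p},$$

where A is the input matrix, Ξ the multiplicative noise matrix, W the weight matrix, and B the matrix of pre-activations.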

4 of 27

Variational Dropout

  • Posterior distribution: q(w_ij) = N(w_ij | θ_ij, α_ij·θ_ij²)
  • Prior distribution: log-uniform, p(log |w_ij|) = const
  • The KL divergence term D_KL(q(w_ij) ‖ p(w_ij)) does not depend on θ_ij, only on α_ij
  • For a fixed α, training θ is equivalent to Gaussian Dropout with noise ξ ~ N(1, α)

Diederik Kingma, Tim Salimans, and Max Welling. "Variational Dropout and the Local Reparameterization Trick."

5 of 27

Variational Dropout with the LRT

Local Reparameterization: instead of sampling the weights, sample the pre-activations directly:

b_mj ~ N(γ_mj, δ_mj),   γ_mj = Σ_i a_mi·θ_ij,   δ_mj = Σ_i a_mi²·α_ij·θ_ij²

Variational Dropout + Local Reparameterization yields an unbiased, lower-variance stochastic gradient of the variational lower bound.

6 of 27

Variance Reduction

With the multiplicative parameterization w_ij = θ_ij·(1 + √α_ij·ε_ij), ε_ij ~ N(0, 1), the variance of the gradients goes out of control when α is large. Very noisy!

Solution: restrict 0 < α < 1 (Kingma et al.)

But this prohibits the use of large values of α!

7 of 27

Why do we need large alphas?

The KL term favors large dropout rates α.

Large α means:
  • Infinitely large noise that corrupts the data term
  • The equivalent binary dropout rate p = α / (1 + α) goes to 1

A weight with a very large α carries no information and can be pruned, which is what gives the sparsity.

8 of 27

Variance Reduction

Before (multiplicative noise parameterization):

w_ij = θ_ij·(1 + √α_ij·ε_ij),   ∂w_ij/∂θ_ij = 1 + √α_ij·ε_ij

The variance of the gradients goes out of control when α is large. Very noisy!

Solution: restrict 0 < α < 1 (Kingma et al.) or …

… use the Additive Noise Parameterization!

After (additive noise parameterization):

w_ij = θ_ij + σ_ij·ε_ij,   σ_ij² = α_ij·θ_ij²,   ∂w_ij/∂θ_ij = 1

No noise in the gradient! Optimize the VLB w.r.t. (θ, σ), where σ_ij is a new independent variable.

9 of 27

Sparse Variational Dropout: implementation

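The code from the original slide is not recoverable here; below is a minimal PyTorch sketch of such a layer (the class name LinearSVDO, the initialization values and the threshold are illustrative assumptions, not the authors' reference implementation). It combines the additive noise parameterization, the local reparameterization trick, the KL approximation from the paper, and thresholding by log α:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSVDO(nn.Module):
    """Fully-connected layer with Sparse Variational Dropout:
    additive noise parameterization + local reparameterization trick."""

    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        # optimize w.r.t. log sigma to avoid constrained optimization
        self.log_sigma = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = threshold  # prune weights with log alpha above this value
        nn.init.xavier_normal_(self.theta)

    @property
    def log_alpha(self):
        # log alpha = log sigma^2 - log theta^2, clipped for numerical stability
        la = 2.0 * self.log_sigma - 2.0 * torch.log(self.theta.abs() + 1e-8)
        return torch.clamp(la, -8.0, 8.0)

    def forward(self, x):
        if self.training:
            # local reparameterization: sample the pre-activations, not the weights
            mean = F.linear(x, self.theta, self.bias)
            var = F.linear(x * x, torch.exp(2.0 * self.log_sigma))
            return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)
        # at test time, use the deterministic, thresholded weights
        mask = (self.log_alpha < self.threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # approximation of -KL(q || log-uniform prior) from Molchanov et al., 2017
        k1, k2, k3 = 0.63576, 1.8732, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * torch.log1p(torch.exp(-la)) - k1
        return -neg_kl.sum()
```

During training the kl() terms of all such layers are added to the objective; at test time weights whose log α exceeds the threshold are zeroed out, which is where the sparsity comes from.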

10 of 27

A problem: warm-up needed


  • Too many weights are pruned at the beginning
  • The KL term dominates the data term early on
  • Solution: anneal the KL coefficient 𝛕 (see the sketch below)
    • Use 𝛕 = 0 for about 5 epochs,
    • then linearly increase it to 1 over ~10 epochs,
    • and keep 𝛕 = 1 after that
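A possible implementation of this warm-up schedule (the epoch counts follow the slide; the function name kl_weight is just illustrative):

```python
def kl_weight(epoch, zero_epochs=5, ramp_epochs=10):
    """KL coefficient: 0 for the first epochs, then a linear ramp up to 1."""
    if epoch < zero_epochs:
        return 0.0
    return min(1.0, (epoch - zero_epochs) / ramp_epochs)

# per-batch objective: loss = data_term + kl_weight(epoch) * kl_term
```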

11 of 27

Other technical details


  • Use the dropout rates for thresholding (see the helper below)
    • We used log α = 3 as the threshold
    • No fine-tuning is needed after thresholding
  • One σ per layer works just as well
    • and allows us to speed up the computations
  • Simple models can be trained from scratch (with warm-up)
  • Pretrain large models with L2 regularization and binary dropout
    • No warm-up is needed in that case
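For example, the resulting sparsity can be checked with a small helper like the one below (a sketch assuming the LinearSVDO layer from the earlier slide; log α = 3 corresponds to threshold=3.0):

```python
def sparsity(model, threshold=3.0):
    """Fraction of weights whose log alpha exceeds the pruning threshold."""
    total, pruned = 0, 0
    for m in model.modules():
        if isinstance(m, LinearSVDO):
            total += m.log_alpha.numel()
            pruned += (m.log_alpha >= threshold).sum().item()
    return pruned / total
```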

12 of 27

… more technical details


  • Optimize w.r.t. log σ to avoid constrained optimization
  • Initialize σ with smaller values in convolutional layers
    • Convolutional layers are very sensitive to noise
  • Clip extreme values of α for numerical stability
    • We used −8 ≤ log α ≤ 8
  • Use Adam for optimization (see the training-step sketch below)
    • To avoid manual tuning of separate learning rates for weights and variances
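Putting these pieces together, a single training loop might look roughly like this (a sketch under the same assumptions as before; model, train_loader and num_epochs are placeholders, and kl_weight / LinearSVDO come from the earlier sketches):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
N = len(train_loader.dataset)  # puts the data term and the KL term on the same scale

for epoch in range(num_epochs):
    for x, y in train_loader:
        opt.zero_grad()
        kl = sum(m.kl() for m in model.modules() if isinstance(m, LinearSVDO))
        # SGVB objective: negative expected log-likelihood + annealed KL term
        loss = F.cross_entropy(model(x), y) + kl_weight(epoch) * kl / N
        loss.backward()
        opt.step()
```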

13 of 27

Visualization


[Figures: LeNet-5 convolutional layer; LeNet-5 fully-connected layer (100 × 100 patch).]

14 of 27

LeNet-5-Caffe and LeNet-300-100 on MNIST

[Table: results for the fully-connected network LeNet-300-100 and the convolutional network LeNet-5-Caffe on MNIST.]


15 of 27

VGG-like on CIFAR-10

  • 13 Convolutional layers and 2 Fully-Connected layers
  • Pre-Activation Batch Norm and Binary Dropout after each layer


Official Torch Blog: Zagoruyko, http://torch.ch/blog/2015/07/30/cifar.html

16 of 27

VGG-like on CIFAR-10


[Plots: compression rate and error (%) vs. the model width scale factor k.]

The number of filters / neurons is linearly scaled by k (the width of the network).

17 of 27

Random Labeling


Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization."

No dependency between data and labels ⇒ Sparse VD yields an empty model, whereas conventional models easily overfit.

18 of 27

Future Extensions for Compression


Easy to incorporate into the Deep Compression framework: just replace the pruning stage with Sparse VD!

In theory, it can also be combined with Soft Weight Sharing, which provides better quantization.

19 of 27

Structured Bayesian Pruning via Log-Normal Multiplicative Noise


20 of 27

Bayesian Sparsification of Recurrent Neural Networks


  • Log-uniform prior
  • The same sample of W is used for all timesteps (see the sketch below)
  • Plain reparameterization trick (no LRT) for W
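A minimal sketch of that sampling scheme for a vanilla RNN cell (all names such as theta_hh and log_sigma_hh are hypothetical; this is only an illustration, not the code from the paper):

```python
import torch

batch, seq_len, n_in, n_h = 32, 20, 64, 128
x = torch.randn(batch, seq_len, n_in)

# variational parameters of the recurrent weights (additive parameterization)
theta_hh = torch.randn(n_h, n_h) * 0.1
log_sigma_hh = torch.full((n_h, n_h), -5.0)
W_ih = torch.randn(n_h, n_in) * 0.1

# plain reparameterization trick: one noise sample for W, reused at every timestep
W_hh = theta_hh + torch.exp(log_sigma_hh) * torch.randn_like(theta_hh)

h = torch.zeros(batch, n_h)
for t in range(seq_len):
    h = torch.tanh(x[:, t] @ W_ih.T + h @ W_hh.T)
```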

21 of 27

Bayesian Sparsification of RNNs: experiments


22 of 27

Bayesian Sparsification of RNNs: experiments


23 of 27

Sparse Variational Dropout

  • Additive Noise Parameterization reduces the variance of the stochastic gradients
  • Variational Dropout Sparsifies Deep Neural Networks
  • No extra hyperparameters, just the parameters of the optimizer
  • Follow-up papers:
    • Structured Bayesian Pruning via Log-Normal Multiplicative Noise
    • Bayesian Sparsification of Recurrent Neural Networks

24 of 27

Group sparsity

  • Sparse VD provides general, unstructured sparsity
  • Sparse VD can't be efficiently generalized to group sparsity
  • Structured sparsity is the key to DNN acceleration
  • Structured sparsity allows more efficient compression


25 of 27

Noise sparsity


[Two layer diagrams (input, weights, nonlinearity, activations, noise) contrasting where the multiplicative noise is applied.]

26 of 27

Distributions


Variational Dropout: multiplicative noise ξ ~ N(1, α)

Log-Normal Dropout: multiplicative noise ξ ~ LogN(μ, σ²), which is always non-negative
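An illustrative comparison of the two noise models in code (shapes and parameter names are hypothetical; in the structured setting the log-normal noise is shared per neuron rather than applied per weight):

```python
import torch

n_units = 128
a = torch.randn(32, n_units)  # a batch of activations

# Variational Dropout: Gaussian multiplicative noise with mean 1 and variance alpha
alpha = 0.5
xi_gauss = 1.0 + alpha ** 0.5 * torch.randn(n_units)

# Log-Normal Dropout: xi = exp(z), z ~ N(mu, sigma^2); the noise is always positive
mu = torch.zeros(n_units)
log_sigma = torch.full((n_units,), -1.0)
xi_lognorm = torch.exp(mu + torch.exp(log_sigma) * torch.randn(n_units))

out_vd = a * xi_gauss      # per-unit Gaussian noise
out_ln = a * xi_lognorm    # per-unit log-normal noise
```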

27 of 27

Results on MNIST
