1 of 27

Variational Dropout Sparsifies Deep Neural Networks

Dmitry Molchanov*, Arsenii Ashukha*, Dmitry Vetrov

August 30th 2017

2 of 27

Outline

  • Gaussian Dropout
  • Variational Dropout
  • Sparse Variational Dropout
  • Technical details
  • Experiments


3 of 27

Gaussian Dropout


[Figure: a fully-connected layer (input, weights, nonlinearity, activations) with multiplicative noise.]

Gaussian Dropout does not change the mean value of the activations; it only adds multiplicative noise.
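For reference, the Gaussian Dropout layer can be written as follows (the formula below is reconstructed from the paper rather than recovered from the slide):

$$B = (A \circ \Xi)\,W, \qquad \xi_{mi} \sim \mathcal{N}(1, \alpha), \qquad \alpha = \frac{p}{1-p},$$

where A is the input matrix, Ξ the multiplicative noise matrix, W the weight matrix, and B the matrix of pre-activations.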

4 of 27

Variational Dropout

  • Posterior distribution: q(w_ij) = N(w_ij | θ_ij, α_ij·θ_ij²)
  • Prior distribution: log-uniform, p(log |w_ij|) = const
  • The KL divergence term D_KL(q(w_ij) ‖ p(w_ij)) does not depend on θ_ij, only on α_ij
  • For a fixed α, training θ is equivalent to Gaussian Dropout with noise ξ ~ N(1, α)

Diederik Kingma, Tim Salimans, and Max Welling. "Variational Dropout and the Local Reparameterization Trick."

5 of 27

Variational Dropout with the LRT

Local Reparameterization: instead of sampling the weights, sample the pre-activations directly:

b_mj ~ N(γ_mj, δ_mj),   γ_mj = Σ_i a_mi·θ_ij,   δ_mj = Σ_i a_mi²·α_ij·θ_ij²

Variational Dropout + Local Reparameterization yields an unbiased, lower-variance stochastic gradient of the variational lower bound.

6 of 27

Variance Reduction

With the multiplicative parameterization w_ij = θ_ij·(1 + √α_ij·ε_ij), ε_ij ~ N(0, 1), the variance of the gradients goes out of control when α is large. Very noisy!

Solution: restrict 0 < α < 1 (Kingma et al.)

But this prohibits the use of large values of α!

7 of 27

Why do we need large alphas?

The KL term favors large dropout rates α.

Large α means:
  • Infinitely large noise that corrupts the data term
  • The equivalent binary dropout rate p = α / (1 + α) goes to 1

A weight with a very large α carries no information and can be pruned, which is what gives the sparsity.

8 of 27

Variance Reduction

Before (multiplicative noise parameterization):

w_ij = θ_ij·(1 + √α_ij·ε_ij),   ∂w_ij/∂θ_ij = 1 + √α_ij·ε_ij

The variance of the gradients goes out of control when α is large. Very noisy!

Solution: restrict 0 < α < 1 (Kingma et al.) or …

… use the Additive Noise Parameterization!

After (additive noise parameterization):

w_ij = θ_ij + σ_ij·ε_ij,   σ_ij² = α_ij·θ_ij²,   ∂w_ij/∂θ_ij = 1

No noise in the gradient! Optimize the VLB w.r.t. (θ, σ), where σ_ij is a new independent variable.

9 of 27

Sparse Variational Dropout: implementation

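The code from the original slide is not recoverable here; below is a minimal PyTorch sketch of such a layer (the class name LinearSVDO, the initialization values and the threshold are illustrative assumptions, not the authors' reference implementation). It combines the additive noise parameterization, the local reparameterization trick, the KL approximation from the paper, and thresholding by log α:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSVDO(nn.Module):
    """Fully-connected layer with Sparse Variational Dropout:
    additive noise parameterization + local reparameterization trick."""

    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        # optimize w.r.t. log sigma to avoid constrained optimization
        self.log_sigma = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = threshold  # prune weights with log alpha above this value
        nn.init.xavier_normal_(self.theta)

    @property
    def log_alpha(self):
        # log alpha = log sigma^2 - log theta^2, clipped for numerical stability
        la = 2.0 * self.log_sigma - 2.0 * torch.log(self.theta.abs() + 1e-8)
        return torch.clamp(la, -8.0, 8.0)

    def forward(self, x):
        if self.training:
            # local reparameterization: sample the pre-activations, not the weights
            mean = F.linear(x, self.theta, self.bias)
            var = F.linear(x * x, torch.exp(2.0 * self.log_sigma))
            return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)
        # at test time, use the deterministic, thresholded weights
        mask = (self.log_alpha < self.threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # approximation of -KL(q || log-uniform prior) from Molchanov et al., 2017
        k1, k2, k3 = 0.63576, 1.8732, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * torch.log1p(torch.exp(-la)) - k1
        return -neg_kl.sum()
```

During training the kl() terms of all such layers are added to the objective; at test time weights whose log α exceeds the threshold are zeroed out, which is where the sparsity comes from.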

10 of 27

A problem: warm-up needed


  • Too many weights are pruned at the beginning
  • The KL term dominates the data term early on
  • Solution: anneal the KL coefficient 𝛕 (see the sketch below)
    • Use 𝛕 = 0 for about 5 epochs,
    • then linearly increase it to 1 over ~10 epochs,
    • and keep 𝛕 = 1 after that
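A possible implementation of this warm-up schedule (the epoch counts follow the slide; the function name kl_weight is just illustrative):

```python
def kl_weight(epoch, zero_epochs=5, ramp_epochs=10):
    """KL coefficient: 0 for the first epochs, then a linear ramp up to 1."""
    if epoch < zero_epochs:
        return 0.0
    return min(1.0, (epoch - zero_epochs) / ramp_epochs)

# per-batch objective: loss = data_term + kl_weight(epoch) * kl_term
```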

11 of 27

Other technical details


  • Use the dropout rates for thresholding (see the helper below)
    • We used log α = 3 as the threshold
    • No fine-tuning is needed after thresholding
  • One σ per layer works just as well
    • and allows us to speed up the computations
  • Simple models can be trained from scratch (with warm-up)
  • Pretrain large models with L2 regularization and binary dropout
    • No warm-up is needed in that case
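For example, the resulting sparsity can be checked with a small helper like the one below (a sketch assuming the LinearSVDO layer from the earlier slide; log α = 3 corresponds to threshold=3.0):

```python
def sparsity(model, threshold=3.0):
    """Fraction of weights whose log alpha exceeds the pruning threshold."""
    total, pruned = 0, 0
    for m in model.modules():
        if isinstance(m, LinearSVDO):
            total += m.log_alpha.numel()
            pruned += (m.log_alpha >= threshold).sum().item()
    return pruned / total
```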

12 of 27

… more technical details


  • Optimize w.r.t. log σ to avoid constrained optimization
  • Initialize σ with smaller values in convolutional layers
    • Convolutional layers are very sensitive to noise
  • Clip extreme values of α for numerical stability
    • We used −8 ≤ log α ≤ 8
  • Use Adam for optimization (see the training-step sketch below)
    • To avoid manual tuning of separate learning rates for weights and variances
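Putting these pieces together, a single training loop might look roughly like this (a sketch under the same assumptions as before; model, train_loader and num_epochs are placeholders, and kl_weight / LinearSVDO come from the earlier sketches):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
N = len(train_loader.dataset)  # puts the data term and the KL term on the same scale

for epoch in range(num_epochs):
    for x, y in train_loader:
        opt.zero_grad()
        kl = sum(m.kl() for m in model.modules() if isinstance(m, LinearSVDO))
        # SGVB objective: negative expected log-likelihood + annealed KL term
        loss = F.cross_entropy(model(x), y) + kl_weight(epoch) * kl / N
        loss.backward()
        opt.step()
```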

13 of 27

Visualization


[Figures: LeNet-5 convolutional layer; LeNet-5 fully-connected layer (100 × 100 patch).]

14 of 27

LeNet-5-Caffe and LeNet-300-100 on MNIST

[Table: results for the fully-connected network LeNet-300-100 and the convolutional network LeNet-5-Caffe on MNIST.]


15 of 27

VGG-like on CIFAR-10

  • 13 Convolutional layers and 2 Fully-Connected layers
  • Pre-Activation Batch Norm and Binary Dropout after each layer


Official Torch Blog: Zagoruyko, http://torch.ch/blog/2015/07/30/cifar.html

16 of 27

VGG-like on CIFAR-10


[Plots: compression rate and error (%) vs. the model width scale factor k.]

The number of filters / neurons is linearly scaled by k (the width of the network).

17 of 27

Random Labeling


Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization."

No dependency between data and labels ⇒ Sparse VD yields an empty model, whereas conventional models easily overfit.

18 of 27

Future Extensions for Compression


Easy to incorporate into the Deep Compression framework: just replace the pruning stage with Sparse VD!

In theory, it can also be combined with Soft Weight Sharing, which provides better quantization.

19 of 27

Structured Bayesian Pruning via Log-Normal Multiplicative Noise


20 of 27

Bayesian Sparsification of Recurrent Neural Networks


  • Log-uniform prior
  • The same sample of W is used for all timesteps (see the sketch below)
  • Plain reparameterization trick (no LRT) for W
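A minimal sketch of that sampling scheme for a vanilla RNN cell (all names such as theta_hh and log_sigma_hh are hypothetical; this is only an illustration, not the code from the paper):

```python
import torch

batch, seq_len, n_in, n_h = 32, 20, 64, 128
x = torch.randn(batch, seq_len, n_in)

# variational parameters of the recurrent weights (additive parameterization)
theta_hh = torch.randn(n_h, n_h) * 0.1
log_sigma_hh = torch.full((n_h, n_h), -5.0)
W_ih = torch.randn(n_h, n_in) * 0.1

# plain reparameterization trick: one noise sample for W, reused at every timestep
W_hh = theta_hh + torch.exp(log_sigma_hh) * torch.randn_like(theta_hh)

h = torch.zeros(batch, n_h)
for t in range(seq_len):
    h = torch.tanh(x[:, t] @ W_ih.T + h @ W_hh.T)
```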

21 of 27

Bayesian Sparsification of RNNs: experiments


22 of 27

Bayesian Sparsification of RNNs: experiments


23 of 27

Sparse Variational Dropout

  • Additive Noise Parameterization reduces the variance of the stochastic gradients
  • Variational Dropout Sparsifies Deep Neural Networks
  • No extra hyperparameters, just the parameters of the optimizer
  • Follow-up papers:
    • Structured Bayesian Pruning via Log-Normal Multiplicative Noise
    • Bayesian Sparsification of Recurrent Neural Networks

24 of 27

Group sparsity

  • Sparse VD provides general, unstructured sparsity
  • Sparse VD can't be efficiently generalized to group sparsity
  • Structured sparsity is the key to DNN acceleration
  • Structured sparsity allows more efficient compression


25 of 27

Noise sparsity


[Two layer diagrams (input, weights, nonlinearity, activations, noise) contrasting where the multiplicative noise is applied.]

26 of 27

Distributions


Variational Dropout: multiplicative noise ξ ~ N(1, α)

Log-Normal Dropout: multiplicative noise ξ ~ LogN(μ, σ²), which is always non-negative
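An illustrative comparison of the two noise models in code (shapes and parameter names are hypothetical; in the structured setting the log-normal noise is shared per neuron rather than applied per weight):

```python
import torch

n_units = 128
a = torch.randn(32, n_units)  # a batch of activations

# Variational Dropout: Gaussian multiplicative noise with mean 1 and variance alpha
alpha = 0.5
xi_gauss = 1.0 + alpha ** 0.5 * torch.randn(n_units)

# Log-Normal Dropout: xi = exp(z), z ~ N(mu, sigma^2); the noise is always positive
mu = torch.zeros(n_units)
log_sigma = torch.full((n_units,), -1.0)
xi_lognorm = torch.exp(mu + torch.exp(log_sigma) * torch.randn(n_units))

out_vd = a * xi_gauss      # per-unit Gaussian noise
out_ln = a * xi_lognorm    # per-unit log-normal noise
```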

27 of 27

Results on MNIST
