
Dropout

Deep Learning Seminar

School of Electrical Engineering, Tel Aviv University


Outline

  • Motivation
  • Inspiration
  • Model description
  • Results
  • Relation to earlier methods (weight decay)
  • DropConnect generalization
  • Summary

Deep Learning Seminar - Dropout


Motivation

  • Neural networks are prone to overfitting.
  • In principle, the best solution is to average the predictions of all possible models in a Bayesian framework.
  • This is computationally infeasible for large networks.
  • We need something faster.


Dropout
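The slide's illustration did not survive extraction. As a minimal sketch of the idea (the function name, layer size, and retention probability p = 0.5 are my own illustrative choices, using NumPy): during training, each hidden unit survives with probability p and is zeroed otherwise.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Zero each activation independently; a unit survives with probability p."""
    mask = rng.random(h.shape) < p     # Bernoulli(p) keep-mask
    return h * mask, mask

rng = np.random.default_rng(0)
h = np.ones(10_000)                    # toy layer of activations
out, mask = dropout_train(h, p=0.5, rng=rng)
print(mask.mean())   # about half of the units survive
```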

 


Dropout – an equivalent method
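The equation on this slide was lost in extraction. Presumably the equivalent method is the common "inverted dropout" formulation: scale the kept activations by 1/p during training, so that the network can be used as-is at test time. A sketch under that assumption (names and values are my own choices):

```python
import numpy as np

def inverted_dropout(h, p, rng, train=True):
    """Keep each unit with probability p and scale by 1/p during training."""
    if not train:
        return h                       # test time: use the network as-is
    mask = rng.random(h.shape) < p     # Bernoulli(p) keep-mask
    return h * mask / p                # rescale so E[output] == h

rng = np.random.default_rng(1)
h = np.full(100_000, 2.0)
out = inverted_dropout(h, p=0.8, rng=rng)
print(out.mean())   # close to 2.0: the expectation is preserved
```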

 


Why apply dropout to units, with all their ingoing and outgoing arcs, and not just to the arcs themselves? We will get there later…


Dropout – model description

 


The test-time weight-scaling rule can be a very bad approximation, particularly for the ReLU activation.

Dropping a unit corresponds to a masking matrix with columns of 0's.
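The caption's point, that replacing the random mask by its expectation can be a bad approximation under ReLU, can be made concrete with a tiny example (the weights and inputs below are arbitrary choices of mine):

```python
import numpy as np
from itertools import product

relu = lambda z: np.maximum(z, 0.0)

x = np.array([1.0, 2.0, -3.0])    # toy inputs
w = np.array([0.5, -1.0, 1.0])    # toy weights
p = 0.5                           # retention probability

# Exact model average: enumerate all 2^3 dropout masks on the inputs.
exact = 0.0
for m in product([0, 1], repeat=3):
    prob = np.prod([p if keep else 1 - p for keep in m])
    exact += prob * relu(np.dot(w, np.array(m) * x))

# Weight-scaling approximation: one pass with the weights scaled by p.
approx = relu(p * np.dot(w, x))

print(exact, approx)   # 0.0625 vs 0.0: the approximation misses the average
```

The gap is one-sided: since ReLU is convex, Jensen's inequality guarantees the true average is at least as large as the approximation.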


Experimental Results

SVHN – Street View House Numbers

  • Dropout is also applied to the convolutional layers.
  • All hidden units are ReLUs.


Experimental Results

CIFAR-10 and CIFAR-100:

  • Images drawn from 10 and 100 categories respectively.


Experimental Results

The effect of dropout on learned features:

  • Without dropout, units tend to compensate for mistakes of other units.
  • This leads to overfitting, since these co-adaptations do not generalize to unseen data.
  • Dropout prevents co-adaptations by making the presence of other hidden units unreliable.


(Figure: features learned on MNIST with one hidden layer of 256 ReLUs, trained without dropout. The units have co-adapted, and no single unit detects a meaningful feature.)


Experimental Results

The effect of data set size:

  • An architecture of 784-1024-1024-2048-10 is used on the MNIST dataset.


  • Huge data set: dropout barely improves the error rate; the data set is big enough that overfitting is not an issue.
  • Average to large data set: dropout improves the error rate.
  • Extremely small data set: dropout does not improve the error rate, and can even make it worse.


Experimental Results

 


How good is the approximated averaging technique?
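The comparison plot was lost in extraction. As a sketch, the true model average can be estimated by Monte Carlo (sample many dropout masks and average the outputs) and compared against the single weight-scaled forward pass; the layer sizes and p below are my own toy choices:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(42)

W = rng.normal(size=(4, 8))   # toy layer: 8 inputs -> 4 ReLU units
x = rng.normal(size=8)
p = 0.5

# Monte Carlo model average: sample dropout masks on the inputs
# and average the resulting unit outputs.
n = 100_000
masks = rng.random((n, 8)) < p
mc = relu((masks * x) @ W.T).mean(axis=0)

# Approximate average: a single pass with the weights scaled by p.
scaled = relu(p * (W @ x))

# By Jensen's inequality (ReLU is convex), mc >= scaled per unit;
# the gap is what the approximation loses.
print(np.abs(mc - scaled).max())
```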


Weight Decay
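The slide's equations were lost in extraction. For reference, weight decay (L2 regularization) penalizes the squared norm of the weights, which adds a shrink-toward-zero term to every gradient step. A minimal sketch (the learning rate, decay coefficient, and zero data-gradient are illustrative choices of mine):

```python
import numpy as np

def sgd_step(w, grad_loss, lr, weight_decay):
    """One SGD step with L2 weight decay: w <- w - lr * (dL/dw + lambda * w)."""
    return w - lr * (grad_loss + weight_decay * w)

# With a zero data gradient, weight decay alone shrinks the weights
# geometrically: each step multiplies them by (1 - lr * lambda).
w = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    w = sgd_step(w, grad_loss=np.zeros_like(w), lr=0.1, weight_decay=0.5)
print(np.abs(w).max())   # weights have shrunk close to zero
```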

 


DropConnect

 


(Diagram: dropout is a special case of DropConnect.)
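DropConnect drops individual weights (arcs) rather than whole units; dropout is the special case in which entire columns of the weight mask are zeroed, i.e. all arcs touching a dropped unit. A sketch (sizes and p = 0.5 are my own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(size=(4, 6))    # toy weights
v = rng.normal(size=6)         # toy input

# DropConnect: an independent Bernoulli mask for every weight.
M = (rng.random(W.shape) < 0.5).astype(float)
u_dropconnect = (M * W) @ v

# Dropout as a special case: whole columns of the mask are zeroed.
keep = (rng.random(6) < 0.5).astype(float)   # unit-level keep-mask
M_col = np.tile(keep, (4, 1))                # column j is all 0's if unit j dropped
u_dropout = (M_col * W) @ v

# Masking every weight out of unit j == masking unit j itself.
print(np.allclose(u_dropout, W @ (keep * v)))   # True
```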


DropConnect

 

The input to the activation function is a weighted sum of Bernoulli variables and can be approximated by a Gaussian via moment matching. For u = (M ∘ W)v with keep probability p:

E[u] = p · W v,   Var[u_i] = p(1 − p) · Σ_j W_ij² v_j²
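These moments can be checked empirically, assuming each weight is kept independently with probability p; the sizes below are toy choices of my own:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(4, 8))    # toy weights
v = rng.normal(size=8)         # toy input
p = 0.5

# Moment-matched Gaussian statistics of the pre-activation u = (M * W) @ v.
mean = p * (W @ v)
var = p * (1 - p) * (W**2 @ v**2)

# Empirical check over many sampled DropConnect masks.
n = 100_000
M = (rng.random((n, 4, 8)) < p).astype(float)
u = np.einsum('nij,ij,j->ni', M, W, v)    # (M_k * W) @ v for every sample k
print(np.abs(u.mean(axis=0) - mean).max(),
      np.abs(u.var(axis=0) - var).max())
```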


DropConnect

No-drop, Dropout and DropConnect comparison:

  • MNIST: 2 layers (800 neurons each).
  • CIFAR-10: 4 layers. Dropout/DropConnect is applied only on the final layer.
  • SVHN: generally the same architecture as for CIFAR-10. Due to the large training set, all methods achieve nearly the same performance.


(Results tables: MNIST, CIFAR-10, SVHN.)


DropConnect

In the DropConnect paper, the authors achieve the lowest error rate recorded on MNIST at the time of publication!


However, these best results are essentially the same with DropConnect as with no drop at all…


DropConnect

DropConnect’s drawbacks:

  • Training with DropConnect is slower.
  • DropConnect’s implementation is more complicated than Dropout’s.
  • In the paper, DropConnect is shown to work well mostly when aggregating more than one network.


Summary

  • Dropout is a very good and fast regularization method.

  • Dropout is somewhat slow to train (2–3 times slower than training without dropout).

  • When the data set is of moderate to large size, dropout excels. When the data set is big enough, dropout does not help much.

  • Dropout achieves better results than previously used regularization methods (weight decay).

  • DropConnect is a generalization of dropout. Its superiority over dropout is unclear: it is more complicated to implement and slower to train than dropout.
