1 of 26

Deep Learning (DEEP-0001)

8 – Initialization

2 of 26

Initialization

  • Need for initialization
  • He initialization
  • Interlude: Expectations
  • Show that the pre-activations f’ have zero mean
  • Write variance of pre-activations f’ in terms of activations h in previous layer

  • Write variance of pre-activations f’ in terms of pre-activations f in previous layer

3 of 26

Initialization

  • Consider standard building block of NN in terms of preactivations:

  • How do we initialize the biases and weights?
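
The equation for the building block did not survive on this slide. A sketch of the usual form, using the outline's names (pre-activations f and f’, activations h); the symbols a (activation function), β (biases) and Ω (weights) are assumed notation:

```latex
\begin{align}
\mathbf{h}  &= \mathbf{a}[\mathbf{f}] \\
\mathbf{f}' &= \boldsymbol\beta + \boldsymbol\Omega\, \mathbf{h}
\end{align}
```

The initialization question is then which distributions to draw β and Ω from before training starts.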

6 of 26

Exploding gradients

Vanishing gradients
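
A minimal NumPy sketch of the two failure modes, assuming a fully connected ReLU network with zero biases. The helper name, the width and depth, and the loss (sum of the final activations) are illustrative assumptions, not taken from the slides. It backpropagates through randomly initialized layers and compares the gradient norm at the input for three weight variances.

```python
import numpy as np

def input_gradient_norm(weight_var, width=100, depth=50, seed=0):
    """Norm of d(loss)/d(input) for a random ReLU network (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)             # random input
    weights, masks = [], []
    # Forward pass: f = Omega h (biases are zero), h = ReLU[f]
    for _ in range(depth):
        omega = rng.normal(0.0, np.sqrt(weight_var), size=(width, width))
        f = omega @ h
        mask = f > 0                            # units where the ReLU is active
        h = f * mask
        weights.append(omega)
        masks.append(mask)
    # Backward pass for the assumed loss = sum of the final activations
    grad = np.ones(width)                       # d(loss)/d(final activations)
    for omega, mask in zip(reversed(weights), reversed(masks)):
        grad = omega.T @ (grad * mask)          # back through the ReLU, then through Omega
    return np.linalg.norm(grad)

width = 100
too_small = input_gradient_norm(1.0 / width, width)  # weight variance below 2/width
balanced = input_gradient_norm(2.0 / width, width)   # weight variance equal to 2/width
too_large = input_gradient_norm(4.0 / width, width)  # weight variance above 2/width
print(f"small: {too_small:.3e}  balanced: {balanced:.3e}  large: {too_large:.3e}")
```

With the smaller variance the gradient shrinks by a constant factor per layer (vanishing); with the larger variance it grows by a constant factor per layer (exploding).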

9 of 26

Expectations

Interpretation: what is the average value of g[x] when taking into account the probability of x?
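
The interpretation above corresponds to the standard definition E[g[x]] = ∫ g[x] Pr(x) dx. A small sketch that approximates this by averaging g over samples drawn from Pr(x); the helper name and the choice of a standard normal for Pr(x) are assumptions made for illustration:

```python
import random

def expectation(g, sampler, n=200_000):
    """Monte Carlo estimate of E[g[x]]: average g over n samples drawn from Pr(x)."""
    return sum(g(sampler()) for _ in range(n)) / n

random.seed(0)

def standard_normal():
    # Assumed Pr(x): normal with mean 0 and variance 1
    return random.gauss(0.0, 1.0)

mean = expectation(lambda x: x, standard_normal)               # estimates E[x]
second_moment = expectation(lambda x: x * x, standard_normal)  # estimates E[x^2]
print(f"E[x] is about {mean:.3f}, E[x^2] is about {second_moment:.3f}")
```

For this distribution the true values are E[x] = 0 and E[x^2] = 1, so the estimates should land close to those.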

10 of 26

Expectations

11 of 26

Rules for manipulating expectation
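
The formulas for the rules did not survive on this slide. The four standard rules for manipulating expectations are sketched below; that they match the slides' Rule 1 to Rule 4 in this order is an assumption. Here k is a constant and g, g_1, g_2 are arbitrary functions:

```latex
\begin{align}
\mathbb{E}[k] &= k \\
\mathbb{E}\big[k \cdot g[x]\big] &= k \cdot \mathbb{E}\big[g[x]\big] \\
\mathbb{E}\big[g_1[x] + g_2[x]\big] &= \mathbb{E}\big[g_1[x]\big] + \mathbb{E}\big[g_2[x]\big] \\
\mathbb{E}\big[g_1[x] \cdot g_2[y]\big] &= \mathbb{E}\big[g_1[x]\big] \cdot \mathbb{E}\big[g_2[y]\big]
  \quad \text{if } x \text{ and } y \text{ are independent}
\end{align}
```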

13 of 26

Aim: keep variance same between two layers (Initialization for forward pass)

Consider the mean of the pre-activations:
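
The equation that followed here is missing. A sketch of the usual argument, assuming the biases β_i are initialized to zero, the weights Ω_ij are drawn with zero mean, and the weights are independent of the activations h_j; D_h (the number of hidden units in the previous layer) is assumed notation:

```latex
\begin{align}
\mathbb{E}[f'_i]
  &= \mathbb{E}\Big[\beta_i + \sum_{j=1}^{D_h} \Omega_{ij} h_j\Big] \\
  &= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h} \mathbb{E}[\Omega_{ij} h_j] \\
  &= \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h} \mathbb{E}[\Omega_{ij}]\, \mathbb{E}[h_j] \\
  &= 0 + \sum_{j=1}^{D_h} 0 \cdot \mathbb{E}[h_j] = 0
\end{align}
```

The second line uses linearity of expectation and the third uses independence of weights and activations.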

14 of 26

Rule 1:

Rule 2:

Rule 3:

Rule 4:

18 of 26

Aim: keep variance same between two layers
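
A sketch of the variance calculation, under the following assumptions: zero biases; independent, zero-mean weights with variance σ_Ω² that are independent of the activations; ReLU activations; and pre-activations f_j distributed symmetrically about zero with variance σ_f². D_h, σ_Ω² and σ_f² are assumed notation.

```latex
\begin{align}
\sigma_{f'}^2
  &= \mathbb{E}\big[(f'_i)^2\big] - \mathbb{E}[f'_i]^2
   = \mathbb{E}\Big[\Big(\sum_{j=1}^{D_h} \Omega_{ij} h_j\Big)^2\Big] - 0 \\
  &= \sum_{j=1}^{D_h} \mathbb{E}\big[\Omega_{ij}^2\big]\, \mathbb{E}\big[h_j^2\big]
   = \sigma_\Omega^2 \sum_{j=1}^{D_h} \mathbb{E}\big[h_j^2\big] \\
  &= \sigma_\Omega^2 \sum_{j=1}^{D_h} \frac{\sigma_f^2}{2}
   = \frac{1}{2}\, D_h\, \sigma_\Omega^2\, \sigma_f^2
\end{align}
```

The cross terms in the squared sum vanish because distinct weights are independent with zero mean. The last line uses the fact that a ReLU clips the negative half of a symmetric distribution, so E[h_j²] = σ_f²/2. The second line gives the variance in terms of the activations h; the third gives it in terms of the previous pre-activations f.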

25 of 26

Aim: keep variance same between two layers

Should choose:

This is called He initialization.
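
He initialization is usually stated as drawing the weights with zero mean and variance 2 / D_h, where D_h is the number of units feeding the layer; the slide's own formula did not survive, so this symbol is an assumption. A minimal NumPy sketch that checks the claim empirically for a ReLU network with zero biases (the helper name, width and depth are illustrative):

```python
import numpy as np

def preactivation_variances(weight_var, width=500, depth=10, seed=0):
    """Empirical variance of the pre-activations at each layer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal(width)             # first pre-activations, variance about 1
    variances = [f.var()]
    for _ in range(depth):
        h = np.maximum(f, 0.0)                  # ReLU activations
        omega = rng.normal(0.0, np.sqrt(weight_var), size=(width, width))
        f = omega @ h                           # next pre-activations, biases are zero
        variances.append(f.var())
    return variances

width = 500
he = preactivation_variances(2.0 / width, width)     # variance 2 / D_h
naive = preactivation_variances(1.0 / width, width)  # variance 1 / D_h, for comparison
print("variance 2/D_h:", [f"{v:.2f}" for v in he])
print("variance 1/D_h:", [f"{v:.4f}" for v in naive])
```

With variance 2 / D_h the pre-activation variance stays roughly constant from layer to layer, while with 1 / D_h it roughly halves at every layer.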

26 of 26

PyTorch code

  • Define a neural network
  • Initialize params with He initialization
  • Define loss function
  • Choose optimization algorithm
  • Choose initial learning rate
  • Choose learning rate schedule
  • Make some random data
  • Train for 100 batches
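
The code itself is not included on this slide. The steps above can be sketched as follows; the layer sizes, loss function, optimizer settings, schedule, and data are arbitrary assumptions chosen only to make the example run, not the lecture's actual values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
D_i, D_k, D_o = 10, 40, 5  # input, hidden, and output sizes (assumed)

# Define a neural network
model = nn.Sequential(
    nn.Linear(D_i, D_k), nn.ReLU(),
    nn.Linear(D_k, D_k), nn.ReLU(),
    nn.Linear(D_k, D_o),
)

# Initialize params with He initialization (weights) and zeros (biases)
def he_init(layer):
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)

model.apply(he_init)
init_std = model[0].weight.std().item()  # should be near sqrt(2 / D_i)

# Define loss function
criterion = nn.CrossEntropyLoss()
# Choose optimization algorithm and initial learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
# Choose learning rate schedule: halve the rate every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

# Make some random data: 100 examples with random class labels
x = torch.randn(100, D_i)
y = torch.randint(0, D_o, (100,))
loader = DataLoader(TensorDataset(x, y), batch_size=10, shuffle=True)

# Train for 100 batches (10 epochs of 10 batches each)
losses = []
for epoch in range(10):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    scheduler.step()

print(f"first-layer weight std at init: {init_std:.3f}")
print(f"loss on first batch: {losses[0]:.3f}, on last batch: {losses[-1]:.3f}")
```

`nn.init.kaiming_normal_` with its default `mode="fan_in"` draws weights with variance 2 divided by the number of inputs to the layer, which is the He choice for ReLU networks.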