1 of 17

Activation function + vanishing gradient problem

Andreas Baum

2 of 17

Sigmoid

f(x) = 1 / (1 + exp(-x))

Problems:

  • Vanishing gradient problem
  • Output isn't zero-centered: every output is positive, so the gradients for a layer's weights all share the same sign, updates zig-zag, and optimization becomes harder.
  • Slow convergence

https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
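
A minimal sketch (my addition, not from the slides) that shows both problems numerically: the sigmoid's derivative never exceeds 0.25, and its output is always positive, so it cannot be zero-centered.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                   # peaks at 0.25 when x = 0

x = np.linspace(-10.0, 10.0, 1001)
print(sigmoid(x).min(), sigmoid(x).max())  # outputs stay inside (0, 1): never zero-centered
print(sigmoid_grad(x).max())               # 0.25: every sigmoid layer shrinks the gradient
```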

3 of 17

Tanh

f(x) = (1 - exp(-2x)) / (1 + exp(-2x))

Problems:

  • Vanishing gradient problem

https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
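
A small sketch (my addition): tanh fixes the centering issue because its output is symmetric around 0, but its derivative 1 - f(x)^2 still collapses to ~0 for large |x|, so the vanishing gradient problem remains.

```python
import numpy as np

def tanh(x):
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))   # equivalent to np.tanh(x)

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2    # at most 1.0 (at x = 0), near 0 once the unit saturates

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh(x))       # symmetric around 0: zero-centered output
print(tanh_grad(x))  # ~0 at the tails: saturated units still block the gradient
```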

4 of 17

Vanishing gradient problem

https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484

8 of 17

Vanishing gradient problem

The derivative of the sigmoid, f'(x) = f(x) * (1 - f(x)), only takes values between 0 and 0.25.

During backpropagation, every sigmoid layer therefore multiplies the gradient by a factor in [0, 0.25].

9 of 17

Vanishing gradient problem

gradient for an early-layer weight = [0, 0.25] * [0, 0.25] * [0, 0.25] * [0, 0.25] * [0, 0.25]
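
Even in the best case, with every factor at its maximum of 0.25, five such layers shrink the gradient to at most 0.25^5 ≈ 0.00098; with the typical smaller factors the product is far tinier, as the next slide shows.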

10 of 17

Vanishing gradient problem

gradient for an early-layer weight = [0, 0.25] * [0, 0.25] * [0, 0.25] * [0, 0.25] * [0, 0.25]

0.00002754 = 0.12 * 0.25 * 0.06 * 0.17 * 0.09

11 of 17

Vanishing gradient problem

gradient for an early-layer weight = [0, 0.25] * [0, 0.25] * [0, 0.25] * [0, 0.25] * [0, 0.25]

0.00002754 = 0.12 * 0.25 * 0.06 * 0.17 * 0.09

new weight = old weight - learning rate * gradient

4.0 - 0.1 * 0.00002754 = 3.999997246
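
The same arithmetic as a runnable sketch (variable names are mine): multiplying the five per-layer factors gives the tiny gradient from the slide, and the resulting weight update is almost invisible.

```python
# Per-layer gradient factors from the slide, each inside [0, 0.25].
factors = [0.12, 0.25, 0.06, 0.17, 0.09]

grad = 1.0
for f in factors:
    grad *= f
print(grad)            # ~2.754e-05, i.e. 0.00002754

learning_rate = 0.1
old_weight = 4.0       # the slide's example weight
new_weight = old_weight - learning_rate * grad
print(new_weight)      # ~3.999997246: the early-layer weight barely moves
```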

12 of 17

Vanishing gradient solutions - Batch normalization

https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484
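
A minimal batch-normalization sketch (my own illustration, forward pass only, not code from the slides): normalizing each feature over the mini-batch keeps activations close to zero, the region where the sigmoid/tanh derivatives are largest, which is why it helps against vanishing gradients.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))        # activations this far from 0 would saturate a sigmoid
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std for every feature
```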

13 of 17

ReLU

f(x) = max(0, x)

Problems:

  • Can result in dead neurons: if a unit's pre-activation stays negative, its output and its gradient are both zero, so its weights stop updating (see the sketch below).
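
A small sketch (illustrative, not from the slides) of how a ReLU unit dies: when the pre-activation is negative for every input, both the output and the gradient are zero, so gradient descent never updates that unit again.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)     # 1 where the unit is active, 0 where it is "dead"

# Hypothetical pre-activations of one unit over a batch: all negative.
pre_activations = np.array([-3.2, -0.7, -1.5, -4.1])
print(relu(pre_activations))         # [0. 0. 0. 0.]  -> the unit outputs nothing
print(relu_grad(pre_activations))    # [0. 0. 0. 0.]  -> no gradient flows, the weights never change
```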

17 of 17

Softmax

Used for classification: computes the probability of each class from the raw output scores, so the outputs sum to 1.
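
A minimal softmax sketch (my addition): it exponentiates the class scores and normalizes them so they sum to 1, which is what makes the outputs usable as class probabilities.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()        # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])   # example raw scores for three classes
probs = softmax(logits)
print(probs)                         # ~[0.659 0.242 0.099]
print(probs.sum())                   # 1.0: a valid probability distribution
```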