1 of 97

Deep Learning (DEEP-0001)

7 – Gradients

2 of 97

Loss function

  • Training dataset of I pairs of input/output examples:

  • Loss function or cost function measures how badly the model performs:

or for short:

Returns a scalar that is smaller when the model maps inputs to outputs better
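A plausible form of these definitions, in the usual notation (model f[x, φ] with parameters φ); the exact symbols are an assumption:

Training set: {x_i, y_i} for i = 1, …, I
Loss: L[φ] = Σ_i ℓ_i[x_i, y_i],  e.g. least squares ℓ_i = (f[x_i, φ] − y_i)²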

3 of 97

Example

4 of 97

Problem 1: Computing gradients

Loss: sum of individual terms:

SGD Algorithm:

Parameters:

Need to compute gradients
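A minimal sketch of one SGD step in Python, assuming a helper grad_loss that returns ∂L/∂φ for a batch (both names are illustrative, not the course code):

def sgd_step(phi, grad_loss, batch, alpha=0.01):
    """One SGD update: phi <- phi - alpha * dL/dphi, evaluated on a batch."""
    grads = grad_loss(phi, batch)   # the expensive part: computing the gradients
    return [p - alpha * g for p, g in zip(phi, grads)]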

5 of 97

Why is this such a big deal?

  • A neural network is just an equation:

  • But it’s a huge equation, and we need to compute the derivative
    • for every parameter
    • for every point in the batch
    • for every iteration of SGD

6 of 97

Problem 2: initialization

Where should we start the parameters before we commence SGD?

7 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

8 of 97

Problem 1: Computing gradients

Loss: sum of individual terms:

SGD Algorithm:

Parameters:

Need to compute gradients

9 of 97

Algorithm to compute gradient efficiently

  • Backpropagation algorithm
  • Rumelhart, Hinton, and Williams (1986)

10 of 97

BackProp intuition #1: the forward pass

  • The orange weight multiplies the activation (ReLU output) in the previous layer
  • We want to know how a change in the orange weight affects the loss
  • If we double the activation in the previous layer, the weight will have twice the effect
  • Conclusion: we need to know the activations at each layer (made precise below).
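In symbols: if weight ω multiplies activation h, its contribution to the next pre-activation is ω·h, so

∂(ω·h)/∂ω = h

i.e. the gradient with respect to a weight is proportional to the activation feeding into it, which is why the activations must be stored during the forward pass.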

11 of 97

BackProp intuition #2: the backward pass

To calculate how a small change in a weight or bias feeding into hidden layer h3 modifies the loss, we need to know:

    • how a change in layer h3 changes the model output f
    • how a change in model output changes the loss l

12 of 97

BackProp intuition #2: the backward pass

To calculate how a small change in a weight or bias feeding into hidden layer h2 modifies the loss, we need to know:

    • how a change in layer h2 affects h3
    • how h3 changes the model output
    • how this output changes the loss

13 of 97

BackProp intuition #2: the backward pass

To calculate how a small change in a weight or bias feeding into hidden layer h1 modifies the loss, we need to know:

    • how a change in layer h1 affects layer h2
    • how a change in layer h2 affects layer h3
    • how layer h3 changes the model output
    • how the model output changes the loss
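Written as a chain of derivatives (scalar intuition; in the vector case each factor becomes a Jacobian), for a weight ω feeding into layer h1:

∂ℓ/∂ω = ∂h1/∂ω · ∂h2/∂h1 · ∂h3/∂h2 · ∂f/∂h3 · ∂ℓ/∂f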

14 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

15 of 97

Toy function

  • Consists of a series of functions that are composed with each other.
  • Unlike a neural network, it uses only scalars (not vectors)
  • “Activation functions” are sin, exp, and cos (a sketch follows below)
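A minimal Python sketch of such a function; the exact form is an assumption, chosen to be consistent with the parameter names β_k, ω_k and the annotations (ω3, −sin[f2], hk) later in this deck:

import math

def toy_f(x, beta, omega):
    """Composition of sin, exp, cos 'activations' with scalar parameters beta[k], omega[k]."""
    return beta[3] + omega[3] * math.cos(
        beta[2] + omega[2] * math.exp(
            beta[1] + omega[1] * math.sin(beta[0] + omega[0] * x)))

def toy_loss(x, y, beta, omega):
    """Least-squares loss for a single training pair (x, y)."""
    return (toy_f(x, beta, omega) - y) ** 2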

16 of 97

Toy function

Derivatives
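Presumably the derivatives of the three “activation functions”:

d/dz sin[z] = cos[z]        d/dz exp[z] = exp[z]        d/dz cos[z] = −sin[z]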

17 of 97

Gradients of toy function

We want to calculate:
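Presumably the derivatives of the per-example loss with respect to every parameter:

∂ℓ_i/∂β_k and ∂ℓ_i/∂ω_k for k = 0, 1, 2, 3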

 

18 of 97

Gradients of composed functions

Calculating expressions by hand:

    • some expressions are very complicated.
    • there is obvious redundancy (look at the sin terms in the bottom equation)

19 of 97

Forward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities
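A sketch of those intermediate quantities for the (assumed) toy function above:

import math

def toy_forward(x, y, beta, omega):
    """Forward pass: compute and keep every intermediate value for later reuse."""
    f0 = beta[0] + omega[0] * x
    h1 = math.sin(f0)
    f1 = beta[1] + omega[1] * h1
    h2 = math.exp(f1)
    f2 = beta[2] + omega[2] * h2
    h3 = math.cos(f2)
    f3 = beta[3] + omega[3] * h3
    loss = (f3 - y) ** 2
    return (f0, h1, f1, h2, f2, h3, f3), loss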

22 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

24 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The first of these derivatives is trivial

25 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The second of these derivatives is computed via the chain rule

 

26 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The second of these derivatives is computed via the chain rule

 

 

 

27 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The second of these derivatives is computed via the chain rule

Already computed!

ω3

 

28 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The remaining derivatives are also calculated by further use of the chain rule

29 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The remaining derivatives are also calculated by further use of the chain rule

Already computed!

-sin[f2]

30 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The remaining derivatives are also calculated by further use of the chain rule (written out below)
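Written out for the (assumed) toy function, each new derivative multiplies the previous one by a single local term:

∂ℓ/∂f3 = 2 (f3 − y)
∂ℓ/∂h3 = ω3 · ∂ℓ/∂f3
∂ℓ/∂f2 = −sin[f2] · ∂ℓ/∂h3
∂ℓ/∂h2 = ω2 · ∂ℓ/∂f2
∂ℓ/∂f1 = exp[f1] · ∂ℓ/∂h2
∂ℓ/∂h1 = ω1 · ∂ℓ/∂f1
∂ℓ/∂f0 = cos[f0] · ∂ℓ/∂h1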

33 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

  • Another application of the chain rule

 

 

 

34 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

  • Another application of the chain rule

Already calculated in part 1.

hk

 

35 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

  • Another application of the chain rule
  • Similarly for β parameters
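In symbols (for the assumed toy function), each parameter derivative reuses a quantity from part 1, since fk = βk + ωk·hk:

∂ℓ/∂ωk = hk · ∂ℓ/∂fk   (with h0 = x)
∂ℓ/∂βk = ∂ℓ/∂fk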

36 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

37 of 97

Examples:

38 of 97

Backpropagation

66 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

67 of 97

Matrix calculus

Scalar function f[] of a vector a

68 of 97

Matrix calculus

Scalar function f[] of a matrix A

69 of 97

Matrix calculus

Vector function f[] of vector a
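The derivative objects presumably defined on these slides, using the convention that the derivative of a scalar with respect to a vector or matrix has the same shape as that vector or matrix:

∂f/∂a: a vector with elements ∂f/∂a_d
∂f/∂A: a matrix with elements ∂f/∂A_ij
∂f/∂a for vector-valued f: a matrix (the Jacobian) with elements ∂f_i/∂a_j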

70 of 97

Comparing vector and matrix

Scalar derivatives:

71 of 97

Comparing vector and matrix

Scalar derivatives:

Matrix derivatives:

72 of 97

Comparing vector and matrix

Scalar derivatives:

Matrix derivatives:

73 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

74 of 97

The forward pass

1. Write this as a series of intermediate calculations

75 of 97

The forward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities
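A Python sketch of this forward pass for a deep ReLU network, assuming the usual notation (pre-activations f_k = β_k + Ω_k·h_k, activations h_{k+1} = ReLU[f_k]); names are illustrative:

import numpy as np

def forward(x, betas, Omegas):
    """Forward pass that stores all pre-activations f_k and activations h_k."""
    fs, hs = [], [x]
    h = x
    for k in range(len(Omegas)):
        f = betas[k] + Omegas[k] @ h      # f_k = beta_k + Omega_k h_k
        fs.append(f)
        if k < len(Omegas) - 1:           # no ReLU after the final layer
            h = np.maximum(f, 0.0)        # h_{k+1} = ReLU[f_k]
            hs.append(h)
    return fs, hs                         # fs[-1] is the network output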

76 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

77 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

79 of 97

Yikes!

  • But:

  • Quite similar to:

80 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

82 of 97

Derivative of ReLU

83 of 97

Derivative of ReLU

“Indicator function”

84 of 97

Derivative of ReLU

1. Consider:

where:

2. We could equivalently write:

3. Taking the derivative

4. We can equivalently pointwise multiply by diagonal
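In symbols, a reconstruction of these four steps (with I[·] the indicator function and ⊙ pointwise multiplication):

1. h = ReLU[f], applied pointwise, where ReLU[z] = max(0, z)
2. Equivalently, h = I[f > 0] ⊙ f (the negative entries are zeroed out)
3. ∂h/∂f = diag(I[f > 0])
4. Multiplying a vector by this diagonal matrix is the same as pointwise multiplying it by I[f > 0]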

85 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

86 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

4. Take derivatives w.r.t. parameters
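A sketch of the resulting backward-pass relations, in the same assumed notation as the forward-pass sketch above, starting from the derivative of the loss with respect to the network output and working backwards (⊙ is pointwise multiplication):

∂ℓ/∂h_k = Ω_k^T · ∂ℓ/∂f_k
∂ℓ/∂f_{k−1} = I[f_{k−1} > 0] ⊙ ∂ℓ/∂h_k
∂ℓ/∂β_k = ∂ℓ/∂f_k
∂ℓ/∂Ω_k = ∂ℓ/∂f_k · h_k^T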

88 of 97

Backprop summary

95 of 97

Pros and cons

  • Extremely efficient
    • Only need matrix multiplication and thresholding for ReLU functions
  • Memory hungry – must store all the intermediate quantities
  • Sequential
    • can process multiple batches in parallel
    • but things get harder if the whole model doesn’t fit on one machine.

96 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

97 of 97

Algorithmic differentiation

  • Modern deep learning frameworks compute derivatives automatically
  • You just have to specify the model and the loss
  • How? Algorithmic differentiation
    • Each component knows how to compute its own derivative
      • ReLU knows how to compute deriv of output w.r.t. input
      • Linear function knows how to compute deriv of output w.r.t. input
      • Linear function knows how to compute deriv of output w.r.t. parameter
    • You specify the order of the components
    • It can then compute the chain of derivatives
  • Works with branches as long as it’s still an acyclic graph
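A minimal Python sketch of this idea (illustrative classes only, not any particular framework's API):

import numpy as np

class ReLU:
    def forward(self, x):
        self.x = x                        # cache the input for the backward pass
        return np.maximum(x, 0.0)
    def backward(self, grad_out):
        return grad_out * (self.x > 0)    # derivative of output w.r.t. input

class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        self.x = x                        # cache the incoming activation
        return self.W @ x + self.b
    def backward(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)   # derivative w.r.t. parameters
        self.grad_b = grad_out
        return self.W.T @ grad_out                 # derivative w.r.t. input

def backprop(layers, x, dloss_doutput):
    """Run the components in the specified order, then chain their derivatives in reverse."""
    for layer in layers:                  # forward pass
        x = layer.forward(x)
    grad = dloss_doutput(x)               # derivative of the loss w.r.t. the output
    for layer in reversed(layers):        # backward pass
        grad = layer.backward(grad)
    return grad

# e.g. layers = [Linear(W0, b0), ReLU(), Linear(W1, b1)]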