1 of 97

Deep Learning (DEEP-0001)

7 – Gradients

2 of 97

Loss function

  • Training dataset of I pairs of input/output examples:

  • Loss function or cost function measures how badly the model performs:

or for short:

Returns a scalar that is smaller when the model maps inputs to outputs better
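A plausible form of these definitions, in the usual notation (model f[x, φ] with parameters φ); the exact symbols are an assumption:

Training set: {x_i, y_i} for i = 1, …, I
Loss: L[φ] = Σ_i ℓ_i[x_i, y_i],  e.g. least squares ℓ_i = (f[x_i, φ] − y_i)²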

3 of 97

Example

4 of 97

Problem 1: Computing gradients

Loss: sum of individual terms:

SGD Algorithm:

Parameters:

Need to compute gradients
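A minimal sketch of one SGD step in Python, assuming a helper grad_loss that returns ∂L/∂φ for a batch (both names are illustrative, not the course code):

def sgd_step(phi, grad_loss, batch, alpha=0.01):
    """One SGD update: phi <- phi - alpha * dL/dphi, evaluated on a batch."""
    grads = grad_loss(phi, batch)   # the expensive part: computing the gradients
    return [p - alpha * g for p, g in zip(phi, grads)]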

5 of 97

Why is this such a big deal?

  • A neural network is just an equation:

  • But it’s a huge equation, and we need to compute the derivative
    • for every parameter
    • for every point in the batch
    • for every iteration of SGD

6 of 97

Problem 2: initialization

Where should we start the parameters before we commence SGD?

7 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

8 of 97

Problem 1: Computing gradients

Loss: sum of individual terms:

SGD Algorithm:

Parameters:

Need to compute gradients

9 of 97

Algorithm to compute gradient efficiently

  • Backpropagation algorithm
  • Rumelhart, Hinton, and Williams (1986)

10 of 97

BackProp intuition #1: the forward pass

  • The orange weight multiplies the activation (ReLU output) in the previous layer
  • We want to know how a change in the orange weight affects the loss
  • If we double the activation in the previous layer, the weight will have twice the effect
  • Conclusion: we need to know the activations at each layer (made precise below).
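In symbols: if weight ω multiplies activation h, its contribution to the next pre-activation is ω·h, so

∂(ω·h)/∂ω = h

i.e. the gradient with respect to a weight is proportional to the activation feeding into it, which is why the activations must be stored during the forward pass.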

11 of 97

BackProp intuition #2: the backward pass

To calculate how a small change in a weight or bias feeding into hidden layer h3 modifies the loss, we need to know:

    • how a change in layer h3 changes the model output f
    • how a change in model output changes the loss l

12 of 97

BackProp intuition #2: the backward pass

To calculate how a small change in a weight or bias feeding into hidden layer h2 modifies the loss, we need to know:

    • how a change in layer h2 affects h3
    • how h3 changes the model output
    • how this output changes the loss

13 of 97

BackProp intuition #2: the backward pass

To calculate how a small change in a weight or bias feeding into hidden layer h1 modifies the loss, we need to know:

    • how a change in layer h1 affects layer h2
    • how a change in layer h2 affects layer h3
    • how layer h3 changes the model output
    • how the model output changes the loss
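Written as a chain of derivatives (scalar intuition; in the vector case each factor becomes a Jacobian), for a weight ω feeding into layer h1:

∂ℓ/∂ω = ∂h1/∂ω · ∂h2/∂h1 · ∂h3/∂h2 · ∂f/∂h3 · ∂ℓ/∂f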

14 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

15 of 97

Toy function

  • Consists of a series of functions that are composed with each other.
  • Unlike a neural network, it uses only scalars (not vectors)
  • “Activation functions” are sin, exp, and cos (a sketch follows below)
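A minimal Python sketch of such a function; the exact form is an assumption, chosen to be consistent with the parameter names β_k, ω_k and the annotations (ω3, −sin[f2], hk) later in this deck:

import math

def toy_f(x, beta, omega):
    """Composition of sin, exp, cos 'activations' with scalar parameters beta[k], omega[k]."""
    return beta[3] + omega[3] * math.cos(
        beta[2] + omega[2] * math.exp(
            beta[1] + omega[1] * math.sin(beta[0] + omega[0] * x)))

def toy_loss(x, y, beta, omega):
    """Least-squares loss for a single training pair (x, y)."""
    return (toy_f(x, beta, omega) - y) ** 2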

16 of 97

Toy function

Derivatives
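Presumably the derivatives of the three “activation functions”:

d/dz sin[z] = cos[z]        d/dz exp[z] = exp[z]        d/dz cos[z] = −sin[z]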

17 of 97

Gradients of toy function

We want to calculate:
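Presumably the derivatives of the per-example loss with respect to every parameter:

∂ℓ_i/∂β_k and ∂ℓ_i/∂ω_k for k = 0, 1, 2, 3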

 

18 of 97

Gradients of composed functions

Calculating expressions by hand:

    • some expressions are very complicated.
    • there is obvious redundancy (look at the sin terms in the bottom equation)

19 of 97

Forward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities
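A sketch of those intermediate quantities for the (assumed) toy function above:

import math

def toy_forward(x, y, beta, omega):
    """Forward pass: compute and keep every intermediate value for later reuse."""
    f0 = beta[0] + omega[0] * x
    h1 = math.sin(f0)
    f1 = beta[1] + omega[1] * h1
    h2 = math.exp(f1)
    f2 = beta[2] + omega[2] * h2
    h3 = math.cos(f2)
    f3 = beta[3] + omega[3] * h3
    loss = (f3 - y) ** 2
    return (f0, h1, f1, h2, f2, h3, f3), loss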

22 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

24 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The first of these derivatives is trivial

25 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The second of these derivatives is computed via the chain rule

 

26 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The second of these derivatives is computed via the chain rule

 

 

 

27 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The second of these derivatives is computed via the chain rule

Already computed!

ω3

 

28 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The remaining derivatives are also calculated by further use of the chain rule

29 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The remaining derivatives are also calculated by further use of the chain rule

Already computed!

-sin[f2]

30 of 97

Backward pass

1. Compute the derivatives of the loss with respect to these intermediate quantities, but in reverse order.

  • The remaining derivatives are also calculated by further use of the chain rule (written out below)
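Written out for the (assumed) toy function, each new derivative multiplies the previous one by a single local term:

∂ℓ/∂f3 = 2 (f3 − y)
∂ℓ/∂h3 = ω3 · ∂ℓ/∂f3
∂ℓ/∂f2 = −sin[f2] · ∂ℓ/∂h3
∂ℓ/∂h2 = ω2 · ∂ℓ/∂f2
∂ℓ/∂f1 = exp[f1] · ∂ℓ/∂h2
∂ℓ/∂h1 = ω1 · ∂ℓ/∂f1
∂ℓ/∂f0 = cos[f0] · ∂ℓ/∂h1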

33 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

  • Another application of the chain rule

 

 

 

34 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

  • Another application of the chain rule

Already calculated in part 1.

hk

 

35 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

  • Another application of the chain rule
  • Similarly for β parameters
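In symbols (for the assumed toy function), each parameter derivative reuses a quantity from part 1, since fk = βk + ωk·hk:

∂ℓ/∂ωk = hk · ∂ℓ/∂fk   (with h0 = x)
∂ℓ/∂βk = ∂ℓ/∂fk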

36 of 97

Backward pass

2. Find how the loss changes as a function of the parameters β and ω.

37 of 97

Examples:

38 of 97

Backpropagation

66 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

67 of 97

Matrix calculus

Scalar function f[] of a vector a

68 of 97

Matrix calculus

Scalar function f[] of a matrix A

69 of 97

Matrix calculus

Vector function f[] of vector a
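The derivative objects presumably defined on these slides, using the convention that the derivative of a scalar with respect to a vector or matrix has the same shape as that vector or matrix:

∂f/∂a: a vector with elements ∂f/∂a_d
∂f/∂A: a matrix with elements ∂f/∂A_ij
∂f/∂a for vector-valued f: a matrix (the Jacobian) with elements ∂f_i/∂a_j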

70 of 97

Comparing vector and matrix

Scalar derivatives:

71 of 97

Comparing vector and matrix

Scalar derivatives:

Matrix derivatives:

72 of 97

Comparing vector and matrix

Scalar derivatives:

Matrix derivatives:

73 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

74 of 97

The forward pass

1. Write this as a series of intermediate calculations

75 of 97

The forward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities
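A Python sketch of this forward pass for a deep ReLU network, assuming the usual notation (pre-activations f_k = β_k + Ω_k·h_k, activations h_{k+1} = ReLU[f_k]); names are illustrative:

import numpy as np

def forward(x, betas, Omegas):
    """Forward pass that stores all pre-activations f_k and activations h_k."""
    fs, hs = [], [x]
    h = x
    for k in range(len(Omegas)):
        f = betas[k] + Omegas[k] @ h      # f_k = beta_k + Omega_k h_k
        fs.append(f)
        if k < len(Omegas) - 1:           # no ReLU after the final layer
            h = np.maximum(f, 0.0)        # h_{k+1} = ReLU[f_k]
            hs.append(h)
    return fs, hs                         # fs[-1] is the network output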

76 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

77 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

79 of 97

Yikes!

  • But:

  • Quite similar to:

80 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

82 of 97

Derivative of ReLU

83 of 97

Derivative of ReLU

“Indicator function”

84 of 97

Derivative of ReLU

1. Consider:

where:

2. We could equivalently write:

3. Taking the derivative

4. We can equivalently pointwise multiply by diagonal
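In symbols, a reconstruction of these four steps (with I[·] the indicator function and ⊙ pointwise multiplication):

1. h = ReLU[f], applied pointwise, where ReLU[z] = max(0, z)
2. Equivalently, h = I[f > 0] ⊙ f (the negative entries are zeroed out)
3. ∂h/∂f = diag(I[f > 0])
4. Multiplying a vector by this diagonal matrix is the same as pointwise multiplying it by I[f > 0]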

85 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

86 of 97

The backward pass

1. Write this as a series of intermediate calculations

2. Compute these intermediate quantities

3. Take derivatives of output with respect to intermediate quantities

4. Take derivatives w.r.t. parameters
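A sketch of the resulting backward-pass relations, in the same assumed notation as the forward-pass sketch above, starting from the derivative of the loss with respect to the network output and working backwards (⊙ is pointwise multiplication):

∂ℓ/∂h_k = Ω_k^T · ∂ℓ/∂f_k
∂ℓ/∂f_{k−1} = I[f_{k−1} > 0] ⊙ ∂ℓ/∂h_k
∂ℓ/∂β_k = ∂ℓ/∂f_k
∂ℓ/∂Ω_k = ∂ℓ/∂f_k · h_k^T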

88 of 97

Backprop summary

95 of 97

Pros and cons

  • Extremely efficient
    • Only need matrix multiplication and thresholding for ReLU functions
  • Memory hungry – must store all the intermediate quantities
  • Sequential
    • can process multiple batches in parallel
    • but things get harder if the whole model doesn’t fit on one machine.

96 of 97

Gradients

  • Backpropagation intuition
  • Toy model
  • Background mathematics
  • Backpropagation forward pass
  • Backpropagation backward pass
  • Algorithmic differentiation

97 of 97

Algorithmic differentiation

  • Modern deep learning frameworks compute derivatives automatically
  • You just have to specify the model and the loss
  • How? Algorithmic differentiation
    • Each component knows how to compute its own derivative
      • ReLU knows how to compute deriv of output w.r.t. input
      • Linear function knows how to compute deriv of output w.r.t. input
      • Linear function knows how to compute deriv of output w.r.t. parameter
    • You specify the order of the components
    • It can then compute the chain of derivatives
  • Works with branches as long as it’s still an acyclic graph
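A minimal Python sketch of this idea (illustrative classes only, not any particular framework's API):

import numpy as np

class ReLU:
    def forward(self, x):
        self.x = x                        # cache the input for the backward pass
        return np.maximum(x, 0.0)
    def backward(self, grad_out):
        return grad_out * (self.x > 0)    # derivative of output w.r.t. input

class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        self.x = x                        # cache the incoming activation
        return self.W @ x + self.b
    def backward(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)   # derivative w.r.t. parameters
        self.grad_b = grad_out
        return self.W.T @ grad_out                 # derivative w.r.t. input

def backprop(layers, x, dloss_doutput):
    """Run the components in the specified order, then chain their derivatives in reverse."""
    for layer in layers:                  # forward pass
        x = layer.forward(x)
    grad = dloss_doutput(x)               # derivative of the loss w.r.t. the output
    for layer in reversed(layers):        # backward pass
        grad = layer.backward(grad)
    return grad

# e.g. layers = [Linear(W0, b0), ReLU(), Linear(W1, b1)]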