1 of 40

ME 4990: Intro to CS - Object-Oriented Programming & Machine Learning 101

Neural Network Training: Backpropagation

2 of 40

Outline

  • Forward Propagation
    • Forward propagation
    • Loss Function
  • Training
    • Gradient Descent Introduction
    • Backward Propagation

3 of 40

Forward Propagation Demonstration

  •  

 

 

 

 

 

 

[Diagram: input → Fully Connected → Hidden 1 → ReLU → Hidden 1a → FC → Hidden 2 → ReLU → Hidden 2a → FC → output]
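  • A minimal PyTorch sketch of the pictured architecture; the layer widths used here (2 → 3 → 3 → 1) are placeholders, not the slide's actual sizes.

import torch
import torch.nn as nn

# FC -> ReLU -> FC -> ReLU -> FC, matching the diagram above.
model = nn.Sequential(
    nn.Linear(2, 3),   # input     -> Hidden 1
    nn.ReLU(),         # Hidden 1  -> Hidden 1a
    nn.Linear(3, 3),   # Hidden 1a -> Hidden 2
    nn.ReLU(),         # Hidden 2  -> Hidden 2a
    nn.Linear(3, 1),   # Hidden 2a -> output
)

x = torch.tensor([[1.0, 2.0]])   # one sample with 2 features
y_hat = model(x)                 # forward propagation
print(y_hat.shape)               # torch.Size([1, 1])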

4 of 40

Forward Propagation

  • What are the dimensions of the weight matrix and the bias vector for each fully connected layer?
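  • As a general fact (not the slide's specific numbers): a fully connected layer mapping n inputs to m outputs has a weight matrix of shape m × n and a bias vector of length m. In PyTorch:

import torch.nn as nn

fc = nn.Linear(in_features=4, out_features=3)   # example sizes only
print(fc.weight.shape)   # torch.Size([3, 4]) -> (out_features, in_features)
print(fc.bias.shape)     # torch.Size([3])    -> (out_features,)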

5 of 40

Forward Propagation

  •  

6 of 40

Forward Propagation

  •  

7 of 40

Forward Propagation

  • After ReLU

 

 

 

 

 

 

[Diagram: input → Fully Connected → Hidden 1 → ReLU → Hidden 1a → FC → Hidden 2 → ReLU → Hidden 2a → FC → output]
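  • For reference, ReLU simply clamps negative activations to zero, element-wise:

import torch
import torch.nn.functional as F

h = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(F.relu(h))   # negatives become 0: [0., 0., 0., 1.5]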

8 of 40

Forward Propagation

  •  

9 of 40

Forward Propagation

  •  

10 of 40

Forward Propagation

  • After ReLU

 

 

 

 

 

 

[Diagram: input → Fully Connected → Hidden 1 → ReLU → Hidden 1a → FC → Hidden 2 → ReLU → Hidden 2a → FC → output]

11 of 40

Forward Propagation

  •  

12 of 40

Forward Propagation

  •  

13 of 40

Forward Propagation

  • Forward Propagation

 

 

 

 

 

 

[Diagram: input → Fully Connected → Hidden 1 → ReLU → Hidden 1a → FC → Hidden 2 → ReLU → Hidden 2a → FC → output]

14 of 40

Mean squared error (MSE)

  • MSE = (1/N) Σᵢ (ŷᵢ − yᵢ)²: the average of the squared differences between the predictions ŷ and the labels y.

x   Label y   Prediction ŷ
1   2         3
2   4         5
3   6         6
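  • Taking the third column of the table as the model's predictions ŷ (an assumption about the reconstructed table), PyTorch's built-in MSE loss gives the average squared error:

import torch
import torch.nn as nn

y_hat = torch.tensor([3.0, 5.0, 6.0])   # predictions (third column above)
y     = torch.tensor([2.0, 4.0, 6.0])   # labels
mse = nn.MSELoss()(y_hat, y)            # mean of squared differences
print(mse)                              # (1 + 1 + 0) / 3 ≈ 0.6667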

15 of 40

Mean squared error (MSE)

  •  

 

 

 

 

 

 

[Diagram: input → Fully Connected → Hidden 1 → ReLU → Hidden 1a → FC → Hidden 2 → ReLU → Hidden 2a → FC → output]

16 of 40

Log loss

  • Log loss (binary cross-entropy): L = −(1/N) Σᵢ [ yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ) ], used when the network outputs a probability for a binary label.
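  • A quick check with PyTorch's built-in binary cross-entropy loss (the probabilities and labels below are made-up example values):

import torch
import torch.nn as nn

y_hat = torch.tensor([0.9, 0.2, 0.7])   # predicted probabilities
y     = torch.tensor([1.0, 0.0, 1.0])   # binary labels
loss = nn.BCELoss()(y_hat, y)           # log loss / binary cross-entropy
print(loss)   # -(log 0.9 + log 0.8 + log 0.7) / 3 ≈ 0.2284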

17 of 40

Outline

  • Forward Propagation
    • Forward propagation
    • Loss Function
  • Training
    • Gradient Descent Introduction
    • Backward Propagation

18 of 40

Training

  •  

19 of 40

Training

  • In PyTorch it is easy, as the sketch below shows.
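  • A minimal sketch of such a training loop; the data, model sizes, learning rate, and epoch count below are placeholders, not the slide's actual values.

import torch
import torch.nn as nn

x = torch.randn(8, 2)                    # placeholder inputs
y = torch.randn(8, 1)                    # placeholder targets
model = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1))

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    y_hat = model(x)             # forward propagation
    loss = loss_fn(y_hat, y)     # measure the error
    optimizer.zero_grad()        # clear old gradients
    loss.backward()              # backward propagation (compute gradients)
    optimizer.step()             # gradient descent update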

20 of 40

Training

  •  

21 of 40

Gradient Descent

  •  

22 of 40

Gradient

  • We have a topographic map.

  • If we are standing at the point marked 489, in which direction should we go to ascend fastest?
  • In which direction should we go to descend fastest?

23 of 40

Gradient

  • The gradient ∇f points in the direction of steepest ascent; the negative gradient −∇f points in the direction of steepest descent.

24 of 40

Use an iterative approach to find the minimum

  • Iterating step by step, we find a local minimum (see the sketch below).

  • It is popular simply because it is simple and applicable to any function (strictly, any differentiable function).
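  • A minimal sketch of the iteration on f(x) = (x − 3)²; the starting point and learning rate are arbitrary.

def f_prime(x):
    return 2 * (x - 3)          # derivative of f(x) = (x - 3)^2

x = 0.0                         # arbitrary starting point
lr = 0.1                        # learning rate (step size)
for _ in range(50):
    x = x - lr * f_prime(x)     # step against the gradient
print(x)                        # close to the minimizer x = 3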

25 of 40

Gradient Descent: learning rate

  • If the learning rate is too small, progress toward the minimum is very slow.

  • If the learning rate is too big, we overshoot the minimum (and may even diverge); both cases are illustrated below.
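  • Both failure modes on the same f(x) = (x − 3)², with learning rates chosen only for illustration:

def step(x, lr):
    return x - lr * 2 * (x - 3)      # one gradient step on f(x) = (x - 3)^2

x_small, x_big = 0.0, 0.0
for _ in range(20):
    x_small = step(x_small, 0.001)   # too small: barely moves toward 3
    x_big   = step(x_big,   1.1)     # too big: overshoots and diverges
print(x_small, x_big)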

26 of 40

Gradient Descent on 2D

  • Frankly, gradient descent on a 1D function doesn't make much sense, as there is no real choice of "direction".

  • But in 2D or above, to reach the minimum fastest with an iterative approach:
    • Always step in the negative gradient direction (steepest descent), as in the sketch below!
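  • A small sketch in 2D on the arbitrary bowl-shaped function f(x, y) = x² + 2y²: each step moves against the gradient.

import numpy as np

def grad(p):
    return np.array([2 * p[0], 4 * p[1]])   # gradient of f(x, y) = x^2 + 2*y^2

p = np.array([4.0, -3.0])                   # arbitrary starting point
lr = 0.1
for _ in range(100):
    p = p - lr * grad(p)                    # step in the negative gradient direction
print(p)                                    # approaches the minimum at (0, 0)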

27 of 40

Training

  •  

28 of 40

Training

  •  

29 of 40

Outline

  • Loss function
  • Backpropagation

30 of 40

Backpropagation

  • A cool visualization that builds intuition, even if it does not give true understanding by itself:

https://youtu.be/Ilg3gGewQ5U?si=p7j0ajHG2yH72syQ

31 of 40

Chain Rule

  •  
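  • As a generic example of the chain rule: for y = f(g(x)), dy/dx = f′(g(x)) · g′(x). A quick autograd check, using f(u) = u² and g(x) = sin x as example functions:

import torch

x = torch.tensor(1.3, requires_grad=True)
y = torch.sin(x) ** 2                        # y = f(g(x)) with f(u) = u^2, g(x) = sin(x)
y.backward()                                 # autograd applies the chain rule

manual = (2 * torch.sin(x) * torch.cos(x)).detach()   # f'(g(x)) * g'(x)
print(x.grad, manual)                        # the two values match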

32 of 40

Chain Rule

  •  

33 of 40

Chain Rule

  •  

34 of 40

Let’s make it fancier

  •  

35 of 40

Let’s make it fancier

  •  

36 of 40

Backpropagation

  •  
  •  
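  • A generic scalar example (not the slide's network): backpropagation is just the chain rule applied layer by layer, and a hand-derived gradient matches autograd.

import torch

# Tiny network: h = w1*x + b1, a = relu(h), y_hat = w2*a + b2, loss = (y_hat - y)^2
x, y = 2.0, 1.0
w1 = torch.tensor(0.5, requires_grad=True)
b1 = torch.tensor(0.1, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)
b2 = torch.tensor(0.2, requires_grad=True)

h = w1 * x + b1
a = torch.relu(h)
y_hat = w2 * a + b2
loss = (y_hat - y) ** 2
loss.backward()                              # backpropagation via autograd

# Chain rule by hand for dloss/dw1 (here h > 0, so the ReLU derivative is 1):
# dloss/dy_hat = 2*(y_hat - y), dy_hat/da = w2, da/dh = 1, dh/dw1 = x
manual = (2 * (y_hat - y) * w2 * 1.0 * x).detach()
print(w1.grad, manual)                       # the two gradients agree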

37 of 40

Backpropagation

  •  

38 of 40

Backpropagation

  •  
  •  

39 of 40

Backpropagation

  •  

40 of 40

Training

  •