Neural Networks II: Backpropagation

Vatsal Sivaratri

Modified from last year’s presentation

TJ Machine Learning Club

Review

The Perceptron

  • Linear function followed by a nonlinear activation function

[Perceptron diagram, with the weights and bias labeled]
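To make this concrete, here is a minimal sketch of a single perceptron in Python (the input values, weights, bias, and the choice of sigmoid as the activation are illustrative, not taken from the slides):

import math

def perceptron(x, w, b):
    # Linear function: weighted sum of the inputs plus the bias
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Nonlinear activation: sigmoid squashes z into the range (0, 1)
    return 1 / (1 + math.exp(-z))

# Two inputs with illustrative weights and bias
print(perceptron(x=[1.0, 2.0], w=[0.5, -1.0], b=0.1))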

Today’s Goal

  • Understand how we find the best values for the weights and biases through gradient descent

But first, Calculus! (At least the differential part)

Algebraic Approach:

Calculus Approach: dy/dx

IMPORTANT DISTINCTION!!!

d/dx is a command! It means "Take the derivative with respect to x."

dy/dx is a value! It means "The derivative of y with respect to x."

Rest of the Basic Calculus Rules
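For reference, a standard set of single-variable rules (listed here as an assumed summary of this slide):

d/dx (c) = 0 for a constant c
d/dx (x^n) = n·x^(n-1) (power rule)
d/dx (f + g) = f' + g' (sum rule)
d/dx (c·f) = c·f' (constant multiple rule)
d/dx (f·g) = f'·g + f·g' (product rule)
d/dx (e^x) = e^x, d/dx (ln x) = 1/x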

The Most Important - Chain Rule!

dy/dx = (dy/du) · (du/dx)
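A quick worked example (illustrative, not from the slides): let y = (x² + 1)³ and substitute u = x² + 1.

y = u³, so dy/du = 3u²
u = x² + 1, so du/dx = 2x
dy/dx = (dy/du) · (du/dx) = 3(x² + 1)² · 2x = 6x(x² + 1)²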

The Only Multivariable Calculus You'll Need!

Partial Derivatives:

Move from d/dx to ∂/∂x: measure the change with respect to one variable while the others stay constant.

Gradients:

Extend derivatives to vectors: the gradient points in the direction of steepest ascent in multiple dimensions.

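A small illustrative example: take f(x, y) = x²y.

∂f/∂x = 2xy (treat y as a constant)
∂f/∂y = x² (treat x as a constant)
∇f = (∂f/∂x, ∂f/∂y) = (2xy, x²), which points in the direction of steepest ascent at (x, y)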
Minimizing Error: Gradient Descent

The Intuitive Explanation

  • Picture the loss as a hilly landscape: a ball placed on it will roll down the hill

[Figure: loss curve, with an arrow marking the direction we push the weight]

  • The loss lives in a very high-dimensional space: each weight and bias is its own axis
  • Why not just jump straight to the global minimum? That is hard to do in high-dimensional spaces; there are far too many possibilities

The Gradient

  • The gradient is the direction of steepest ascent of the loss, so we push the weight in the opposite direction

w = w - α · ∂Loss/∂w   (the gradient, also called a derivative; subtraction gives us descent)
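Here is a minimal sketch of that update rule in Python, run on a one-dimensional loss L(w) = (w - 2)² (the loss function and starting point are made up for illustration):

alpha = 0.1   # learning rate
w = 10.0      # arbitrary starting weight

for step in range(50):
    grad = 2 * (w - 2)    # dL/dw, the gradient of L(w) = (w - 2)^2
    w = w - alpha * grad  # subtract: move in the direction of steepest descent

print(w)  # ends up very close to 2, the minimum of the loss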

The Gradient as Slope

The Learning Rate

  • α is the learning rate, and is generally a small positive value
  • It scales how big a step we take (see the sketch after this list)
    • Large α = big step
    • Small α = small step
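A small sketch of the effect of α on the same kind of one-dimensional loss L(w) = (w - 2)² (values chosen only for illustration):

def descend(alpha, w=10.0, steps=20):
    # Gradient descent on L(w) = (w - 2)^2 with the given learning rate
    for _ in range(steps):
        w = w - alpha * 2 * (w - 2)
    return w

print(descend(alpha=0.01))  # tiny steps: after 20 steps, still far from the minimum at 2
print(descend(alpha=0.1))   # moderate steps: close to 2
print(descend(alpha=1.1))   # steps too large: each update overshoots and w blows up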

Optimizing the Learning Rate

  • Getting the learning rate right is one of the most important parts of Neural Network training!

Calculating One Neural Network Iteration

[Network diagram: inputs n1 = 3, n2 = -2; hidden neurons n3 = -3, n4 = 8, n5 = 13; output n6 = 5. Weights: W13 = 1, W23 = 3, W14 = 4, W24 = 2, W15 = 3, W25 = -2, W36 = -1, W46 = -3, W56 = 2]

Linear activation function (y = x) and no biases; n1 = 3, n2 = -2, target y = 9, α = 0.1

Goal: Update W36

W36 = W36 - α · ∂E/∂W36

(E is the loss; it doesn't need to be this particular loss function!)

E = ½(n6 - y)²

n6 = W36·n3 + W46·n4 + W56·n5

∂E/∂W36 = ∂E/∂n6 · ∂n6/∂W36   ← the MOST IMPORTANT THING (the chain rule!)
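Spelling out the two factors using the definitions above:

∂E/∂n6 = n6 - y (derivative of ½(n6 - y)² with respect to n6)
∂n6/∂W36 = n3 (n6 is linear in W36, with coefficient n3)

So ∂E/∂W36 = (n6 - y) * n3.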

∂E/∂W36 = (n6 - y) * n3 = -4 * -3 = 12

W36 = W36 - α · ∂E/∂W36 = -1 - 0.1(12) = -1 - 1.2 = -2.2
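A quick numeric check of this computation in Python (a sketch based on the numbers above, not code from the slides):

# Forward pass with the given weights, linear activations, and no biases
n1, n2, y, alpha = 3, -2, 9, 0.1
W13, W23, W14, W24, W15, W25 = 1, 3, 4, 2, 3, -2
W36, W46, W56 = -1, -3, 2

n3 = W13 * n1 + W23 * n2             # -3
n4 = W14 * n1 + W24 * n2             # 8
n5 = W15 * n1 + W25 * n2             # 13
n6 = W36 * n3 + W46 * n4 + W56 * n5  # 5

dE_dW36 = (n6 - y) * n3              # (5 - 9) * (-3) = 12
W36_new = W36 - alpha * dE_dW36      # -1 - 0.1 * 12 = -2.2
print(dE_dW36, W36_new)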

Goal: Update W13

W13 = W13 - α · ∂E/∂W13

∂E/∂W13 = ∂E/∂n6 · ∂n6/∂n3 · ∂n3/∂W13   ← the MOST IMPORTANT THING again (the chain rule, now passing through n3)

One more equation is needed: n3 = W13·n1 + W23·n2
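Reading each factor off the equations above:

∂E/∂n6 = n6 - y
∂n6/∂n3 = W36
∂n3/∂W13 = n1

So ∂E/∂W13 = (n6 - y) * W36 * n1.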

∂E/∂W13 = (n6 - y) * W36 * n1 = (-4) * -1 * 3 = 12

W13 = W13 - α · ∂E/∂W13 = 1 - 0.1 * 12 = -0.2
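One way to sanity-check a hand-computed gradient is a finite-difference check: nudge W13 slightly and see how the loss changes. This is a sketch (not from the slides) using the same numbers:

def loss(W13, W23=3, W14=4, W24=2, W15=3, W25=-2, W36=-1, W46=-3, W56=2,
         n1=3, n2=-2, y=9):
    # Forward pass of the 2-3-1 linear network, then the squared-error loss
    n3 = W13 * n1 + W23 * n2
    n4 = W14 * n1 + W24 * n2
    n5 = W15 * n1 + W25 * n2
    n6 = W36 * n3 + W46 * n4 + W56 * n5
    return 0.5 * (n6 - y) ** 2

eps = 1e-6
numeric_grad = (loss(1 + eps) - loss(1 - eps)) / (2 * eps)
print(numeric_grad)  # approximately 12, matching the chain-rule answer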

NN After One Iteration

[Network diagram with the two updated weights: W13 = -0.2, W36 = -2.2; all other weights unchanged (W23 = 3, W14 = 4, W24 = 2, W15 = 3, W25 = -2, W46 = -3, W56 = 2). Linear activation function (y = x) and no biases; n1 = 3, n2 = -2, y = 9, α = 0.1]

Code (Please don’t worry about this too much)
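Not the original code from this slide; a minimal NumPy sketch of running the same kind of gradient-descent update on every weight of the 2-3-1 linear network from the worked example:

import numpy as np

# 2-3-1 network with linear activations and no biases, as in the worked example
W_in = np.array([[1.0, 4.0, 3.0],    # weights from n1 to n3, n4, n5
                 [3.0, 2.0, -2.0]])  # weights from n2 to n3, n4, n5
W_out = np.array([-1.0, -3.0, 2.0])  # weights from n3, n4, n5 to n6

x = np.array([3.0, -2.0])  # inputs n1, n2
y = 9.0                    # target
alpha = 0.001              # gentler than the slides' 0.1, since every weight moves at once

for step in range(200):
    hidden = x @ W_in                   # n3, n4, n5
    out = hidden @ W_out                # n6
    error = out - y                     # dE/dn6 for E = 0.5 * (n6 - y)^2

    grad_W_out = error * hidden             # dE/dW36, dE/dW46, dE/dW56
    grad_W_in = np.outer(x, error * W_out)  # dE/dW13 ... dE/dW25 via the chain rule

    W_out -= alpha * grad_W_out         # gradient descent step
    W_in -= alpha * grad_W_in

print(x @ W_in @ W_out)  # the output approaches the target of 9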

Check these videos out when you get a chance!
