Neural Networks II: Backpropagation

Vatsal Sivaratri

Modified from last year’s presentation

TJ Machine Learning Club

Review

The Perceptron

  • Linear function followed by a nonlinear activation function

[Perceptron diagram, with the weights and bias labeled]
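To make this concrete, here is a minimal sketch of a single perceptron in Python (the input values, weights, bias, and the choice of sigmoid as the activation are illustrative, not taken from the slides):

import math

def perceptron(x, w, b):
    # Linear function: weighted sum of the inputs plus the bias
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Nonlinear activation: sigmoid squashes z into the range (0, 1)
    return 1 / (1 + math.exp(-z))

# Two inputs with illustrative weights and bias
print(perceptron(x=[1.0, 2.0], w=[0.5, -1.0], b=0.1))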

Today’s Goal

  • Understand how we find the best values for the weights and biases through gradient descent

But first, Calculus! (At least the differential part)

Algebraic Approach:

Calculus Approach: dy/dx

IMPORTANT DISTINCTION!!!

d/dx is a command! It means "Take the derivative with respect to x."

dy/dx is a value! It means "The derivative of y with respect to x."

Rest of the Basic Calculus Rules
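For reference, a standard set of single-variable rules (listed here as an assumed summary of this slide):

d/dx (c) = 0 for a constant c
d/dx (x^n) = n·x^(n-1) (power rule)
d/dx (f + g) = f' + g' (sum rule)
d/dx (c·f) = c·f' (constant multiple rule)
d/dx (f·g) = f'·g + f·g' (product rule)
d/dx (e^x) = e^x, d/dx (ln x) = 1/x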

The Most Important - Chain Rule!

dy/dx = (dy/du) · (du/dx)
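A quick worked example (illustrative, not from the slides): let y = (x² + 1)³ and substitute u = x² + 1.

y = u³, so dy/du = 3u²
u = x² + 1, so du/dx = 2x
dy/dx = (dy/du) · (du/dx) = 3(x² + 1)² · 2x = 6x(x² + 1)²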

The Only Multivariable Calculus You'll Need!

Partial Derivatives:

Move from d/dx to ∂/∂x: measure the change with respect to one variable while the others stay constant.

Gradients:

Extend derivatives to vectors: the gradient points in the direction of steepest ascent in multiple dimensions.

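A small illustrative example: take f(x, y) = x²y.

∂f/∂x = 2xy (treat y as a constant)
∂f/∂y = x² (treat x as a constant)
∇f = (∂f/∂x, ∂f/∂y) = (2xy, x²), which points in the direction of steepest ascent at (x, y)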
Minimizing Error: Gradient Descent

The Intuitive Explanation

  • Picture the loss as a hilly landscape: a ball placed on it will roll down the hill

[Figure: loss curve, with an arrow marking the direction we push the weight]

  • The loss lives in a very high-dimensional space: each weight and bias is its own axis
  • Why not just jump straight to the global minimum? That is hard to do in high-dimensional spaces; there are far too many possibilities

The Gradient

  • The gradient is the direction of steepest ascent of the loss, so we push the weight in the opposite direction

w = w - α · ∂Loss/∂w   (the gradient, also called a derivative; subtraction gives us descent)
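Here is a minimal sketch of that update rule in Python, run on a one-dimensional loss L(w) = (w - 2)² (the loss function and starting point are made up for illustration):

alpha = 0.1   # learning rate
w = 10.0      # arbitrary starting weight

for step in range(50):
    grad = 2 * (w - 2)    # dL/dw, the gradient of L(w) = (w - 2)^2
    w = w - alpha * grad  # subtract: move in the direction of steepest descent

print(w)  # ends up very close to 2, the minimum of the loss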

The Gradient as Slope

The Learning Rate

  • α is the learning rate, and is generally a small positive value
  • It scales how big a step we take (see the sketch after this list)
    • Large α = big step
    • Small α = small step
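A small sketch of the effect of α on the same kind of one-dimensional loss L(w) = (w - 2)² (values chosen only for illustration):

def descend(alpha, w=10.0, steps=20):
    # Gradient descent on L(w) = (w - 2)^2 with the given learning rate
    for _ in range(steps):
        w = w - alpha * 2 * (w - 2)
    return w

print(descend(alpha=0.01))  # tiny steps: after 20 steps, still far from the minimum at 2
print(descend(alpha=0.1))   # moderate steps: close to 2
print(descend(alpha=1.1))   # steps too large: each update overshoots and w blows up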

Optimizing the Learning Rate

  • Getting the learning rate right is one of the most important parts of Neural Network training!

Calculating One Neural Network Iteration

[Network diagram: inputs n1 = 3, n2 = -2; hidden neurons n3 = -3, n4 = 8, n5 = 13; output n6 = 5. Weights: W13 = 1, W23 = 3, W14 = 4, W24 = 2, W15 = 3, W25 = -2, W36 = -1, W46 = -3, W56 = 2]

Linear activation function (y = x) and no biases; n1 = 3, n2 = -2, target y = 9, α = 0.1

Goal: Update W36

W36 = W36 - α · ∂E/∂W36

(E is the loss; it doesn't need to be this particular loss function!)

E = ½(n6 - y)²

n6 = W36·n3 + W46·n4 + W56·n5

∂E/∂W36 = ∂E/∂n6 · ∂n6/∂W36   ← the MOST IMPORTANT THING (the chain rule!)
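Spelling out the two factors using the definitions above:

∂E/∂n6 = n6 - y (derivative of ½(n6 - y)² with respect to n6)
∂n6/∂W36 = n3 (n6 is linear in W36, with coefficient n3)

So ∂E/∂W36 = (n6 - y) * n3.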

∂E/∂W36 = (n6 - y) * n3 = -4 * -3 = 12

W36 = W36 - α · ∂E/∂W36 = -1 - 0.1(12) = -1 - 1.2 = -2.2
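A quick numeric check of this computation in Python (a sketch based on the numbers above, not code from the slides):

# Forward pass with the given weights, linear activations, and no biases
n1, n2, y, alpha = 3, -2, 9, 0.1
W13, W23, W14, W24, W15, W25 = 1, 3, 4, 2, 3, -2
W36, W46, W56 = -1, -3, 2

n3 = W13 * n1 + W23 * n2             # -3
n4 = W14 * n1 + W24 * n2             # 8
n5 = W15 * n1 + W25 * n2             # 13
n6 = W36 * n3 + W46 * n4 + W56 * n5  # 5

dE_dW36 = (n6 - y) * n3              # (5 - 9) * (-3) = 12
W36_new = W36 - alpha * dE_dW36      # -1 - 0.1 * 12 = -2.2
print(dE_dW36, W36_new)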

Goal: Update W13

W13 = W13 - α · ∂E/∂W13

∂E/∂W13 = ∂E/∂n6 · ∂n6/∂n3 · ∂n3/∂W13   ← the MOST IMPORTANT THING again (the chain rule, now passing through n3)

One more equation is needed: n3 = W13·n1 + W23·n2
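Reading each factor off the equations above:

∂E/∂n6 = n6 - y
∂n6/∂n3 = W36
∂n3/∂W13 = n1

So ∂E/∂W13 = (n6 - y) * W36 * n1.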

∂E/∂W13 = (n6 - y) * W36 * n1 = (-4) * -1 * 3 = 12

W13 = W13 - α · ∂E/∂W13 = 1 - 0.1 * 12 = -0.2
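One way to sanity-check a hand-computed gradient is a finite-difference check: nudge W13 slightly and see how the loss changes. This is a sketch (not from the slides) using the same numbers:

def loss(W13, W23=3, W14=4, W24=2, W15=3, W25=-2, W36=-1, W46=-3, W56=2,
         n1=3, n2=-2, y=9):
    # Forward pass of the 2-3-1 linear network, then the squared-error loss
    n3 = W13 * n1 + W23 * n2
    n4 = W14 * n1 + W24 * n2
    n5 = W15 * n1 + W25 * n2
    n6 = W36 * n3 + W46 * n4 + W56 * n5
    return 0.5 * (n6 - y) ** 2

eps = 1e-6
numeric_grad = (loss(1 + eps) - loss(1 - eps)) / (2 * eps)
print(numeric_grad)  # approximately 12, matching the chain-rule answer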

NN After One Iteration

[Network diagram with the two updated weights: W13 = -0.2, W36 = -2.2; all other weights unchanged (W23 = 3, W14 = 4, W24 = 2, W15 = 3, W25 = -2, W46 = -3, W56 = 2). Linear activation function (y = x) and no biases; n1 = 3, n2 = -2, y = 9, α = 0.1]

Code (Please don’t worry about this too much)
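Not the original code from this slide; a minimal NumPy sketch of running the same kind of gradient-descent update on every weight of the 2-3-1 linear network from the worked example:

import numpy as np

# 2-3-1 network with linear activations and no biases, as in the worked example
W_in = np.array([[1.0, 4.0, 3.0],    # weights from n1 to n3, n4, n5
                 [3.0, 2.0, -2.0]])  # weights from n2 to n3, n4, n5
W_out = np.array([-1.0, -3.0, 2.0])  # weights from n3, n4, n5 to n6

x = np.array([3.0, -2.0])  # inputs n1, n2
y = 9.0                    # target
alpha = 0.001              # gentler than the slides' 0.1, since every weight moves at once

for step in range(200):
    hidden = x @ W_in                   # n3, n4, n5
    out = hidden @ W_out                # n6
    error = out - y                     # dE/dn6 for E = 0.5 * (n6 - y)^2

    grad_W_out = error * hidden             # dE/dW36, dE/dW46, dE/dW56
    grad_W_in = np.outer(x, error * W_out)  # dE/dW13 ... dE/dW25 via the chain rule

    W_out -= alpha * grad_W_out         # gradient descent step
    W_in -= alpha * grad_W_in

print(x @ W_in @ W_out)  # the output approaches the target of 9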

Check these videos out when you get a chance!
