Neural Networks II: Forward Propagation with Matrices + Intro to Calc

TJ Machine Learning


Credits

  • All uncited images are either from Wikipedia or from TJML lectures on neural networks
  • Additionally, the examples presented in this lecture are from past TJML lectures, drafted and edited by Nikhil Sardana and Vinay Bhaip


Review of the Perceptron

The formal equations that represent the perceptron (shown here with two inputs) are:

f(x) = w_1·x_1 + w_2·x_2 + b

output = 1 if f(x) > 0, otherwise 0   (the step function)

Note that there would be more terms in f(x) if we had more inputs to the perceptron; b is the bias term.
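
As a quick worked example (with made-up numbers, not from the original slide): if w_1 = 0.5, w_2 = -1, b = 1, and the input is x = (4, 1), then f(x) = 0.5·4 + (-1)·1 + 1 = 2, and since 2 > 0 the step function outputs 1.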


The Neuron

  • Inputs come in on the left and are multiplied by their corresponding weights (one weight for each input)
  • A bias is added to the sum (this allows us to learn more complex functions)
  • A nonlinearity function is applied to the resulting sum (this also allows us to learn more complex functions); a short sketch of the full computation follows below
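
A minimal sketch of this computation in Python (the weights, bias, and inputs below are made-up illustrative numbers, and sigmoid is just one possible nonlinearity):

    import numpy as np

    def neuron_forward(x, w, b):
        # weighted sum of the inputs, plus the bias, then a sigmoid nonlinearity
        z = np.dot(w, x) + b
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0])        # two inputs
    w = np.array([0.5, -0.3])       # one weight per input
    b = 0.1                         # bias
    print(neuron_forward(x, w, b))  # sigmoid(0.0) = 0.5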


Nonlinearity functions

  • There are many possible nonlinearity functions. Here are a few:
  • As mentioned on the last slide, nonlinearity functions allow networks to learn nonlinear mappings from input to output; without them, any stack of layers reduces to a single linear computation (effectively one perceptron)

[Plots of the sigmoid, ReLU, and tanh functions; definitions are sketched below]
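
A minimal sketch (my own NumPy definitions, not taken from the slide) of the three nonlinearities named above:

    import numpy as np

    def sigmoid(z):
        # squashes any real number into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        # keeps positive values, zeroes out negative values
        return np.maximum(0.0, z)

    def tanh(z):
        # squashes any real number into the range (-1, 1)
        return np.tanh(z)

    print(sigmoid(0.0), relu(-2.0), tanh(0.0))  # 0.5 0.0 0.0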


Neural Networks

  • Hook multiple neurons up to form neural networks!
  • Can be used for both regression and classification tasks
  • Make them more powerful by adding more layers or adding more neurons in each layer

From 3Blue1Brown


Forward Propagation

Formalizing the process we outlined above, forward propagation for the small example network pictured on the slide computes each neuron's output as the nonlinearity applied to the weighted sum of the previous layer's outputs plus that neuron's bias, i.e. a = f(w_1·a_1 + w_2·a_2 + ... + b).
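
A minimal sketch of this neuron-by-neuron computation for one layer (hypothetical weights and inputs, with a sigmoid nonlinearity assumed):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0])            # outputs of the previous layer
    W = np.array([[0.5, -0.3],          # one row of weights per neuron in this layer
                  [0.8,  0.1]])
    b = np.array([0.1, -0.2])           # one bias per neuron in this layer

    # compute each neuron's output one at a time
    layer_output = np.array([sigmoid(np.dot(W[j], x) + b[j]) for j in range(len(b))])
    print(layer_output)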


Multiplying Matrices

  • Matrices are a convenient way to order numbers for neural network computation
  • Organizing our computations as matrix operations helps us speed up computation using GPUs
  • Matrices to be multiplied must have compatible dimensions: if matrix 1 has dimensions (a x b) and matrix 2 has dimensions (b x c), then the product has dimensions (a x c) (see the shape check below)

From mathisfun.com
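
A quick shape check of the dimension rule above (made-up matrices, using NumPy):

    import numpy as np

    A = np.ones((2, 3))   # matrix 1: (a x b) = (2 x 3)
    B = np.ones((3, 4))   # matrix 2: (b x c) = (3 x 4)
    C = A @ B             # valid because the inner dimensions match (b = 3)
    print(C.shape)        # (2, 4), i.e. (a x c)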


Vectorized Forward Propagation

We can express forward propagation in terms of matrices and see massive speedups in our computation

In the weight matrix for a layer, each row corresponds to a neuron in the current layer and each column corresponds to a neuron in the previous layer.

More generally, each layer's output is a_current = f(W · a_previous + b), where W is the layer's weight matrix, a_previous is the vector of the previous layer's outputs, b is the vector of biases, and f is the nonlinearity.
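
A minimal vectorized sketch (a hypothetical 2-3-1 network with made-up weights; sigmoid assumed as the nonlinearity):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0])            # input vector

    W1 = np.array([[0.5, -0.3],         # hidden layer: 3 neurons, each with 2 weights
                   [0.8,  0.1],
                   [-0.4, 0.7]])
    b1 = np.array([0.1, -0.2, 0.0])

    W2 = np.array([[0.3, -0.6, 0.9]])   # output layer: 1 neuron with 3 weights
    b2 = np.array([0.05])

    a1 = sigmoid(W1 @ x + b1)           # hidden layer activations, one matrix multiply
    a2 = sigmoid(W2 @ a1 + b2)          # network output
    print(a2)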


Error

There are a variety of error functions we can use to quantify our performance on some task:

  • Popular loss functions include:
    • Mean Squared Error (for regression tasks, like stock prediction)
    • Binary Cross Entropy (for binary classification tasks, like predicting whether or not a brain in an fMRI scan has Alzheimer's)
  • Now that we have ways to quantify how badly our network is doing, our goal should be to modify the network so it becomes “less bad”
    • To do this, we perform gradient descent to minimize the error

Binary Cross Entropy: BCE = -(1/n) Σ [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]   (From Towards Data Science)

Mean Squared Error: MSE = (1/n) Σ (y_i - ŷ_i)^2

where y_i is the true label/value and ŷ_i is the network's prediction.
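
A minimal sketch of both losses (made-up targets and predictions):

    import numpy as np

    def mse(y_true, y_pred):
        # mean squared error: average squared difference
        return np.mean((y_true - y_pred) ** 2)

    def binary_cross_entropy(y_true, y_pred):
        # binary cross entropy, assuming predictions are strictly between 0 and 1
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1.0, 0.0, 1.0])
    y_pred = np.array([0.9, 0.2, 0.7])
    print(mse(y_true, y_pred), binary_cross_entropy(y_true, y_pred))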


Derivatives

  • Derivatives specify the rate of change of some function with respect to some variable
    • Essentially, if I change x by some quantity, how much will y change?
  • Derivatives can also be extended to multivariable functions (where output is dependent on more than one input, as is the case with neural networks)
    • We calculate partial derivatives by changing one variable while holding all other variables constant

Examples: y = 2x + 3 gives dy/dx = 2; y = x^2 gives dy/dx = 2x.
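
A small numerical check of the second example (my own sketch, using a finite-difference approximation to the derivative):

    def numerical_derivative(f, x, h=1e-6):
        # approximate rate of change of f at x
        return (f(x + h) - f(x - h)) / (2 * h)

    f = lambda x: x ** 2
    print(numerical_derivative(f, 3.0))  # approximately 6.0, matching dy/dx = 2x at x = 3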


Partial Derivatives

  • The idea of a derivative can be extended to multiple dimensions
  • With partial derivatives, we see the change in output as we change one input variable while holding all other input variables constant
  • Can be visualized in 3D as taking a slice of the graph and measuring the rate of change

From Khan Academy
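
As a quick illustration (my own example, not from the slide): for f(x, y) = x^2·y, holding y constant gives the partial derivative ∂f/∂x = 2xy, while holding x constant gives ∂f/∂y = x^2.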


Gradients

  • A gradient is the collection of the partial derivatives of a function with respect to every variable it depends on
  • The gradient vector specifies the direction of steepest ascent
  • If we view our error function as a function of all the weights and biases of our network, the direction specified by the negative of our gradient tells us the direction to change our weights and biases to decrease our error

From Khan Academy

From faculty.etsu.edu
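
For example (a made-up error function, not from the slide): if E(x, y) = x^2 + y^2, the gradient is ∇E = (2x, 2y). At the point (1, 2) the gradient is (2, 4), so stepping in the direction of the negative gradient, (-2, -4), decreases E most quickly.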


Gradient Descent

  • Repeatedly compute the gradient of the error function (with respect to the weights and biases) using backpropagation
    • Move in the direction opposite to the gradient (the gradient itself points in the direction of steepest ascent)
    • Stop when magnitude of gradient reaches some low threshold (signifying convergence)
  • Disadvantage: Is not guaranteed to find global minimum (might get stuck in a local min)

Why not just find the global minimum? → Hard to do with the complexity of large multivariable functions like neural networks
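
A minimal sketch of the gradient descent loop above (a toy one-variable function, not a neural network):

    def gradient_descent(grad, x0, lr=0.1, tol=1e-6, max_steps=10_000):
        x = x0
        for _ in range(max_steps):
            g = grad(x)
            if abs(g) < tol:      # gradient magnitude below threshold: treat as converged
                break
            x -= lr * g           # step opposite the gradient
        return x

    # minimize E(x) = (x - 3)^2, whose gradient is 2(x - 3)
    print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # approximately 3.0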


Backpropagation

  • The output of the neural network is the result of a composite function of all the weights and biases of the network
  • Repeatedly use the chain rule to find partial derivatives of the error with respect to the weights and biases of our network
    • Once we have the gradient, we adjust the values of all the weights and biases in the direction opposite to the gradient
    • This effectively takes a step in the direction that most quickly decreases the error

From 3Blue1Brown
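
A minimal sketch of this chain-rule computation for a single sigmoid neuron with a squared-error loss (made-up data; a full network repeats these steps layer by layer):

    import numpy as np

    x = np.array([1.0, 2.0])        # inputs
    w = np.array([0.5, -0.3])       # weights
    b = 0.1                         # bias
    y = 1.0                         # target output

    # forward pass
    z = np.dot(w, x) + b            # weighted sum plus bias
    a = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation
    error = (a - y) ** 2            # squared error

    # backward pass: chain rule, one factor per step of the forward pass
    dE_da = 2 * (a - y)             # d(error)/d(activation)
    da_dz = a * (1 - a)             # d(sigmoid)/d(weighted sum)
    dE_dz = dE_da * da_dz
    dE_dw = dE_dz * x               # gradient with respect to each weight
    dE_db = dE_dz                   # gradient with respect to the bias

    # one gradient descent step (learning rate is an arbitrary choice here)
    lr = 0.1
    w = w - lr * dE_dw
    b = b - lr * dE_db
    print(error, dE_dw, dE_db)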
