Neural Networks II: Forward Propagation with Matrices + Intro to Calc

TJ Machine Learning


Credits

  • All uncited images are either from Wikipedia or from TJML lectures on neural networks
  • Additionally, the examples presented in this lecture are from past TJML lectures, drafted and edited by Nikhil Sardana and Vinay Bhaip


Review of the Perceptron

The formal equations that represent the perceptron (shown here with two inputs) are:

f(x) = w_1·x_1 + w_2·x_2 + b

output = 1 if f(x) > 0, otherwise 0   (the step function)

Note that there would be more terms in f(x) if we had more inputs to the perceptron; b is the bias term.
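
As a quick worked example (with made-up numbers, not from the original slide): if w_1 = 0.5, w_2 = -1, b = 1, and the input is x = (4, 1), then f(x) = 0.5·4 + (-1)·1 + 1 = 2, and since 2 > 0 the step function outputs 1.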


The Neuron

  • Inputs come in on the left and are multiplied by their corresponding weights (one weight for each input)
  • A bias is added to the sum (this allows us to learn more complex functions)
  • A nonlinearity function is applied to the resulting sum (this also allows us to learn more complex functions); a short sketch of the full computation follows below
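
A minimal sketch of this computation in Python (the weights, bias, and inputs below are made-up illustrative numbers, and sigmoid is just one possible nonlinearity):

    import numpy as np

    def neuron_forward(x, w, b):
        # weighted sum of the inputs, plus the bias, then a sigmoid nonlinearity
        z = np.dot(w, x) + b
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0])        # two inputs
    w = np.array([0.5, -0.3])       # one weight per input
    b = 0.1                         # bias
    print(neuron_forward(x, w, b))  # sigmoid(0.0) = 0.5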


Nonlinearity functions

  • There are many possible nonlinearity functions. Here are a few:
  • As mentioned on the last slide, nonlinearity functions allow networks to learn nonlinear mappings from input to output; without them, any stack of layers reduces to a single linear computation (effectively one perceptron)

[Plots of the sigmoid, ReLU, and tanh functions; definitions are sketched below]
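
A minimal sketch (my own NumPy definitions, not taken from the slide) of the three nonlinearities named above:

    import numpy as np

    def sigmoid(z):
        # squashes any real number into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        # keeps positive values, zeroes out negative values
        return np.maximum(0.0, z)

    def tanh(z):
        # squashes any real number into the range (-1, 1)
        return np.tanh(z)

    print(sigmoid(0.0), relu(-2.0), tanh(0.0))  # 0.5 0.0 0.0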


Neural Networks

  • Hook multiple neurons up to form neural networks!
  • Can be used for both regression and classification tasks
  • Make them more powerful by adding more layers or adding more neurons in each layer

From 3Blue1Brown


Forward Propagation

Formalizing the process we outlined above, forward propagation for the small example network pictured on the slide computes each neuron's output as the nonlinearity applied to the weighted sum of the previous layer's outputs plus that neuron's bias, i.e. a = f(w_1·a_1 + w_2·a_2 + ... + b).
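
A minimal sketch of this neuron-by-neuron computation for one layer (hypothetical weights and inputs, with a sigmoid nonlinearity assumed):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0])            # outputs of the previous layer
    W = np.array([[0.5, -0.3],          # one row of weights per neuron in this layer
                  [0.8,  0.1]])
    b = np.array([0.1, -0.2])           # one bias per neuron in this layer

    # compute each neuron's output one at a time
    layer_output = np.array([sigmoid(np.dot(W[j], x) + b[j]) for j in range(len(b))])
    print(layer_output)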


Multiplying Matrices

  • Matrices are a convenient way to order numbers for neural network computation
  • Organizing our computations as matrix operations helps us speed up computation using GPUs
  • Matrices to be multiplied must have compatible dimensions: if matrix 1 has dimensions (a x b) and matrix 2 has dimensions (b x c), then the product has dimensions (a x c) (see the shape check below)

From mathisfun.com
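
A quick shape check of the dimension rule above (made-up matrices, using NumPy):

    import numpy as np

    A = np.ones((2, 3))   # matrix 1: (a x b) = (2 x 3)
    B = np.ones((3, 4))   # matrix 2: (b x c) = (3 x 4)
    C = A @ B             # valid because the inner dimensions match (b = 3)
    print(C.shape)        # (2, 4), i.e. (a x c)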


Vectorized Forward Propagation

We can express forward propagation in terms of matrices and see massive speedups in our computation

In the weight matrix for a layer, each row corresponds to a neuron in the current layer and each column corresponds to a neuron in the previous layer.

More generally, each layer's output is a_current = f(W · a_previous + b), where W is the layer's weight matrix, a_previous is the vector of the previous layer's outputs, b is the vector of biases, and f is the nonlinearity.
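
A minimal vectorized sketch (a hypothetical 2-3-1 network with made-up weights; sigmoid assumed as the nonlinearity):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0])            # input vector

    W1 = np.array([[0.5, -0.3],         # hidden layer: 3 neurons, each with 2 weights
                   [0.8,  0.1],
                   [-0.4, 0.7]])
    b1 = np.array([0.1, -0.2, 0.0])

    W2 = np.array([[0.3, -0.6, 0.9]])   # output layer: 1 neuron with 3 weights
    b2 = np.array([0.05])

    a1 = sigmoid(W1 @ x + b1)           # hidden layer activations, one matrix multiply
    a2 = sigmoid(W2 @ a1 + b2)          # network output
    print(a2)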


Error

There are a variety of error functions we can use to quantify our performance on some task:

  • Popular loss functions include:
    • Mean Squared Error (for regression tasks, like stock prediction)
    • Binary Cross Entropy (for binary classification tasks, like predicting whether or not a brain in an fMRI scan has Alzheimer's)
  • Now that we have ways to quantify how badly our network is doing, our goal should be to modify the network so it becomes “less bad”
    • To do this, we perform gradient descent to minimize the error

Binary Cross Entropy: BCE = -(1/n) Σ [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]   (From Towards Data Science)

Mean Squared Error: MSE = (1/n) Σ (y_i - ŷ_i)^2

where y_i is the true label/value and ŷ_i is the network's prediction.
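
A minimal sketch of both losses (made-up targets and predictions):

    import numpy as np

    def mse(y_true, y_pred):
        # mean squared error: average squared difference
        return np.mean((y_true - y_pred) ** 2)

    def binary_cross_entropy(y_true, y_pred):
        # binary cross entropy, assuming predictions are strictly between 0 and 1
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1.0, 0.0, 1.0])
    y_pred = np.array([0.9, 0.2, 0.7])
    print(mse(y_true, y_pred), binary_cross_entropy(y_true, y_pred))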


Derivatives

  • Derivatives specify the rate of change of some function with respect to some variable
    • Essentially, if I change x by some quantity, how much will y change?
  • Derivatives can also be extended to multivariable functions (where output is dependent on more than one input, as is the case with neural networks)
    • We calculate partial derivatives by changing one variable while holding all other variables constant

Examples: y = 2x + 3 gives dy/dx = 2; y = x^2 gives dy/dx = 2x.
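
A small numerical check of the second example (my own sketch, using a finite-difference approximation to the derivative):

    def numerical_derivative(f, x, h=1e-6):
        # approximate rate of change of f at x
        return (f(x + h) - f(x - h)) / (2 * h)

    f = lambda x: x ** 2
    print(numerical_derivative(f, 3.0))  # approximately 6.0, matching dy/dx = 2x at x = 3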


Partial Derivatives

  • The idea of a derivative can be extended to multiple dimensions
  • With partial derivatives, we see the change in output as we change one input variable while holding all other input variables constant
  • Can be visualized in 3D as taking a slice of the graph and measuring the rate of change

From Khan Academy
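
As a quick illustration (my own example, not from the slide): for f(x, y) = x^2·y, holding y constant gives the partial derivative ∂f/∂x = 2xy, while holding x constant gives ∂f/∂y = x^2.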


Gradients

  • A gradient is the collection of the partial derivatives of a function with respect to every variable it depends on
  • The gradient vector specifies the direction of steepest ascent
  • If we view our error function as a function of all the weights and biases of our network, the direction specified by the negative of our gradient tells us the direction to change our weights and biases to decrease our error

From Khan Academy

From faculty.etsu.edu
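
For example (a made-up error function, not from the slide): if E(x, y) = x^2 + y^2, the gradient is ∇E = (2x, 2y). At the point (1, 2) the gradient is (2, 4), so stepping in the direction of the negative gradient, (-2, -4), decreases E most quickly.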


Gradient Descent

  • Repeatedly compute the gradient of the error function (with respect to the weights and biases) using backpropagation
    • Move in the direction opposite to the gradient (the gradient itself points in the direction of steepest ascent)
    • Stop when magnitude of gradient reaches some low threshold (signifying convergence)
  • Disadvantage: Is not guaranteed to find global minimum (might get stuck in a local min)

Why not just find the global minimum? → Hard to do with the complexity of large multivariable functions like neural networks
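
A minimal sketch of the gradient descent loop above (a toy one-variable function, not a neural network):

    def gradient_descent(grad, x0, lr=0.1, tol=1e-6, max_steps=10_000):
        x = x0
        for _ in range(max_steps):
            g = grad(x)
            if abs(g) < tol:      # gradient magnitude below threshold: treat as converged
                break
            x -= lr * g           # step opposite the gradient
        return x

    # minimize E(x) = (x - 3)^2, whose gradient is 2(x - 3)
    print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # approximately 3.0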


Backpropagation

  • The output of the neural network is the result of a composite function of all the weights and biases of the network
  • Repeatedly use the chain rule to find partial derivatives of the error with respect to the weights and biases of our network
    • Once we have the gradient, we adjust the values of all the weights and biases in the direction opposite to the gradient
    • This effectively takes a step in the direction that most quickly decreases the error

From 3Blue1Brown
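
A minimal sketch of this chain-rule computation for a single sigmoid neuron with a squared-error loss (made-up data; a full network repeats these steps layer by layer):

    import numpy as np

    x = np.array([1.0, 2.0])        # inputs
    w = np.array([0.5, -0.3])       # weights
    b = 0.1                         # bias
    y = 1.0                         # target output

    # forward pass
    z = np.dot(w, x) + b            # weighted sum plus bias
    a = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation
    error = (a - y) ** 2            # squared error

    # backward pass: chain rule, one factor per step of the forward pass
    dE_da = 2 * (a - y)             # d(error)/d(activation)
    da_dz = a * (1 - a)             # d(sigmoid)/d(weighted sum)
    dE_dz = dE_da * da_dz
    dE_dw = dE_dz * x               # gradient with respect to each weight
    dE_db = dE_dz                   # gradient with respect to the bias

    # one gradient descent step (learning rate is an arbitrary choice here)
    lr = 0.1
    w = w - lr * dE_dw
    b = b - lr * dE_db
    print(error, dE_dw, dE_db)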
