1 of 55

Neural Networks Training

  • Fardina Fathmiul Alam

CMSC 320 - Intro to Data Science

2 of 55

How Neural Networks Learn

(Summary)

Training a Neural Network involves adjusting its weights and biases to minimize the error (or loss) between the predicted output and the actual target values. This process is achieved through Backpropagation and Gradient Descent.

3 of 55

Step: Forward Propagation

4 of 55

Step 1: FeedForward Pass

Step 1.1 : Forward Pass: The input data is passed forward through the network, producing an output prediction.

  • Each layer computes a weighted sum of inputs and applies an activation function.
  • Outputs are generated layer by layer until the final prediction is obtained.
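The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference implementation: the function names `sigmoid` and `layer_forward`, the choice of sigmoid as the activation, and the numbers in the example layer are all assumptions made for the sketch.

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Compute one layer's outputs.

    weights[k] holds the incoming weights of neuron k, biases[k] its bias.
    Each neuron computes activation(weighted sum of inputs + bias).
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        net = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(net))
    return outputs

# Hypothetical layer with 2 inputs and 2 neurons (illustrative numbers only).
hidden = layer_forward([1.0, 2.0], [[0.5, -0.5], [0.0, 0.0]], [0.5, 0.0])
print(hidden)
```

To run a multi-layer network, the output list of one call would be passed as the `inputs` of the next call.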

5 of 55

Step: Compute Loss

6 of 55

Compute Loss / Error

Step 1.2 : Calculating Loss / Error: Quantifies the difference between predicted and actual values.

  • Compare the predicted output to the expected target values to compute the error or loss, typically using a loss function such as Mean Square Error (MSE).

After the feedforward pass, the neural network produces a predicted output ŷ. The true label is y.
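As a small illustration of the loss step, here is a sketch of Mean Squared Error in plain Python. The function name `mean_squared_error` and the sample predictions and targets are assumptions chosen only for the example.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and targets."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions (y-hat) and true labels (y).
loss = mean_squared_error([0.5, 1.0], [0.0, 1.0])
print(loss)
```

A loss of 0 would mean every prediction matches its target exactly; larger values mean larger errors.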

7 of 55

Step: Backpropagation and Gradient Descent

8 of 55

Neural Network Training: Backpropagation

Backpropagation is an algorithm used to train neural networks by minimizing the error (loss) through gradient descent.

  • A supervised learning technique that helps the network minimize its error.
  • GOAL: adjust the weights of the network to reduce the error between the predicted output and the target.
    • Why? So that the actual output moves closer to the target output.

9 of 55

Backpropagation & Gradient Descent in Neural Networks

Backpropagation:

  • The process of calculating gradients of the loss function with respect to the weights and biases in the network.
  • Evaluates how much each weight and bias contributes to the error (using the chain rule to propagate the error backward through the network layers)

The gradients tell us how much to adjust each weight and bias to reduce the loss.

A gradient is the derivative of the loss function with respect to the model parameters (weights and biases). It tells us how the loss will change when the parameters are adjusted.

Backpropagation along with Gradient Descent forms the backbone and powerhouse of neural networks.

10 of 55

Backpropagation & Gradient Descent in Neural Networks

Gradient Descent Algorithm:

  • Once gradients are computed by backpropagation, Gradient Descent uses them to update weights and biases in the opposite direction (negative gradient) to minimize the loss function.
  • Iteratively adjusts parameters towards optimal values, using a learning rate to control the size of updates.
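The update rule can be tried on a toy problem. The sketch below assumes a hypothetical one-parameter loss, L(w) = (w − 3)², whose gradient is known in closed form; in a real network the gradient would come from backpropagation instead. The starting value, learning rate, and step count are arbitrary choices for the illustration.

```python
def loss(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the loss above with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting value
learning_rate = 0.1   # controls the size of each update

for step in range(100):
    # Move against the gradient to reduce the loss.
    w = w - learning_rate * gradient(w)

print(w, loss(w))
```

With this learning rate the distance to the minimum shrinks by a constant factor each step; a learning rate that is too large (here, above 1.0) would make the updates overshoot and diverge.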


11 of 55

Backpropagation Key Idea

Backpropagation is the algorithm by which neural networks are trained.

Gradient of the loss function indicates how much each parameter contributed to the prediction error.

The main idea : For every training example, we compute the loss function, and then iteratively update the weights of the network by calculating the gradient of the error (loss) function with respect to each weight in the network, using the chain rule of calculus. This allows the neural network to learn by progressively reducing the error during the training process.

12 of 55

RECAP: Key Steps in Neural Network: Summary

  1. Forward Pass/Propagation
    • Input data is passed through the network layers.
    • The output is compared to the target to compute the loss.
  2. Backward Pass (Error Propagation)
    • The error is propagated backward through the network.
    • Gradients of the loss with respect to each weight are computed using the chain rule.
  3. Weight Update
    • Weights are updated using gradient descent: W=W−η⋅(∂L/∂W) where Learning rate (η) controls the size of weight updates.
  4. Repeat
    • This process is repeated over multiple epochs until the error is minimized.
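The four steps above can be put together for the smallest possible network: one sigmoid neuron with one input. Everything numeric in this sketch (the training example, the starting parameters, the learning rate, the number of epochs) is an assumption made for illustration, as is the use of the squared-error loss 1/2 (out − target)².

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical single training example and starting parameters.
x, target = 0.5, 0.8
w, b = 0.0, 0.0
learning_rate = 0.5   # eta in the update rule
losses = []

for epoch in range(1000):
    # 1. Forward pass: weighted sum, activation, loss.
    net = w * x + b
    out = sigmoid(net)
    losses.append(0.5 * (out - target) ** 2)

    # 2. Backward pass: chain rule from the loss back to each parameter.
    dloss_dout = out - target            # derivative of 1/2 (out - target)^2
    dout_dnet = out * (1.0 - out)        # derivative of the sigmoid
    dloss_dw = dloss_dout * dout_dnet * x
    dloss_db = dloss_dout * dout_dnet * 1.0

    # 3. Weight update: W = W - eta * dL/dW (same rule for the bias).
    w -= learning_rate * dloss_dw
    b -= learning_rate * dloss_db
    # 4. Repeat for the next epoch.

print(losses[0], losses[-1])
```

The recorded losses shrink over the epochs, which is the "learning" the recap describes.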


13 of 55

Backpropagation: Backward Pass

Step 2: Backward Pass (Backpropagation):

Once we calculate the loss (after completing forward pass):

  • Propagate the Error Backwards: The error is sent back through the network, starting from the output layer to the input layer.

14 of 55

Backpropagation: Backward Pass

Step 2: Backward Pass (Backpropagation):

  • Calculate the Contribution of Each Weight: Using the chain rule, we figure out how much each weight contributed to the error.
    • We do this by computing the derivative (a fancy way of saying "rate of change") of the error with respect to each weight.
  • Adjustments to the network's internal parameters (weights and biases): Once we know this, we adjust the weights to make the error smaller next time.

15 of 55

Backpropagation: Computing Derivative/Gradients

Say we have the following loss function:

What we would like to know is ∂L/∂w.

For all the different weights in the network, weights with a large ∂L/∂w (in magnitude) contributed strongly to the incorrect classification, meaning they had a significant impact on the error; conversely, weights with a small ∂L/∂w had less influence on the incorrect classification.

∂L/∂w represents the partial derivative of the loss function (L) w.r.t. the weights (w) of the neural network. This derivative indicates the rate of change of the loss with respect to a particular weight.

16 of 55

Backpropagation: Use Chain Rule to find Derivative

is actually

Using the chain rule, we can compute ∂L/∂w for every node.

Once we do, we can say:

By applying the chain rule, we can calculate the influence or impact of each node on the final outcome. This helps us understand the role of each node in the network.

*** The chain rule tells us how to find the derivative of a composite function (one function nested inside another).

17 of 55

18 of 55

19 of 55

Next: Backpropagation: a simple example

20 of 55

Backpropagation: a simple example

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

(for simplicity, assume all weights are 1 and the bias is 0)

21 of 55

Backpropagation: a simple example

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

  1. Forward Pass (Left to Right): Compute Output

q = x + y = -2 + 5 = 3

f = q.z = 3 * -4 = -12

22 of 55

Backpropagation: a simple example

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

  1. Forward Pass (Left to Right): Compute Output

q = x + y = -2 + 5 = 3

f = q.z = 3 * -4 = -12

  2. Backward Pass (Right to Left): Compute Derivatives

(of the output w.r.t. each of the inputs x, y, z)

23 of 55

Backpropagation: Compute:

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

Backward Pass: Compute Derivatives

Start with the base case

24 of 55

Backpropagation: Compute:

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

Backward Pass: Compute Derivatives

Start with the base case

∂f/∂z: we know f = qz, so ∂f/∂z = q = 3

25 of 55

Backpropagation: Compute:

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

Backward Pass: Compute Derivatives

∂f/∂q: we know f = qz, so ∂f/∂q = z = -4

26 of 55

Backpropagation: Compute:

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

Backward Pass: Compute Derivatives

Continue process to left

Here, y is not directly connected to the output f, so we need the chain rule to compute the derivative ∂f/∂y.

27 of 55

Backpropagation: Compute:

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

Backward Pass: Compute Derivatives

Here, the chain rule takes into account the influence of y on the intermediate variable q: ∂f/∂y = (∂f/∂q) ⋅ (∂q/∂y).

28 of 55

Chain Rule

If y = f(g(x)), then y' = f'(g(x)) ⋅ g'(x).

The chain rule states that the instantaneous rate of change of f relative to x equals the rate of change of f relative to g multiplied by the rate of change of g relative to x.
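The chain rule can be checked numerically. The sketch below uses two hypothetical functions, f(u) = u² and g(x) = 3x + 1, chosen only for illustration, and compares the chain-rule derivative of f(g(x)) with a finite-difference estimate.

```python
def g(x):
    return 3.0 * x + 1.0      # inner function, g'(x) = 3

def f(u):
    return u ** 2             # outer function, f'(u) = 2u

def chain_rule_derivative(x):
    """y' = f'(g(x)) * g'(x) for the two functions above."""
    return 2.0 * g(x) * 3.0

def numeric_derivative(x, h=1e-5):
    """Central finite difference of the composite f(g(x))."""
    return (f(g(x + h)) - f(g(x - h))) / (2.0 * h)

analytic = chain_rule_derivative(2.0)
numeric = numeric_derivative(2.0)
print(analytic, numeric)
```

The two values agree up to floating-point error, which is a useful sanity check when deriving gradients by hand.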

HELPING SLIDES

29 of 55

Backpropagation: Compute:

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

Backward Pass: Compute Derivatives

∂f/∂y = (∂q/∂y) ⋅ (∂f/∂q) = 1 * -4 = -4

30 of 55

Backpropagation: Compute:

Given, f(x,y,z) = (x+y).z

e.g. x = -2, y = 5, z = -4

Backward Pass: Compute Derivatives

∂f/∂x = (∂q/∂x) ⋅ (∂f/∂q) = 1 * -4 = -4
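The whole simple example, forward pass and backward pass, fits in a short Python sketch. The variable names (`df_dq`, `df_dx`, and so on) are naming choices made here, not part of the slides.

```python
# Inputs from the example: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# Forward pass (left to right): compute the output.
q = x + y            # intermediate node
f = q * z            # final output

# Backward pass (right to left): compute derivatives of f.
df_df = 1.0                  # base case: derivative of f with respect to itself
df_dz = q * df_df            # f = q * z, so df/dz = q
df_dq = z * df_df            # f = q * z, so df/dq = z
df_dx = 1.0 * df_dq          # chain rule: dq/dx = 1, times upstream df/dq
df_dy = 1.0 * df_dq          # chain rule: dq/dy = 1, times upstream df/dq

print(f, df_dx, df_dy, df_dz)
```

Each local derivative is multiplied by the gradient arriving from the node to its right, which is exactly how the error signal flows backward through a larger network.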

31 of 55

GIVEN

32 of 55

(Upstream)

Downstream gradients

33 of 55

Summary: Neural Networks

  • Multi-layer, feed-forward networks using backpropagation make decent classifiers.
  • They do especially well for unstructured data, like images
  • Somewhat finicky--learning rate can be hard to tune, and the number of hidden nodes can affect the output.
  • Next class, we'll learn about how they've been adapted to modern day!

34 of 55

The End

35 of 55

A Neural Network with Pytorch (NOT FOR EXAM)

36 of 55

More Complicated Example

Check next calculations by yourselves!

37 of 55

Example: given neural network

A neural network with two inputs, two hidden neurons, and two output neurons. A bias is included in the hidden and output neurons.

38 of 55

Example: given neural network

Here are the initial weights, the biases, and training inputs/outputs

For the rest of this example, we’re going to work with a single training example. Given:

  • Inputs: 0.05 and 0.10
  • Targets: we want the neural network to output 0.01 and 0.99.

39 of 55

Next: Feed Forward Pass:

Focus: What does the neural network currently predict, given the weights, the biases, and inputs of 0.05 and 0.10?

To find out, we’ll feed those inputs i1, i2 forward through the network.

40 of 55

Feed Forward Pass:

Next:

Step 1: Calculate the total net input to each hidden layer neuron.

Total net, x = Σ (j = 1 to n) i_j ⋅ w_j + b

Step 2: Squash the total net input using an activation function (e.g., the Sigmoid function):

output = σ(Total net, x), where σ(x) = 1 / (1 + e^(−x))

Repeat the process with the output layer neurons.

41 of 55

Feed Forward Pass:

Step 1: Calculate the total net input to each hidden layer neuron.

First hidden node h1:

Step 1: Calculate the total net input

neth1=(w1*i1)+(w2*i2)+b1

= (.15*0.05) +(.20*.10)+(0.35*1)

= 0.3775

Step 2: Apply Activation Function: squash it using the Sigmoid function to get the output of h1:

outh1 = 1 / (1 + e^(−0.3775)) = 0.593269992
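The two steps for hidden node h1 can be reproduced directly. This sketch uses only the values stated in the example (i1 = 0.05, i2 = 0.10, w1 = 0.15, w2 = 0.20, b1 = 0.35); the variable names are choices made here.

```python
import math

# Values given in the example.
i1, i2 = 0.05, 0.10
w1, w2 = 0.15, 0.20
b1 = 0.35

# Step 1: total net input to h1.
net_h1 = (w1 * i1) + (w2 * i2) + (b1 * 1)

# Step 2: squash with the sigmoid function.
out_h1 = 1.0 / (1.0 + math.exp(-net_h1))

print(net_h1, out_h1)
```

The printed output of h1 should agree with the value 0.593269992 used in the later slides, up to rounding.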

42 of 55

Feed Forward Pass:

Step 1: Calculate the total net input to each hidden layer neuron.

Second hidden node h2:

Step 1: Calculate the total net input

neth2=(w3*i1)+(w4*i2)+b1

Step 2: Apply Activation Function: squash it using the Sigmoid function to get the output of h2:

outh2=0.596884378

43 of 55

Feed Forward Pass:

Next: We repeat this process for the output layer neurons, using the output from the hidden layer neurons as inputs.

  1. The output for o1:

neto1=(w5*outh1)+(w6*outh2)+(b2*1)

= (.40*0.593269992)+(.45*0.596884378)+(0.60*1)

= 1.105905967

  2. Apply Activation Function:

outo1 = 1 / (1 + e^(−1.105905967)) = 0.75136507

44 of 55

Feed Forward Pass:

Next: We repeat this process for the output layer neurons, using the output from the hidden layer neurons as inputs.

  1. The output for o2:

neto2=(w7*outh1)+(w8*outh2)+(b2*1)

  2. Apply Activation Function:

outo2=0.772928465
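The complete feed-forward pass can be sketched as follows. The inputs, w1, w2, w5, w6, b1 and b2 are the values that appear in the calculations above. The values of w3, w4, w7 and w8 are not stated in the text, so the ones used here are assumptions, chosen because they reproduce the quoted outputs for h2 and o2; check them against the network diagram.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Values that appear in the worked calculations.
i1, i2 = 0.05, 0.10
w1, w2, b1 = 0.15, 0.20, 0.35
w5, w6, b2 = 0.40, 0.45, 0.60

# ASSUMPTION: these four weights are not given in the text.
w3, w4 = 0.25, 0.30
w7, w8 = 0.50, 0.55

# Hidden layer.
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)

# Output layer, using the hidden outputs as inputs.
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

print(out_h1, out_h2, out_o1, out_o2)
```

With these weights the network currently predicts roughly 0.751 and 0.773, far from the targets 0.01 and 0.99, which is what training is meant to fix.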

45 of 55

Next: Calculating the Total Error

We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:

Error = 1/2 (output/predicted − actual/target)²

The 1/2 is included so that the exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway (Recap: W=W−η⋅(∂L/∂W)), so it doesn’t matter that we introduce a constant here.

46 of 55

Next: Calculating the Total Error

  • The target output for o1 is 0.01
  • The neural network output outo1 is 0.75136507

So, error for o1 is:

Eo1 = 1/2 (output/predicted − actual/target)²

= 1/2 (0.75136507 − 0.01)²

= 1/2 * 0.549622167

= 0.274811083

47 of 55

Next: Calculating the Total Error

  • Repeating this process for o2 (remembering that the target is 0.99) we get:

Error for o2: Eo2 = 1/2 (0.772928465 − 0.99)² = 0.023560026

48 of 55

Next: Calculating the Total Error

  • Repeating this process for o2 (remembering that the target is 0.99) we get:

Error for o2: Eo2 = 0.023560026

Total Error, ETotal = Eo1 + Eo2

= 0.274811083 + 0.023560026

= 0.298371109
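The error calculation can be verified with a few lines of Python. The outputs and targets are the ones quoted in the example; the helper name `squared_error` is a choice made here.

```python
def squared_error(output, target):
    """Squared error with the 1/2 factor used in this example."""
    return 0.5 * (output - target) ** 2

# Network outputs and targets from the example.
e_o1 = squared_error(0.75136507, 0.01)
e_o2 = squared_error(0.772928465, 0.99)
e_total = e_o1 + e_o2

print(e_o1, e_o2, e_total)
```

The first output neuron accounts for most of the total error because its prediction is much further from its target.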

49 of 55

Recap: Backward Pass (Back Propagation)

Goal with backpropagation: Update each of the weights in the network

Why? So that the actual output moves closer to the target output.

How? Minimize the error for each output neuron and for the network as a whole:

  • The error is propagated backward through the network.
  • Using the chain rule, we calculate the contribution of each weight to the overall error.
  • The derivative of the error with respect to each weight is computed, starting with w5 to w8 and then w1 to w4, and the weights are adjusted to minimize the error.

50 of 55

Backpropagation

Next, we will describe how to calculate the contribution of weight w5 to the overall error (Etotal).

51 of 55

Backward Pass

Output Layer: Consider w5.

We want to know how much a change in w5 affects the total error ETotal:

∂ETotal/∂w5 = the partial derivative of ETotal with respect to w5

(we can also say the gradient with respect to w5)

52 of 55

By applying the chain rule, we know:

∂Etotal/∂w5 = (∂Etotal/∂outo1) ⋅ (∂outo1/∂neto1) ⋅ (∂neto1/∂w5)

In the backpropagation process, working backward means starting from the total error (Etotal) and tracing back to the weight w5.

  • For instance, to determine Etotal, we need to know the output o1.
  • This output o1 is influenced by the weighted sum neto1,
  • and one of the contributing weights to this neto1 calculation is w5.

Therefore, the chain of dependencies follows:

Etotal → o1 → neto1 → w5

We adjust w5 to minimize the total error.

We need to figure out each piece in this equation.
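Each piece of the chain Etotal → o1 → neto1 → w5 can be computed from the forward-pass numbers. The sketch below assumes the squared-error loss with the 1/2 factor and the sigmoid activation used in this example; the three local derivatives are standard results for those two functions, and the variable names are choices made here.

```python
# Values from the forward pass in this example.
target_o1 = 0.01
out_o1 = 0.75136507
out_h1 = 0.593269992

# Piece 1: how the total error changes with the output of o1.
# Only E_o1 = 1/2 (out_o1 - target_o1)^2 depends on out_o1.
dE_dout_o1 = out_o1 - target_o1

# Piece 2: how the output of o1 changes with its net input (sigmoid derivative).
dout_dnet_o1 = out_o1 * (1.0 - out_o1)

# Piece 3: how the net input changes with w5.
# net_o1 = w5*out_h1 + w6*out_h2 + b2, so the derivative is out_h1.
dnet_dw5 = out_h1

# Chain rule: multiply the three pieces.
dE_dw5 = dE_dout_o1 * dout_dnet_o1 * dnet_dw5
print(dE_dw5)
```

The result is positive, so increasing w5 would increase the error, and gradient descent will therefore decrease w5.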

53 of 55

After we have computed the gradient ∂Etotal/∂w5, we update the weight w5 using the gradient descent formula:

w5new = w5 − α ⋅ (∂Etotal/∂w5), where α is the learning rate.

The same process applies to the other weights.
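A sketch of the update step for w5 follows. The starting weight w5 = 0.40 comes from the forward-pass calculation. The learning rate α = 0.5 is an assumption (the text does not state a value), and the gradient is the rounded product of the three chain-rule factors for this example, which is also not printed in the text.

```python
def gradient_descent_step(weight, gradient, learning_rate):
    """One gradient descent update: move the weight against its gradient."""
    return weight - learning_rate * gradient

w5 = 0.40                  # current weight, from the example
grad_w5 = 0.082167041      # ASSUMPTION: rounded chain-rule result for this example
alpha = 0.5                # ASSUMPTION: learning rate not given in the text

w5_new = gradient_descent_step(w5, grad_w5, alpha)
print(w5_new)
```

The same helper would be applied to w6 through w8 and then to w1 through w4, each with its own gradient, before the next forward pass.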

54 of 55

Additional (If you want to see more detailed calculation)

https://hmkcode.com/ai/backpropagation-step-by-step/

55 of 55

Interpretation of Partial Derivatives

HELPING SLIDES