Swayam Prabha

Course Title

Multivariate Data Mining- Methods and Applications

Lecture 24

Backpropagation of Errors Algorithm

By

Anoop Chaturvedi

Department of Statistics, University of Allahabad

Prayagraj (India)

Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha

To summarize, the backpropagation algorithm involves two main steps:

  • Forward propagation ⇒ Inputs are passed through the network to generate predictions, and activations are computed sequentially for each layer.
  • Backward propagation ⇒ Errors are propagated backward to update the weights. Gradients of the loss function with respect to the weights are computed layer by layer using the chain rule of calculus, and these gradients are used to update the weights through optimization techniques such as gradient descent (a minimal sketch follows this list).
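
A minimal NumPy sketch of these two steps for a single hidden layer with sigmoid activations and a squared-error loss; the toy data, layer sizes, and learning rate are illustrative assumptions, not the lecture's worked example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 inputs (toy data)
y = (X[:, :1] + X[:, 1:2] > 0).astype(float)   # toy binary target, shape (100, 1)

# small random initial weights (see the slide on initial values)
W1, b1 = 0.1 * rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = 0.1 * rng.normal(size=(5, 1)), np.zeros(1)
lr = 0.5                                       # illustrative learning rate

for epoch in range(200):
    # forward propagation: activations computed layer by layer
    A1 = sigmoid(X @ W1 + b1)                  # hidden-layer activations
    A2 = sigmoid(A1 @ W2 + b2)                 # network output (predictions)

    # backward propagation: chain rule gives gradients layer by layer
    dA2 = (A2 - y) / len(X)                    # derivative of squared-error loss (up to a factor of 2)
    dZ2 = dA2 * A2 * (1 - A2)                  # through the output sigmoid
    dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)         # propagate error to the hidden layer
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # gradient-descent weight updates
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
```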

Initial Values

  • If the initial weights are exactly zero, the derivatives are zero and the algorithm never moves.
  • Starting with large weights leads to poor solutions.
  • If the weights are near zero, the operative part of the sigmoid is nearly linear. Thus, the model starts out nearly linear and becomes nonlinear as the weights increase.
  • Usually, initial values for the weights are chosen as small random numbers (e.g., uniformly distributed near zero with small variance), as in the sketch below.
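
A small sketch of this initialization advice, assuming a uniform distribution on a narrow interval such as [-0.05, 0.05] (the interval itself is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_in, n_out, scale=0.05):
    """Small uniform random weights keep the sigmoid near its linear region."""
    W = rng.uniform(-scale, scale, size=(n_in, n_out))
    b = np.zeros(n_out)              # biases can safely start at zero
    return W, b

# all-zero weights (bad): every unit receives identical zero gradients, so learning never starts
W_bad = np.zeros((3, 5))

# small random weights (recommended): symmetry is broken, the model starts out
# nearly linear and becomes nonlinear as the weights grow during training
W_good, b_good = init_layer(3, 5)
```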

  • When all training samples are used to form a single batch, the learning algorithm is batch gradient descent.
  • When the batch size is one sample, the learning algorithm is stochastic gradient descent.
  • When the batch size is more than one but less than the size of the training dataset, the learning algorithm is called mini-batch gradient descent (the three regimes are contrasted in the sketch below).
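
The three regimes differ only in how many samples enter each weight update. A runnable sketch on a toy linear least-squares problem (the data, learning rate, and gradient function are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=60)     # toy regression data

def grad(w, Xb, yb):
    """Gradient of the mean squared error on a batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

def gradient_descent(batch_size, lr=0.05, epochs=50):
    w = np.zeros(4)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)               # shuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad(w, X[idx], y[idx])    # one update per batch
    return w

w_batch = gradient_descent(batch_size=len(X))    # batch gradient descent
w_sgd   = gradient_descent(batch_size=1)         # stochastic gradient descent
w_mini  = gradient_descent(batch_size=16)        # mini-batch gradient descent
```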

Input Scaling:

Before fitting, scale the inputs to the interval [0,1] or [-1,1], or standardize them to have zero mean and unit variance.
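
For example, the three options can be implemented as follows (the feature matrix is an illustrative assumption):

```python
import numpy as np

X = np.array([[150.0, 0.2],
              [180.0, 0.8],
              [165.0, 0.5]])                      # toy feature matrix

# scale each column to the interval [0, 1]
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# scale each column to the interval [-1, 1]
X_11 = 2.0 * X_01 - 1.0

# standardize each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```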

How many Hidden nodes and Layers?

Ockham’s razor ⇒ Keep the model as simple as possible while maintaining its ability to generalize well.

  • Employ cross-validation ⇒ the error surface has multiple local minima, so different training runs converge to different solutions. Which solution should be used in each round of CV?

  • Decide based on the context of the problem, or by trial and error (see the cross-validation sketch below).
  • Each layer extracts features of the input for regression/classification. Using multiple hidden layers allows the construction of hierarchical features at different levels.
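
One way to carry out the trial-and-error/cross-validation selection, assuming scikit-learn is available (the candidate architectures and synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# candidate architectures: one or two hidden layers of varying width
param_grid = {"hidden_layer_sizes": [(5,), (10,), (20,), (10, 5), (20, 10)]}

search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),  # fixed seed: one local minimum per fit
    param_grid,
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Fixing the random seed means each CV fit converges to one particular local minimum, which is one pragmatic answer to the question of which solution to use in each round.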

Overfitting and Network Pruning

  • Reduce the number of parameters while retaining the performance characteristics.
  • Set insignificant parameters equal to zero (optimal brain damage); a simplified pruning sketch follows.
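
A minimal magnitude-based pruning sketch: weights whose absolute value falls below a threshold are set to zero. This is a simpler stand-in for Hessian-based schemes such as optimal brain damage; the threshold and weight matrix are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(scale=0.5, size=(5, 3))      # trained weights (toy example)

threshold = 0.1
mask = np.abs(W) >= threshold               # keep only the "significant" weights
W_pruned = W * mask                         # insignificant parameters set to zero

print("parameters before:", W.size)
print("parameters after :", int(mask.sum()))
```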

Figure: an original image together with its rotated, mirrored, translated, and scaled versions.

An MLP is trained to learn a map from input to output, where the input comes from a fixed location.

It is not feasible to train the MLP for all possible transformations and their combinations.

Thus, the MLP may not be able to identify digits or images in the presence of these transformations.