1 of 66

Fundamentals of Deep Neural Networks


2 of 66

First Things to Note

  • In Machine Learning / Deep Learning

    • We almost always deal with mathematical functions

    • A model almost always refers to a mathematical function

[Diagram: INPUT → F() → Output / Target]

3 of 66

ML: Design Pattern

Problem statement:

    • Know your data well
    • What is the input?
    • What is the output/target?

Define your model:

    • Choose a reasonable mathematical function
      • Contains unknown parameters

Define the objective function:

    • Error / loss / profit

Optimize the objective:

    • Use training data and an optimization method to estimate the parameters

Now you have everything for your model:

    • Freeze the model and start using it

4 of 66

Let us Start with a Problem


5 of 66

Movie Recommendation

    • You have two friends, John and Mary. You have prior data on movies they rated, together with whether you liked or disliked each of those movies. Now a new movie named ‘Gravity’ has been released and they have rated it; you want to decide whether you should watch it.

6 of 66

Know Your Data

[Data table: each row is a sample/example; the columns are the features (input variables) and the output/target.]

7 of 66

Decision Boundary


8 of 66

Linear Decision Boundary

[Plot: a straight line separating Class 0 from Class 1.]

9 of 66

Non-linear boundary


10 of 66

Optimization: Gradient Descent


11 of 66

Review of Differential Calculus

[Plot of a curve with points marked on it, asking at each: what is the sign of the derivative here? The derivative is positive where the function increases and negative where it decreases.]

12 of 66

Local and Global minima

[Plot of F(x) vs. x showing local and global minima.]

13 of 66

Gradient Descent

[Diagram: training data, with inputs and outputs, feeding the model.]

14 of 66

Gradient Descent

  • Role of gradient descent
    • To find the values of parameters that minimize an objective function

[Plot of the objective function with the minimum marked: “we want to reach here”.]

15 of 66

Gradient Descent on Multivariate Function


16 of 66

Gradient Descent Algorithm

  • How to check convergence?
    • Two widely used approaches:
      1. Iterate a fixed number of times
      2. Stop if the parameters do not change much in two consecutive iterations
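The algorithm itself appeared as a figure; a minimal sketch of gradient descent with both convergence checks is shown below (the example function, learning rate, and starting point are my own illustration):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, max_iters=1000, tol=1e-8):
    """Minimize a function given its gradient grad_f, starting from x0.

    Stops after a fixed number of iterations, or earlier when the
    parameters change very little between two consecutive iterations.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        x_new = x - lr * grad_f(x)           # step against the gradient
        if np.linalg.norm(x_new - x) < tol:  # convergence check 2
            return x_new
        x = x_new
    return x                                 # convergence check 1

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```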

17 of 66

Limitations of Gradient Descent


18 of 66

 

  • Small value of r (learning rate): slow convergence

  • Large value of r: oscillation / overshooting

19 of 66

Adaptive Gradient Descent


20 of 66

Adagrad

  • Main idea:

    • The learning rate is different for different parameters

    • A parameter's learning rate depends on its previous gradients

21 of 66

Adagrad


22 of 66

Adagrad

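The Adagrad update equations were not preserved in extraction; a minimal sketch of the rule described above, with the example objective and learning rate made up for illustration:

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.1, eps=1e-8):
    """One Adagrad update: each parameter gets its own effective
    learning rate, shrunk by the sum of its past squared gradients."""
    cache += grad ** 2                        # accumulate squared gradients
    w -= lr * grad / (np.sqrt(cache) + eps)   # per-parameter scaled step
    return w, cache

# Example: minimize f(w) = w1^2 + 10*w2^2 (gradient: [2*w1, 20*w2]).
w = np.array([1.0, 1.0])
cache = np.zeros_like(w)
for _ in range(500):
    g = np.array([2 * w[0], 20 * w[1]])
    w, cache = adagrad_step(w, g, cache)
```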

23 of 66

Adaptive Moment Estimation (ADAM)
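A sketch of the ADAM rule: it combines a running mean of the gradients (momentum) with an Adagrad-style per-parameter scaling. The hyper-parameter defaults follow common practice, and the example objective is made up:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: mean of squares
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example: minimize f(w) = (w - 2)^2.
w = np.array([0.0]); m = np.zeros(1); v = np.zeros(1)
for t in range(1, 2001):
    g = 2 * (w - 2)
    w, m, v = adam_step(w, g, m, v, t)
```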

24 of 66

Deep Neural Network


25 of 66

Motivation


26 of 66

A Motivating Problem

  • Movie recommendation:
    • You have two friends, John and Mary. You have prior data on movies they rated, together with whether you liked or disliked each of those movies. Now a new movie named ‘Gravity’ has been released and they have rated it; you want to decide whether you should watch it.

27 of 66

Movie Recommendation

  • Plot of the data:

[Scatter plot: movies you liked vs. disliked, with the new movie marked.]

Observation: the classes can be separated by a straight line

28 of 66

Movie Recommendation

  • How can you predict the label for the new movie, Gravity?

    • Answer: logistic regression

29 of 66

Logistic Regression: Graphical View

[Diagram: inputs are weighted, summed into z, and passed through f(z) to produce the output.]

Non-Linear Decision Boundary
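The graphical view of logistic regression (a linear combination passed through f(z)) can be sketched as follows; the weights, bias, and example ratings are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Logistic regression: linear combination followed by the sigmoid."""
    z = np.dot(w, x) + b
    return sigmoid(z)  # probability of class 1 ("you will like it")

# Hypothetical weights over John's and Mary's ratings (1-5 scale).
w = np.array([1.5, 1.5]); b = -7.5
p = predict(np.array([4.0, 3.0]), w, b)
```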

30 of 66

Limitations of Linear Decision Function

  • Now consider data like this:

The decision function involves two straight lines

Thus, we need to combine information from two straight lines

31 of 66

Addressing Non-linearity

Here is a good network:

[Diagram: two hidden nodes and one output node, each applying the activation f to a weighted combination of its inputs.]

  • The shaded nodes are called hidden nodes / neurons

  • Each hidden node corresponds to one straight line in the data

Moral of the story:

Hidden nodes help make decisions in complex cases

In ML, a complex case means a highly non-linear decision boundary separating the classes

The more hidden layers you use, the more complex the decisions your model can make

32 of 66

Structure of Deep Network


33 of 66

Structure of Deep Network

  • One input layer with a set of nodes that take the feature values for a sample
    • # nodes in this layer = dimension of the data = # features

  • One output layer with a set of nodes that gives the class labels
    • # nodes in this layer = # class labels in the training data

  • Multiple hidden layers, each containing a set of nodes
    • The higher the number of hidden layers, the deeper the network

[Diagram: components of a deep network, from the input layer through the hidden layers to the output.]

34 of 66

Ingredients of a Deep Network


35 of 66

Ingredients of a Deep Network


Things you need to have in order to describe a deep network:

 

36 of 66

Ingredients of a Deep Network

What does a non-input neuron of the network do?

Each node takes inputs from the nodes of the previous layer, linearly combines them, and then passes the result through an activation function.
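A minimal sketch of what one such neuron computes (the tanh activation and all values here are my own illustration):

```python
import numpy as np

def neuron(inputs, weights, bias, activation=np.tanh):
    """One non-input neuron: linearly combine the outputs of the
    previous layer, then pass the result through an activation."""
    z = np.dot(weights, inputs) + bias
    return activation(z)

# Two inputs from the previous layer, with hypothetical weights.
a = neuron(np.array([0.5, -0.2]), np.array([1.0, 2.0]), 0.1)
```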

37 of 66

Activation Function


38 of 66

Deep vs. Shallow Network

What is the number of features in the data the deep network is trying to model?

How many class labels are there in that data?

How many parameters does the deep net have?

39 of 66

Training Deep Network


40 of 66

Backpropagation Algorithm: The Pillar of Deep Learning

41 of 66

Training Neural Network

  • Let us start with the simplest network:
    • One feature, two classes, two hidden nodes

[Diagram: input x feeding hidden nodes 1 and 2, whose outputs feed output node 3.]

42 of 66

Training Neural Network


43 of 66

Training Neural Network


  • How can we find good values of the parameters?
    • Answer: by gradient descent

44 of 66

Training Neural Network


Let us compute the gradients, using the chain rule of derivatives.

45 of 66

Training Neural Network: Computing Gradients

Gradients: the common parts. The partial derivatives computed at the output are reused when computing the gradients of earlier layers.

This method is known as backpropagation
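A minimal sketch of these chain-rule computations for the tiny network above (one feature, two sigmoid hidden nodes, one sigmoid output). The squared-error loss and all symbol names are my own, since the slide's equations are not reproduced here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, w1, b1, w2, b2, v1, v2, c):
    # Forward pass.
    h1 = sigmoid(w1 * x + b1)            # hidden node 1
    h2 = sigmoid(w2 * x + b2)            # hidden node 2
    o  = sigmoid(v1 * h1 + v2 * h2 + c)  # output node
    loss = 0.5 * (o - y) ** 2

    # Backward pass: d_o is the common part reused by every gradient.
    d_o  = (o - y) * o * (1 - o)         # dLoss/d(output pre-activation)
    d_v1 = d_o * h1                      # chain rule: dLoss/dv1
    d_v2 = d_o * h2
    d_h1 = d_o * v1 * h1 * (1 - h1)      # propagate back into hidden node 1
    d_w1 = d_h1 * x                      # chain rule: dLoss/dw1
    return loss, d_v1, d_v2, d_w1

loss, d_v1, d_v2, d_w1 = forward_backward(
    x=1.0, y=1.0, w1=0.5, b1=0.0, w2=-0.3, b2=0.1, v1=0.8, v2=-0.2, c=0.05)
```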

46 of 66

  1. Initialize the network parameters (weights and biases)

  2. Choose an optimization method (vanilla SGD, ADAM, etc.)

  3. Repeat the following steps (until you are happy with the result):

    • Take a forward pass for an input sample

    • Compute the cost function

    • Compute the gradients of the cost with respect to the parameters using backpropagation

    • Update each parameter using the gradients, according to the optimization algorithm
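The steps above can be sketched end-to-end for a tiny network; the data, layer sizes, learning rate, and the binary cross-entropy cost are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 2 features, binary labels (linearly separable).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# 1. Initialize the network parameters (weights and biases).
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
lr = 1.0  # 2. Optimization method: vanilla gradient descent.

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):  # 3. Repeat until happy with the result.
    # Forward pass for the whole batch.
    H = sigmoid(X @ W1 + b1)
    p = sigmoid(H @ W2 + b2).ravel()
    # Cost: binary cross-entropy (clipped for numerical safety).
    pc = np.clip(p, 1e-9, 1 - 1e-9)
    cost = -np.mean(y * np.log(pc) + (1 - y) * np.log(1 - pc))
    # Gradients via backpropagation.
    d_logit = (p - y) / len(y)                   # dCost/d(output logit)
    dW2 = H.T @ d_logit[:, None]
    db2 = d_logit.sum(keepdims=True)
    d_H = (d_logit[:, None] @ W2.T) * H * (1 - H)
    dW1 = X.T @ d_H
    db1 = d_H.sum(axis=0)
    # Update each parameter.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = np.mean((p > 0.5) == (y == 1.0))
```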

47 of 66


Other Important Details

48 of 66

Parameter Initialization

  • Very large initialization leads to exploding gradients

  • Very small initialization leads to vanishing gradients

  • We need to maintain a balance


49 of 66

Initialization

  • Xavier initialization


50 of 66

Initialization

  • Kaiming Initialization

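Sketches of both initializations (the normal-distribution variants; the fan sizes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: variance scaled by both fan-in and fan-out
    (suits tanh/sigmoid activations)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def kaiming_init(fan_in, fan_out):
    """Kaiming/He: variance scaled by fan-in only, with a factor of 2
    to compensate for ReLU zeroing half of its inputs."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W  = kaiming_init(512, 256)
Wx = xavier_init(256, 128)
```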

51 of 66

Computing Loss


52 of 66

Cross Entropy

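As a concrete reference, a minimal sketch of cross-entropy paired with a softmax output (the example logits are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample: the negative log-probability
    the model assigns to the true class."""
    return -np.log(probs[label])

p = softmax(np.array([2.0, 1.0, 0.1]))
loss = cross_entropy(p, 0)  # true class is class 0
```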

53 of 66

Regularization


54 of 66

Improving Single Model Performance


55 of 66

Regularization

  • Regularization techniques are essential in deep learning to prevent overfitting, improve generalization, and ensure that the model performs well on unseen data.
  • Overfitting occurs when a model learns the noise in the training data instead of the underlying pattern, leading to poor performance on new, unseen data.

Common Regularization Techniques:

  • L1 and L2 Regularization (Weight Decay)
  • Dropout
  • Early Stopping
  • Batch Normalization
  • Data Augmentation

56 of 66

Regularization: L1 and L2 Regularization (Weight Decay)

  • Key idea
    • Add a term to the error/loss function

57 of 66

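The added penalty terms can be written as follows, with $\lambda$ the regularization strength and $L(w)$ the original loss:

```latex
% L2 (weight decay): penalize the squared magnitude of the weights
J_{\mathrm{L2}}(w) = L(w) + \lambda \sum_i w_i^2
% L1: penalize the absolute magnitude, which pushes weights toward exactly zero
J_{\mathrm{L1}}(w) = L(w) + \lambda \sum_i \lvert w_i \rvert
```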

58 of 66

Regularization: Dropout


59 of 66

Regularization: Dropout

Key Idea: During training, randomly drop some neurons. The probability of dropping is a hyper-parameter.

Srivastava et al.
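A minimal sketch of (inverted) dropout, which rescales the surviving activations so the expected output is unchanged; the drop probability and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero neurons during training and
    rescale the survivors; at test time the layer is a no-op."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop  # keep with prob 1 - p_drop
    return activations * mask / (1.0 - p_drop)

h = dropout(np.ones(10000), p_drop=0.5)
```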

60 of 66

Regularization: Early Stopping

Key Idea:

Early stopping monitors the model's performance on a validation set during training and stops when that performance stops improving. This prevents overfitting by not letting the model train too long on the training data.

61 of 66

Regularization: Batch Normalization


Key Idea: Normalizes the inputs of each layer to have zero mean and unit variance.

It helps stabilize and accelerate training by reducing internal covariate shift.
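A minimal sketch of the normalization step in training mode (the learnable scale γ and shift β default to the identity here; the example batch is made up):

```python
import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization (training mode): normalize each feature over
    the batch to zero mean / unit variance, then apply a learnable
    scale (gamma) and shift (beta)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

# A batch of 3 samples with 2 features on very different scales.
X = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
out = batch_norm(X)
```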

 

62 of 66

Regularization: Data Augmentation


Source: Fei Fei Li

63 of 66

Data Augmentation: Image Transformation


Source: Fei Fei Li

64 of 66

Data Augmentation: Random Crops and Scales

Source: Fei Fei Li

  • During training, add random crops

  • Resize the training images

  • Sample a random patch

65 of 66

Data Augmentation: Color Changes


Source: Fei Fei Li

  • Randomize contrast and brightness

66 of 66

Thank you!
