1 of 66

Fundamentals of Deep Neural Networks


2 of 66

First Things to Note

  • In Machine Learning / Deep Learning

    • We almost always deal with mathematical functions

    • A model almost always refers to a mathematical function

[Diagram: INPUT → F() → Output / Target]

3 of 66

ML: Design Pattern

Problem statement:

    • Know your data well
    • What is the input?
    • What is the output/target?

Define your model:

    • Choose a reasonable mathematical function
      • Contains unknown parameters

Define the objective function:

    • Error / loss / profit

Optimize the objective:

    • Use training data and an optimization method to estimate the parameters

Now you have everything for your model:

    • Freeze the model and start using it

4 of 66

Let us Start with a Problem


5 of 66

Movie Recommendation

    • You have two friends, John and Mary. You have prior data on movies they rated, together with whether you liked or disliked each of those movies. Now a new movie named ‘Gravity’ has been released and they have rated it; you want to decide whether you should watch it.

6 of 66

Know Your Data

[Data table: each row is a sample/example; the columns are the features (input variables) and the output/target.]

7 of 66

Decision Boundary


8 of 66

Linear Decision Boundary

[Plot: a straight line separating Class 0 from Class 1.]

9 of 66

Non-linear boundary


10 of 66

Optimization: Gradient Descent


11 of 66

Review of Differential Calculus

[Plot of a curve with points marked on it, asking at each: what is the sign of the derivative here? The derivative is positive where the function increases and negative where it decreases.]

12 of 66

Local and Global minima

[Plot of F(x) vs. x showing local and global minima.]

13 of 66

Gradient Descent

[Diagram: training data, with inputs and outputs, feeding the model.]

14 of 66

Gradient Descent

  • Role of gradient descent
    • To find the values of parameters that minimize an objective function

[Plot of the objective function with the minimum marked: “we want to reach here”.]

15 of 66

Gradient Descent on Multivariate Function


16 of 66

Gradient Descent Algorithm

  • How to check convergence?
    • Two widely used approaches:
      1. Iterate a fixed number of times
      2. Stop if the parameters do not change much in two consecutive iterations
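The algorithm itself appeared as a figure; a minimal sketch of gradient descent with both convergence checks is shown below (the example function, learning rate, and starting point are my own illustration):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, max_iters=1000, tol=1e-8):
    """Minimize a function given its gradient grad_f, starting from x0.

    Stops after a fixed number of iterations, or earlier when the
    parameters change very little between two consecutive iterations.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        x_new = x - lr * grad_f(x)           # step against the gradient
        if np.linalg.norm(x_new - x) < tol:  # convergence check 2
            return x_new
        x = x_new
    return x                                 # convergence check 1

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```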

17 of 66

Limitations of Gradient Descent


18 of 66

 

  • Small value of r (learning rate): slow convergence

  • Large value of r: oscillation / overshooting

19 of 66

Adaptive Gradient Descent


20 of 66

Adagrad

  • Main idea:

    • The learning rate is different for different parameters

    • A parameter's learning rate depends on its previous gradients

21 of 66

Adagrad


22 of 66

Adagrad

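The Adagrad update equations were not preserved in extraction; a minimal sketch of the rule described above, with the example objective and learning rate made up for illustration:

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.1, eps=1e-8):
    """One Adagrad update: each parameter gets its own effective
    learning rate, shrunk by the sum of its past squared gradients."""
    cache += grad ** 2                        # accumulate squared gradients
    w -= lr * grad / (np.sqrt(cache) + eps)   # per-parameter scaled step
    return w, cache

# Example: minimize f(w) = w1^2 + 10*w2^2 (gradient: [2*w1, 20*w2]).
w = np.array([1.0, 1.0])
cache = np.zeros_like(w)
for _ in range(500):
    g = np.array([2 * w[0], 20 * w[1]])
    w, cache = adagrad_step(w, g, cache)
```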

23 of 66

Adaptive Moment Estimation (ADAM)
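A sketch of the ADAM rule: it combines a running mean of the gradients (momentum) with an Adagrad-style per-parameter scaling. The hyper-parameter defaults follow common practice, and the example objective is made up:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: mean of squares
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example: minimize f(w) = (w - 2)^2.
w = np.array([0.0]); m = np.zeros(1); v = np.zeros(1)
for t in range(1, 2001):
    g = 2 * (w - 2)
    w, m, v = adam_step(w, g, m, v, t)
```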

24 of 66

Deep Neural Network


25 of 66

Motivation


26 of 66

A Motivating Problem

  • Movie recommendation:
    • You have two friends, John and Mary. You have prior data on movies they rated, together with whether you liked or disliked each of those movies. Now a new movie named ‘Gravity’ has been released and they have rated it; you want to decide whether you should watch it.

27 of 66

Movie Recommendation

  • Plot of the data:

[Scatter plot: movies you liked vs. disliked, with the new movie marked.]

Observation: the classes can be separated by a straight line

28 of 66

Movie Recommendation

  • How can you predict the label for the new movie, Gravity?

    • Answer: logistic regression

29 of 66

Logistic Regression: Graphical View

[Diagram: inputs are weighted, summed into z, and passed through f(z) to produce the output.]

Non-Linear Decision Boundary
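The graphical view of logistic regression (a linear combination passed through f(z)) can be sketched as follows; the weights, bias, and example ratings are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Logistic regression: linear combination followed by the sigmoid."""
    z = np.dot(w, x) + b
    return sigmoid(z)  # probability of class 1 ("you will like it")

# Hypothetical weights over John's and Mary's ratings (1-5 scale).
w = np.array([1.5, 1.5]); b = -7.5
p = predict(np.array([4.0, 3.0]), w, b)
```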

30 of 66

Limitations of Linear Decision Function

  • Now consider data like this:

The decision function involves two straight lines

Thus, we need to combine information from two straight lines

31 of 66

Addressing Non-linearity

Here is a good network:

[Diagram: two hidden nodes and one output node, each applying the activation f to a weighted combination of its inputs.]

  • The shaded nodes are called hidden nodes / neurons

  • Each hidden node corresponds to one straight line in the data

Moral of the story:

Hidden nodes help make decisions in complex cases

In ML, a complex case means a highly non-linear decision boundary separating the classes

The more hidden layers you use, the more complex the decisions your model can make

32 of 66

Structure of Deep Network


33 of 66

Structure of Deep Network

  • One input layer with a set of nodes that take the feature values for a sample
    • # nodes in this layer = dimension of the data = # features

  • One output layer with a set of nodes that gives the class labels
    • # nodes in this layer = # class labels in the training data

  • Multiple hidden layers, each containing a set of nodes
    • The higher the number of hidden layers, the deeper the network

[Diagram: components of a deep network, from the input layer through the hidden layers to the output.]

34 of 66

Ingredients of a Deep Network


35 of 66

Ingredients of a Deep Network


Things you need to have in order to describe a deep network:

 

36 of 66

Ingredients of a Deep Network

What does a non-input neuron of the network do?

Each node takes inputs from the nodes of the previous layer, linearly combines them, and then passes the result through an activation function.
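A minimal sketch of what one such neuron computes (the tanh activation and all values here are my own illustration):

```python
import numpy as np

def neuron(inputs, weights, bias, activation=np.tanh):
    """One non-input neuron: linearly combine the outputs of the
    previous layer, then pass the result through an activation."""
    z = np.dot(weights, inputs) + bias
    return activation(z)

# Two inputs from the previous layer, with hypothetical weights.
a = neuron(np.array([0.5, -0.2]), np.array([1.0, 2.0]), 0.1)
```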

37 of 66

Activation Function


38 of 66

Deep vs. Shallow Network

What is the number of features in the data the deep network is trying to model?

How many class labels are there in that data?

How many parameters does the deep net have?

39 of 66

Training Deep Network


40 of 66

Backpropagation Algorithm: The Pillar of Deep Learning

41 of 66

Training Neural Network

  • Let us start with the simplest network:
    • One feature, two classes, two hidden nodes

[Diagram: input x feeding hidden nodes 1 and 2, whose outputs feed output node 3.]

42 of 66

Training Neural Network


43 of 66

Training Neural Network


  • How can we find good values of the parameters?
    • Answer: by gradient descent

44 of 66

Training Neural Network


Let us compute the gradients, using the chain rule of derivatives.

45 of 66

Training Neural Network: Computing Gradients

Gradients: the common parts. The partial derivatives computed at the output are reused when computing the gradients of earlier layers.

This method is known as backpropagation
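A minimal sketch of these chain-rule computations for the tiny network above (one feature, two sigmoid hidden nodes, one sigmoid output). The squared-error loss and all symbol names are my own, since the slide's equations are not reproduced here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, w1, b1, w2, b2, v1, v2, c):
    # Forward pass.
    h1 = sigmoid(w1 * x + b1)            # hidden node 1
    h2 = sigmoid(w2 * x + b2)            # hidden node 2
    o  = sigmoid(v1 * h1 + v2 * h2 + c)  # output node
    loss = 0.5 * (o - y) ** 2

    # Backward pass: d_o is the common part reused by every gradient.
    d_o  = (o - y) * o * (1 - o)         # dLoss/d(output pre-activation)
    d_v1 = d_o * h1                      # chain rule: dLoss/dv1
    d_v2 = d_o * h2
    d_h1 = d_o * v1 * h1 * (1 - h1)      # propagate back into hidden node 1
    d_w1 = d_h1 * x                      # chain rule: dLoss/dw1
    return loss, d_v1, d_v2, d_w1

loss, d_v1, d_v2, d_w1 = forward_backward(
    x=1.0, y=1.0, w1=0.5, b1=0.0, w2=-0.3, b2=0.1, v1=0.8, v2=-0.2, c=0.05)
```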

46 of 66

  1. Initialize the network parameters (weights and biases)

  2. Choose an optimization method (vanilla SGD, ADAM, etc.)

  3. Repeat the following steps (until you are happy with the result):

    • Take a forward pass for an input sample

    • Compute the cost function

    • Compute the gradients of the cost with respect to the parameters using backpropagation

    • Update each parameter using the gradients, according to the optimization algorithm
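The steps above can be sketched end-to-end for a tiny network; the data, layer sizes, learning rate, and the binary cross-entropy cost are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 2 features, binary labels (linearly separable).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# 1. Initialize the network parameters (weights and biases).
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
lr = 1.0  # 2. Optimization method: vanilla gradient descent.

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):  # 3. Repeat until happy with the result.
    # Forward pass for the whole batch.
    H = sigmoid(X @ W1 + b1)
    p = sigmoid(H @ W2 + b2).ravel()
    # Cost: binary cross-entropy (clipped for numerical safety).
    pc = np.clip(p, 1e-9, 1 - 1e-9)
    cost = -np.mean(y * np.log(pc) + (1 - y) * np.log(1 - pc))
    # Gradients via backpropagation.
    d_logit = (p - y) / len(y)                   # dCost/d(output logit)
    dW2 = H.T @ d_logit[:, None]
    db2 = d_logit.sum(keepdims=True)
    d_H = (d_logit[:, None] @ W2.T) * H * (1 - H)
    dW1 = X.T @ d_H
    db1 = d_H.sum(axis=0)
    # Update each parameter.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = np.mean((p > 0.5) == (y == 1.0))
```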

47 of 66


Other Important Details

48 of 66

Parameter Initialization

  • Very large initialization leads to exploding gradients

  • Very small initialization leads to vanishing gradients

  • We need to maintain a balance


49 of 66

Initialization

  • Xavier initialization


50 of 66

Initialization

  • Kaiming Initialization

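Sketches of both initializations (the normal-distribution variants; the fan sizes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: variance scaled by both fan-in and fan-out
    (suits tanh/sigmoid activations)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def kaiming_init(fan_in, fan_out):
    """Kaiming/He: variance scaled by fan-in only, with a factor of 2
    to compensate for ReLU zeroing half of its inputs."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W  = kaiming_init(512, 256)
Wx = xavier_init(256, 128)
```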

51 of 66

Computing Loss


52 of 66

Cross Entropy

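As a concrete reference, a minimal sketch of cross-entropy paired with a softmax output (the example logits are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample: the negative log-probability
    the model assigns to the true class."""
    return -np.log(probs[label])

p = softmax(np.array([2.0, 1.0, 0.1]))
loss = cross_entropy(p, 0)  # true class is class 0
```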

53 of 66

Regularization


54 of 66

Improving Single Model Performance


55 of 66

Regularization

  • Regularization techniques are essential in deep learning to prevent overfitting, improve generalization, and ensure that the model performs well on unseen data.
  • Overfitting occurs when a model learns the noise in the training data instead of the underlying pattern, leading to poor performance on new, unseen data.

Common Regularization Techniques:

  • L1 and L2 Regularization (Weight Decay)
  • Dropout
  • Early Stopping
  • Batch Normalization
  • Data Augmentation

56 of 66

Regularization: L1 and L2 Regularization (Weight Decay)

  • Key idea
    • Add a term to the error/loss function

57 of 66

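The added penalty terms can be written as follows, with $\lambda$ the regularization strength and $L(w)$ the original loss:

```latex
% L2 (weight decay): penalize the squared magnitude of the weights
J_{\mathrm{L2}}(w) = L(w) + \lambda \sum_i w_i^2
% L1: penalize the absolute magnitude, which pushes weights toward exactly zero
J_{\mathrm{L1}}(w) = L(w) + \lambda \sum_i \lvert w_i \rvert
```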

58 of 66

Regularization: Dropout


59 of 66

Regularization: Dropout

Key Idea: During training, randomly drop some neurons. The probability of dropping is a hyper-parameter.

Srivastava et al.
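A minimal sketch of (inverted) dropout, which rescales the surviving activations so the expected output is unchanged; the drop probability and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero neurons during training and
    rescale the survivors; at test time the layer is a no-op."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop  # keep with prob 1 - p_drop
    return activations * mask / (1.0 - p_drop)

h = dropout(np.ones(10000), p_drop=0.5)
```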

60 of 66

Regularization: Early Stopping

Key Idea:

Early stopping monitors the model's performance on a validation set during training and stops when that performance stops improving. This prevents overfitting by not letting the model train too long on the training data.

61 of 66

Regularization: Batch Normalization


Key Idea: Normalizes the inputs of each layer to have zero mean and unit variance.

It helps stabilize and accelerate training by reducing internal covariate shift.
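A minimal sketch of the normalization step in training mode (the learnable scale γ and shift β default to the identity here; the example batch is made up):

```python
import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization (training mode): normalize each feature over
    the batch to zero mean / unit variance, then apply a learnable
    scale (gamma) and shift (beta)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

# A batch of 3 samples with 2 features on very different scales.
X = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
out = batch_norm(X)
```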

 

62 of 66

Regularization: Data Augmentation


Source: Fei Fei Li

63 of 66

Data Augmentation: Image Transformation


Source: Fei Fei Li

64 of 66

Data Augmentation: Random Crops and Scales

Source: Fei Fei Li

  • During training, add random crops

  • Resize the training images

  • Sample a random patch

65 of 66

Data Augmentation: Color Changes


Source: Fei Fei Li

  • Randomize contrast and brightness

66 of 66

Thank you!
