1 of 48

Lecture 10: Deep Neural Networks

Applied Data Science Spring 2025

Amir Hesam Salavati

saloot@gmail.com

Hamed Shah-Mansouri

hamedsh@sharif.edu

2 of 48

https://ghabehfarda.ir/

3 of 48

Last Session We Covered...

Intro to Neural Networks

Some History

Feedforward Architectures

Recurrent Neural Networks

4 of 48

From Shallow to Deep Neural Networks

Image: https://laptrinhx.com/future-prospects-of-deep-learning-in-medicine-687217971/

5 of 48

An Excellent Book to Read

  • Neural Networks and Deep Learning by Michael Nielsen
  • Freely available online at http://neuralnetworksanddeeplearning.com/
  • It is an interactive book! You can play with the algorithms!

6 of 48

Key Concepts

  • Training: use (stochastic) gradient descent with backpropagation
  • Architecture: make things modular
    • have different parts perform different small tasks
    • then combine them at a higher layer
  • Abundance of Data!

7 of 48

Image:https://medium.com/@buckhamduffymedia/understanding-the-relationship-between-artificial-intelligence-machine-learning-and-data-82cbaea388f5

AI is akin to building a rocket ship. The rocket engine is the learning algorithms but the fuel is the huge amounts of data we can feed to these algorithms

Andrew Ng

8 of 48

Some Fascinating Applications

Image generation, e.g. from a prompt like: “A photo of a woman sitting on a cloud in space, wearing an astronaut suit and a helmet. She is smiling and waving at the camera.”

9 of 48

Some Fascinating Applications

Text Generation

https://bard.google.com/

A short paragraph on the power of ANNs (Bard’s response, originally generated in Persian, translated here):

Artificial neural networks (ANNs) are powerful tools that are used in many fields, including face recognition, language translation, and natural language processing. ANNs are modeled on the human brain and are made up of a set of connected nodes that can process information and learn. ANNs are able to learn from data and capture complex patterns, which makes them useful for a wide range of tasks.

In recent years, ANNs have made remarkable progress and are now used in many of the products and services we use every day. For example, ANNs are used for face recognition in smartphones, text translation in translation apps, and word recognition in speech recognition software.

Image: https://arxiv.org/pdf/1912.04958.pdf

10 of 48

Automatically taking care of some of the cumbersome steps we used to perform manually:

  • Feature extraction
  • Kernel selection
  • Feature selection

Image: https://quantdare.com/what-is-the-difference-between-deep-learning-and-machine-learning/

11 of 48

General Formulation

  • We have a vector of numbers (our observations/features) and a target we want to explain
  • Goal: find a function that explains the target as well as possible
  • How: run classification (or regression) on the data (observations) that we have

12 of 48

General Formulation

  • w would be the set of weights in the network, as well as the nodes’ biases (i.e. firing thresholds)
  • f is a function of both the network structure and the nodes’ activation function(s)
  • Goal: find a neural network that predicts the target as accurately as possible (written out below)
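In symbols (the notation here is ours, chosen for illustration): given training pairs (x_i, y_i) and a network f(x; w), we look for the weights that minimize the total loss,

    \hat{w} = \arg\min_w \sum_{i=1}^{n} L\big(y_i, f(x_i; w)\big), \qquad \hat{y} = f(x; \hat{w})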

13 of 48

Gradient Descent

Image: https://miro.medium.com/max/1262/1*v0VYQkVnTfMF5ptEnvAGSA.jpeg

14 of 48

Gradient Descent

Goal: minimize the cost function over the weights

Steps: for each data point (or mini-batch)

  1. Calculate the gradient of the cost function with respect to the weights

  2. Update the weights by taking a small step opposite the gradient direction (the update rule is written out below)

Repeat the above steps several times (until convergence or max_itr)
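Written out, the standard gradient descent update rule is (with \eta denoting the learning rate and C the cost function):

    w_{t+1} = w_t - \eta \, \nabla_w C(w_t)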

15 of 48

Gradient Descent: Why Opposite the Gradient Direction?

Goal: minimize the cost function

Step 2: Update the weights by moving opposite the gradient, because the negative gradient is the direction of steepest local decrease of the cost

Image:https://virgool.io/@danialfarsy/%D8%A8%D8%B1%D8%B1%D8%B3%DB%8C-%D9%88-%D9%85%D9%82%D8%A7%DB%8C%D8%B3%D9%87-batch-gradient-descentmini-batch-gradient-descentstochastic-gradient-descent-n4yklzivliiw

16 of 48

Gradient Descent: Learning Rate

Goal: minimize the cost function

Step 2: Update the weights; the learning rate controls how large a step we take opposite the gradient

Image:https://virgool.io/@danialfarsy/%D8%A8%D8%B1%D8%B1%D8%B3%DB%8C-%D9%88-%D9%85%D9%82%D8%A7%DB%8C%D8%B3%D9%87-batch-gradient-descentmini-batch-gradient-descentstochastic-gradient-descent-n4yklzivliiw

17 of 48

What Happens When the Learning Rate is Too High/Low?

18 of 48

Gradient Descent: Properly Selecting Learning Rate

https://cs231n.github.io/neural-networks-3/

19 of 48

Gradient Descent: Batch Size

  • A large batch size usually means faster learning (in terms of computation time)
    • But it requires more RAM
  • A very small batch size will result in very noisy gradient updates
    • There is a debate on whether this is actually bad or not!
  • Values like 16, 32, or 64 are usually a reasonable bet (see the sketch below)

Step 2: Update the weights using one mini-batch of samples at a time
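A minimal NumPy sketch of mini-batch gradient descent; grad_fn(w, X_batch, y_batch) is a hypothetical callable that returns the gradient of the cost on a mini-batch:

    import numpy as np

    def minibatch_sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
        n = len(X)
        for _ in range(epochs):
            idx = np.random.permutation(n)            # shuffle once per epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                g = grad_fn(w, X[batch], y[batch])    # gradient on this mini-batch only
                w = w - lr * g                        # step opposite the gradient
        return w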

20 of 48

When Shall we Stop Training in Gradient Descent?

21 of 48

Gradient Descent: Stopping

Perform training:

Stop when:

  • Convergence: gradients are (close to) zero!
  • Training loss does not decrease much anymore (early stopping; see the sketch below)
  • Validation loss starts increasing (to avoid overfitting)
  • Maximum number of epochs is reached

Image:https://researchgate.net/figure/Early-stopping-method_fig3_283697186
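A minimal sketch of patience-based early stopping; train_one_epoch and validation_loss are hypothetical callables standing in for your own training and evaluation code:

    def train_with_early_stopping(model, train_one_epoch, validation_loss,
                                  max_epochs=100, patience=5):
        best_val, epochs_without_improvement = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch(model)
            val = validation_loss(model)              # loss on held-out validation data
            if val < best_val:
                best_val, epochs_without_improvement = val, 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                 # validation loss stopped improving
        return model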

22 of 48

Image: towardsdatascience.com/how-does-back-propagation-in-artificial-neural-networks-work-c7cad873ea7

23 of 48

Historical Challenges of Gradient Descent in Neural Nets

  • A key step in training a neural network is to run gradient descent over all training samples.
  • This involves calculating the derivative of the objective function with respect to each weight.
  • Calculating these derivatives one by one would be very slow, as we might have millions of weights.
  • A key requirement is therefore to calculate the gradients quickly and efficiently.

24 of 48

Some Notations

[Figure: a feedforward network with layers 1 to 4, used to introduce the notation for weights and activations]

25 of 48

Some Math

Based on the notation in the previous slide, we can write the output of the neurons in layer l as:

    a^l = \sigma(w^l a^{l-1} + b^l)

where

  w^l : the weight matrix from layer l-1 to layer l
  a^{l-1} : the output of layer l-1’s neurons (in vector form)
  b^l : the bias vector of layer l’s neurons
  \sigma(\cdot) : the vectorized activation function, i.e. the activation function applied element-wise to its input vector

26 of 48

Some Assumptions

Two necessary assumptions on the cost function:

  1. It can be written as the sum (average) of cost functions applied over each training sample, i.e.

     C = (1/n) \sum_x C_x

  2. It can be written as a function of the output layer only, i.e.

     C = C(a^L)

27 of 48

Backpropagation: Key Idea

The derivative of the cost function w.r.t. a weight at any given layer is proportional to the activation of the input neuron times the error of the output neuron:

    \partial C / \partial w^l_{jk} = a^{l-1}_k \, \delta^l_j

Takeaways:

  • When the input neuron has a low activation, learning will be slow.
  • When the output neuron is near saturation, learning will be slow.

28 of 48

Backpropagation Algorithm in a Nutshell

  1. Forward pass: compute the weighted inputs and activations for each layer:

     z^l = w^l a^{l-1} + b^l,   a^l = \sigma(z^l)

  2. Backward pass (backpropagation): starting from the output layer, calculate the errors for each layer:

     \delta^L = \nabla_a C \odot \sigma'(z^L),   \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)

  3. Return the gradients and update the weights:

     \partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j,   \partial C / \partial b^l_j = \delta^l_j

A minimal NumPy sketch of these steps is given below.
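A minimal NumPy sketch of the forward and backward passes for a fully connected network, assuming sigmoid activations and a quadratic (MSE) cost for concreteness:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1 - s)

    def backprop(weights, biases, x, y):
        # Forward pass: store z^l and a^l for every layer
        a, activations, zs = x, [x], []
        for w, b in zip(weights, biases):
            z = w @ a + b
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)

        # Backward pass: delta^L = (a^L - y) * sigmoid'(z^L) for the MSE cost
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        grad_w = [None] * len(weights)
        grad_b = [None] * len(weights)
        grad_w[-1] = np.outer(delta, activations[-2])
        grad_b[-1] = delta
        for l in range(2, len(weights) + 1):
            delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            grad_w[-l] = np.outer(delta, activations[-l - 1])
            grad_b[-l] = delta
        return grad_w, grad_b          # gradients w.r.t. every weight and bias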

29 of 48

Backpropagation Relation to Gradient Descent?

30 of 48

Backpropagation and Gradient Descent

  • Gradient descent is the algorithm used to learn the weights (i.e. how we descend the mountain)
  • Backpropagation is the technique used to calculate the gradients (i.e. figuring out where our next step should go)

Image: https://miro.medium.com/max/1262/1*v0VYQkVnTfMF5ptEnvAGSA.jpeg

31 of 48

Techniques to Improve the Learning of Deep Neural Nets

Image:employee-performance.com/blog/how-effective-performance-management-can-increase-companys-success/

32 of 48

Importance of Activation Function

  • From the key backpropagation equation, we can see how the activation function affects learning:
    • When the input neuron has a low activation, learning will be slow.
    • When the output neuron is near saturation, learning will be slow.

An activation function whose output does not saturate easily and does not stay at very low values is therefore ideal.

33 of 48

Activation Functions

Binary

  • Pros: simple
  • Cons: always saturated (the gradient is zero almost everywhere)

Linear

  • Pros: never saturates
  • Cons: the gradient is always 1 (a constant)

Images are from https://v7labs.com/blog/neural-networks-activation-functions

34 of 48

Better Activation Functions

Sigmoid / Tanh

  • Pros: nonlinear and simple
  • Cons: saturates rapidly

ReLU (Rectified Linear Unit)

  • Pros: does not saturate easily (for positive inputs)
  • Cons: we don’t exactly know why it works so well in practice!!

Images are from https://v7labs.com/blog/neural-networks-activation-functions

35 of 48

Other Famous Activation Functions

Leaky ReLU

Parametric ReLU

Exponential Linear Unit (ELU)

Images are from https://v7labs.com/blog/neural-networks-activation-functions
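Simple NumPy definitions of these functions (standard formulas; the parameter names are chosen here for illustration):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def leaky_relu(z, slope=0.01):
        # small fixed slope for negative inputs instead of zero
        return np.where(z > 0, z, slope * z)

    def parametric_relu(z, alpha):
        # same shape as leaky ReLU, but alpha is learned during training
        return np.where(z > 0, z, alpha * z)

    def elu(z, alpha=1.0):
        # smooth exponential curve for negative inputs
        return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))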

36 of 48

Importance of Suitable Objective Function

  • Objective function should be meaningful to the problem
  • But more than that, it can also affect the speed of learning!
  • For instance, when the activation function is sigmoid and the loss function is MSE, learning is very slow (because of saturation); see the derivation sketched below
  • So we need to choose our loss function and activation function(s) so that they match each other
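For a single sigmoid neuron this can be seen directly (standard result; here a = \sigma(z) and z = wx + b):

  • With the MSE cost C = (a - y)^2 / 2:

      \partial C / \partial w = (a - y) \, \sigma'(z) \, x

    so when the neuron saturates (\sigma'(z) \approx 0) the gradient almost vanishes and learning stalls.

  • With the cross-entropy cost C = -[y \ln a + (1 - y) \ln(1 - a)], the \sigma'(z) factor cancels:

      \partial C / \partial w = (a - y) \, x

    and learning stays fast even when the neuron’s output is far off.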

37 of 48

Weight Initialization

  • When starting the training process, how shall we initialize the weights?
    • All zero? (No!)
    • Small, random Gaussian weights
    • Train shallow networks first, then use their weights to initialize the next layers
    • Xavier initialization: select weights uniformly from [-\sqrt{6}/\sqrt{n_in + n_out}, +\sqrt{6}/\sqrt{n_in + n_out}] (one common form; see the sketch below)
  • Or maybe just use a modified version of ReLU for the activation :)
  • See here for some more interesting ideas: http://deepdish.io/2015/02/24/network-initialization/
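A minimal NumPy sketch of a few common initialization schemes (standard formulas; n_in and n_out denote the fan-in and fan-out of the layer being initialized):

    import numpy as np

    def gaussian_init(n_in, n_out, scale=0.01):
        return scale * np.random.randn(n_out, n_in)        # small random Gaussian weights

    def xavier_uniform_init(n_in, n_out):
        limit = np.sqrt(6.0 / (n_in + n_out))              # Xavier/Glorot uniform range
        return np.random.uniform(-limit, limit, size=(n_out, n_in))

    def he_init(n_in, n_out):
        # variance scaled for ReLU-style activations (He et al.)
        return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)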

38 of 48

Importance of Weight Initialization

39 of 48

Regularization and Overfitting

  • Traditional regularizers on the weights in each layer (written out below)
    • L1-norm regularizer
    • L2-norm regularizer
  • Dropout: modify the network architecture!
  • Validation data: reserve part of the data during training to check for overfitting
  • Manually expanding the dataset: adding noise/rotation/etc.
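For reference, the standard forms (with \lambda the regularization strength, n the number of training samples, and C_0 the unregularized cost):

  • L2: C = C_0 + (\lambda / 2n) \sum_w w^2, so the gradient descent update becomes w \leftarrow (1 - \eta\lambda/n) w - \eta \, \partial C_0 / \partial w (the weights “decay” toward zero at every step).

  • L1: C = C_0 + (\lambda / n) \sum_w |w|, so the update becomes w \leftarrow w - (\eta\lambda/n) \, \mathrm{sgn}(w) - \eta \, \partial C_0 / \partial w (the weights shrink by a constant amount, driving many of them to exactly zero).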

40 of 48

Dropout

[Figure: a network in which some neurons have been randomly dropped during training]

  • Randomly “drop” (ignore) neurons when training

41 of 48

How Does “Dropout” Reduce Overfitting?

42 of 48

Dropout


  • Forces neurons to learn more robust features
  • Works somewhat like ensemble learning (a minimal sketch of dropout is given below)
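A minimal NumPy sketch of (inverted) dropout applied to a layer’s activations during training; at test time the layer is used unchanged:

    import numpy as np

    def dropout(activations, keep_prob=0.8):
        # Randomly keep each neuron with probability keep_prob, drop the rest
        mask = np.random.rand(*activations.shape) < keep_prob
        # Rescale so the expected activation stays the same as without dropout
        return activations * mask / keep_prob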

43 of 48

K-Fold Cross Validation

  • Measure/tune the performance of the model on unseen data (the validation set); a minimal sketch follows below

Image: https://towardsdatascience.com/cross-validation-k-fold-vs-monte-carlo-e54df2fc179b
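A minimal scikit-learn sketch of K-fold cross-validation; logistic regression on the iris dataset is used here only as a stand-in for whatever model you want to evaluate:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])               # train on K-1 folds
        scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))
    print(sum(scores) / len(scores))                        # average validation accuracy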

44 of 48

Expanding the Dataset

  • Add new data points by modifying current data points (see the sketch below)

Image:https://researchgate.net/publication/319413978/figure/fig2/AS:533727585333249@1504261980375/Data-augmentation-using-semantic-preserving-transformation-for-SBIR.png
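A minimal NumPy sketch of dataset expansion; horizontal flips and additive noise are just two simple examples of label-preserving modifications (images is assumed to be an array of shape (N, height, width)):

    import numpy as np

    def expand_dataset(images, noise_std=0.05):
        flipped = images[:, :, ::-1]                                  # mirror each image left-to-right
        noisy = images + noise_std * np.random.randn(*images.shape)  # add small Gaussian noise
        return np.concatenate([images, flipped, noisy], axis=0)      # three times as many samples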

45 of 48

Expanding the Dataset

  • Add new data points by modifying current data points

Image:https://medium.com/secure-and-private-ai-writing-challenge/data-augmentation-increases-accuracy-of-your-model-but-how-aa1913468722

46 of 48

Examples of Deep Neural Networks

Gradient Descent & Backpropagation

Performance improvement techniques

47 of 48

https://redbubble.com/i/sticker/data-scientist-deep-learning-joke-by-dataninja/53395629.EJUG5

48 of 48

ToDo List for Next Session

  • Check out the Google Colab notebook before our lab session: https://a4re.ir/lab10
  • Don’t forget the survey :) https://forms.gle/FTj4nQW2inAfuxys6
  • Submit your homework on neural networks by Friday 12 Ordibehesht, 23:59 Tehran Time