UNIT - 4

  • Recurrent Neural Networks: Backpropagation Through Time
  • Long Short-Term Memory
  • Gated Recurrent Units
  • Bidirectional LSTMs
  • Bidirectional RNNs
  • Convolutional Neural Networks: LeNet, AlexNet
  • Generative models: Restricted Boltzmann Machines (RBMs)
  • Introduction to MCMC and Gibbs Sampling
  • Gradient computations in RBMs
  • Deep Boltzmann Machines

Recurrent Neural Networks

  • A Recurrent Neural Network (RNN) is a deep learning architecture designed to process sequential or time-series data by retaining information from previous steps through a hidden state.
  • Common applications include:
  • Stock price prediction
  • Language translation
  • Sentiment analysis
  • Image captioning
  • Speech recognition
  • Time-series forecasting

Recurrent Neural Network (RNN) Cell Architecture

Types of Recurrent Neural Networks

RNNs are commonly grouped by how input and output sequences map to each other:

  • One-to-One: one input, one output (equivalent to a standard feed-forward network).
  • One-to-Many: one input, a sequence of outputs (e.g., image captioning).
  • Many-to-One: a sequence of inputs, one output (e.g., sentiment analysis).
  • Many-to-Many: a sequence of inputs, a sequence of outputs (e.g., language translation).

Mathematical Formulation of Recurrent Neural Networks

At each time step t, the RNN updates its hidden state from the current input and the previous hidden state, then produces an output:

  • Hidden state: hₜ = tanh(W_hh · hₜ₋₁ + W_xh · xₜ + b_h)
  • Output: yₜ = W_hy · hₜ + b_y

Here xₜ is the input at step t, hₜ is the hidden state, and the weight matrices W_hh, W_xh, W_hy (with biases b_h, b_y) are shared across all time steps.
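These update equations can be sketched in NumPy (the layer sizes and random initialization are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for the sketch)
input_size, hidden_size, output_size = 4, 8, 3

# Parameters are shared across every time step
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_forward(xs):
    """Run the RNN over a sequence xs, returning the last state and all outputs."""
    h = np.zeros(hidden_size)                    # initial hidden state h_0
    ys = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # hidden-state update
        ys.append(W_hy @ h + b_y)                # output at this step
    return h, ys

sequence = [rng.normal(size=input_size) for _ in range(5)]
h_final, outputs = rnn_forward(sequence)
```

Note how the same weight matrices are reused at every step: this parameter sharing is what makes the network "recurrent".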

Back Propagation Through Time (BPTT)

18 of 57

In BPTT, the network is unrolled across time steps and gradients are backpropagated through each of them. This is essential for updating the shared network parameters based on temporal dependencies.
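As a sketch in standard notation (hₜ the hidden state at step t, W_hh the shared recurrent weights, L = Σₜ Lₜ the total loss), the gradient with respect to W_hh sums contributions from every time step:

```latex
\frac{\partial L}{\partial W_{hh}}
  = \sum_{t=1}^{T} \frac{\partial L_t}{\partial h_t}
    \sum_{k=1}^{t}
    \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
    \frac{\partial h_k}{\partial W_{hh}}
```

The repeated product of Jacobians ∂hᵢ/∂hᵢ₋₁ is what makes gradients shrink or grow across many steps, which leads to the vanishing and exploding gradient problems.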

1. Vanishing Gradient Problem

Definition:

The Vanishing Gradient Problem occurs when gradients become very small (close to zero) during backpropagation.

Why it happens in RNN:

  • In RNNs, gradients are multiplied many times across time steps.
  • If the weights are small (<1), the gradients shrink repeatedly.

Effects:

  • Earlier layers (or earlier time steps) learn very slowly.
  • The network cannot remember long-term dependencies.
  • Training becomes ineffective.

Example:

If the gradient is 0.5 and it is multiplied through many steps: 0.5 × 0.5 × 0.5 × 0.5 → becomes almost 0

2. Exploding Gradient Problem

Definition:

The Exploding Gradient Problem occurs when gradients become very large during backpropagation.

Why it happens in RNN:

  • When weights are greater than 1, gradients grow exponentially during backpropagation.

Effects:

  • Model parameters become extremely large.
  • Training becomes unstable.
  • The network may produce NaN or infinite values.

Example:

If the gradient is 2 and it is multiplied repeatedly: 2 × 2 × 2 × 2 × 2 → becomes very large
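Both examples can be checked with a quick numeric sketch (plain Python, values illustrative):

```python
def propagate(gradient_per_step, steps):
    """Multiply a gradient through `steps` time steps, as BPTT effectively does."""
    value = 1.0
    for _ in range(steps):
        value *= gradient_per_step
    return value

vanishing = propagate(0.5, 30)   # weights < 1: shrinks toward 0
exploding = propagate(2.0, 30)   # weights > 1: grows exponentially
```

Over 30 steps the first value is already below 10⁻⁸ while the second exceeds 10⁸, which is why long sequences make plain RNNs hard to train.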

Variants of Recurrent Neural Networks (RNNs)

  • Long Short-Term Memory
  • Gated Recurrent Units
  • Bidirectional LSTMs
  • Bidirectional RNNs.

Long Short-Term Memory

  • Long Short-Term Memory (LSTM) is a special type of Recurrent Neural Network (RNN) designed to learn long-term dependencies in sequential data.
  • Traditional RNNs suffer from the vanishing gradient problem, which makes it difficult to remember information for long time steps.
  • LSTM overcomes this issue by introducing a memory cell and gating mechanisms that control the flow of information.

LSTM Architecture

The LSTM architecture involves a memory cell controlled by three gates:

  1. Forget gate: Determines what information is removed from the memory cell.
  2. Input gate: Controls what information is added to the memory cell.
  3. Output gate: Controls what information is output from the memory cell.

  • This allows LSTM networks to selectively retain or discard information as it flows through the network, which enables them to learn long-term dependencies.
  • The network has a hidden state which is like its short-term memory.
  • This memory is updated using the current input, the previous hidden state and the current state of the memory cell.

1. Forget Gate

The forget gate decides how much of the previous cell state Cₜ₋₁ is kept:

  • fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)

2. Input Gate

The input gate controls what new information is added to the memory cell:

  • The equation for the input gate is: iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)
  • Candidate values: C̃ₜ = tanh(W_C · [hₜ₋₁, xₜ] + b_C)
  • Cell-state update: Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ

3. Output Gate

The output gate controls what information from the memory cell is output as the hidden state:

  • oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)
  • hₜ = oₜ ⊙ tanh(Cₜ)
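The three gates can be sketched as a single cell update in NumPy (the sizes, random weights, and the omission of bias terms are simplifications for the sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update: forget, input, and output gates (biases omitted)."""
    W_f, W_i, W_c, W_o = params          # each maps [h_prev, x] to hidden size
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx)                # forget gate: what to erase from memory
    i = sigmoid(W_i @ hx)                # input gate: what to write to memory
    c_cand = np.tanh(W_c @ hx)           # candidate cell values
    c = f * c_prev + i * c_cand          # memory-cell update
    o = sigmoid(W_o @ hx)                # output gate: what to expose
    h = o * np.tanh(c)                   # new hidden state (short-term memory)
    return h, c

rng = np.random.default_rng(1)
hidden, inp = 5, 2
params = tuple(rng.normal(scale=0.1, size=(hidden, hidden + inp)) for _ in range(4))
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), params)
```

Because the cell state c is updated additively (f ⊙ Cₜ₋₁ + i ⊙ C̃ₜ) rather than through repeated matrix multiplication, gradients can flow across many steps without vanishing as quickly.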

Applications of LSTM

  • LSTM is widely used in many real-world applications such as:
  • Natural Language Processing (NLP)
  • Speech Recognition
  • Machine Translation
  • Time Series Prediction
  • Handwriting Recognition

Gated Recurrent Units

  • Gated Recurrent Unit (GRU) is a type of Recurrent Neural Network (RNN) introduced to improve the performance of traditional RNNs and solve the vanishing gradient problem.
  • GRU is similar to Long Short-Term Memory (LSTM) but has a simpler architecture with fewer gates, which makes it faster to train.
  • GRU combines the cell state and hidden state into a single state, reducing complexity while still capturing long-term dependencies in sequential data.

The GRU consists of two main gates:

1. Update Gate (zₜ):

  • Decides how much of the previous hidden state is carried forward: zₜ = σ(W_z · [hₜ₋₁, xₜ] + b_z)

2. Reset Gate (rₜ):

  • Decides how much of the previous hidden state to forget when forming the candidate: rₜ = σ(W_r · [hₜ₋₁, xₜ] + b_r)

3. Candidate Hidden State

  • h̃ₜ = tanh(W_h · [rₜ ⊙ hₜ₋₁, xₜ] + b_h)

4. Final Hidden State

  • hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ
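A single GRU update can be sketched in NumPy (sizes, random weights, and the omission of biases are simplifications for the sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU update: update gate, reset gate, candidate state (biases omitted)."""
    W_z, W_r, W_h = params                         # each maps [h_prev, x] to hidden size
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                          # update gate
    r = sigmoid(W_r @ hx)                          # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_cand           # final hidden state

rng = np.random.default_rng(2)
hidden, inp = 4, 3
params = tuple(rng.normal(scale=0.1, size=(hidden, hidden + inp)) for _ in range(3))
h = gru_step(rng.normal(size=inp), np.zeros(hidden), params)
```

Unlike the LSTM sketch, there is no separate cell state and only three weight matrices instead of four, which is where the GRU's speed advantage comes from.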

Difference Between LSTM and GRU

  • Gates: LSTM uses three gates (forget, input, output); GRU uses two (update, reset).
  • State: LSTM keeps a separate cell state and hidden state; GRU combines them into a single hidden state.
  • Complexity: GRU has fewer parameters, making it simpler and faster to train; LSTM's extra gate and cell state can help on some tasks.

Bidirectional LSTMs

  • Bidirectional Long Short-Term Memory (BiLSTM) is an extension of the traditional LSTM network.
  • Unlike a conventional LSTM, which processes sequences in only one direction, a BiLSTM lets information flow both forward and backward, enabling it to capture more contextual information.
  • This makes BiLSTMs particularly effective for tasks where understanding both past and future context is crucial.

A Bidirectional LSTM (BiLSTM) consists of two separate LSTM layers:

  • Forward LSTM: Processes the sequence from start to end
  • Backward LSTM: Processes the sequence from end to start
  • The outputs of both LSTMs are then combined to form the final output.

Mathematically, the final output at time t is computed by combining (most commonly concatenating) the forward and backward hidden states:

  • hₜ = [hₜᶠ ; hₜᵇ], where hₜᶠ and hₜᵇ are the forward and backward hidden states
  • yₜ = W_y · hₜ + b_y

Bidirectional RNNs

  • Bidirectional Recurrent Neural Networks (BRNNs) are an extension of traditional RNNs designed to process sequential data in both forward and backward directions.
  • This architecture allows the model to utilize both past and future context, making it particularly effective for tasks where understanding the entire sequence is crucial.

  • Like a traditional RNN, a BRNN moves forward through the sequence, updating the hidden state based on the current input and the prior hidden state at each time step.
  • The key difference is that a BRNN also has a backward hidden layer which processes the sequence in reverse, updating the hidden state based on the current input and the hidden state of the next time step.
  • Compared to unidirectional RNNs, BRNNs improve accuracy by considering both past and future context.
  • This is because the two hidden layers (forward and backward) complement each other, and predictions are made using the combined outputs of both layers.

Example:

Consider the sentence: "I like apple. It is very healthy."

In a traditional unidirectional RNN the network might struggle to understand whether "apple" refers to the fruit or the company based on the first sentence. However, a BRNN would have no such issue.

By processing the sentence in both directions, it can easily understand that "apple" refers to the fruit, thanks to the future context provided by the second sentence ("It is very healthy.").

1. Inputting a Sequence: A sequence of data points, each represented as a vector of the same dimensionality, is fed into the BRNN. The sequence may have varying lengths.

2. Dual Processing: BRNNs process data in two directions:

  • Forward direction: The hidden state at each time step is determined by the current input and the previous hidden state.
  • Backward direction: The hidden state at each time step is influenced by the current input and the next hidden state.

3. Computing the Hidden State: A non-linear activation function is applied to the weighted sum of the input and the previous hidden state, creating a memory mechanism that allows the network to retain information from earlier steps.

4. Determining the Output: A non-linear activation function is applied to the weighted sum of the hidden state and output weights to compute the output at each step. This output can either be:

  • The final output of the network.
  • An input to another layer for further processing.
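The dual processing described in steps 1–4 can be sketched with two simple RNN passes (parameter values and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
input_size, hidden_size = 3, 5

def make_params():
    """Independent parameters for one direction (illustrative scales)."""
    return (rng.normal(scale=0.1, size=(hidden_size, input_size)),
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

forward_params, backward_params = make_params(), make_params()

def run_direction(params, xs):
    """One unidirectional pass, collecting the hidden state at each step."""
    W_xh, W_hh, b = params
    h = np.zeros(hidden_size)
    states = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b)
        states.append(h)
    return states

xs = [rng.normal(size=input_size) for _ in range(4)]
h_fwd = run_direction(forward_params, xs)               # start -> end
h_bwd = run_direction(backward_params, xs[::-1])[::-1]  # end -> start, re-aligned
combined = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```

At every time step the combined vector sees the whole sequence: the forward half summarizes the past, the backward half the future.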

Convolutional Neural Network (CNN)

  • A Convolutional Neural Network (CNN) is a deep learning architecture designed for processing grid-like data such as images, videos, or audio spectrograms.
  • It uses convolution layers, activation functions, pooling layers, and fully connected layers to extract and learn hierarchical features from images.
  • CNN automatically detects patterns like edges, textures, and shapes. Due to its ability to learn spatial features, CNN is widely used in image classification, object detection, facial recognition, and medical image analysis.
  • Popular CNN models include LeNet and AlexNet.

CNNs are composed of two main parts:

  • Feature extractor: Stacks of convolutional, activation, and pooling layers that learn hierarchical features.
  • Classifier: Fully connected layers that map extracted features to output classes.

Components:

  • Convolutional Layer: Applies learnable filters (kernels) over the input to produce feature maps. Parameters: filter size (e.g., 3×3), stride, padding. Uses parameter sharing to reduce complexity. Output depth = number of filters.
  • Activation Function: Introduces non-linearity (commonly ReLU: max(0, x)).
  • Pooling Layer: Downsamples feature maps to reduce spatial dimensions and overfitting risk. Max pooling (most common) or average pooling.
  • Flattening: Converts 2D feature maps into a 1D vector for dense layers.
  • Fully Connected Layers: Perform classification or regression. Often end with SoftMax for multi-class probability outputs.
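The convolution, ReLU, and pooling components can be sketched directly in NumPy (the image and kernel values are illustrative):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a 2-D kernel over a 2-D image (cross-correlation, as in CNN layers)."""
    if padding:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # one feature-map entry
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling to downsample the feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)       # toy 4x4 "image"
kernel = np.array([[-1.0, -1.0], [1.0, 1.0]])          # top-to-bottom edge filter
feature_map = np.maximum(0.0, conv2d(image, kernel))   # convolution + ReLU
pooled = max_pool(feature_map)                         # pooled feature map
```

A 2×2 kernel over a 4×4 image with stride 1 and no padding yields a 3×3 feature map, matching the output-size formula (input − filter) / stride + 1.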

Applications

  1. Image classification (e.g., ImageNet).
  2. Object detection (YOLO, Faster R-CNN).
  3. Medical imaging (tumour detection).
  4. Autonomous driving (lane detection).
  5. Video analysis and facial recognition.