UNIT 5

  • Recent Trends: Variational Autoencoders
  • Transformers
  • GPT Applications: Vision, NLP, Speech

Variational Autoencoders

  • Variational Autoencoders (VAEs) are generative models that learn a smooth, probabilistic latent space, allowing them not only to compress and reconstruct data but also to generate entirely new, realistic samples. VAEs capture the underlying structure of a dataset and produce outputs that closely resemble the original data.
  • Learns a continuous latent representation
  • Enables controlled and meaningful data generation
  • Widely used in image synthesis, anomaly detection, and representation learning
  • Represents each input by a mean and variance in the latent space rather than by a single point

Architecture of Variational Autoencoder

Example

A VAE is a special kind of autoencoder that can generate new data instead of just compressing and reconstructing it.

  • It has three main parts:

1. Encoder (Understanding the Input)

2. Latent Space (Adding Some Randomness)

3. Decoder (Reconstructing or Creating New Data)

1. Encoder (Understanding the Input)

The encoder takes input data like images or text and learns its key features. Instead of outputting one fixed value, it produces two vectors, each with one entry per latent dimension:

  • Mean (μ): A central value representing the data.
  • Standard Deviation (σ): A measure of how much the values can vary around the mean.

Together, μ and σ define a distribution of possible latent values instead of a single point.
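
The encoder's two outputs can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the encoder is a single linear layer, the weights `w_mu` and `w_logvar` are random placeholders rather than trained values, and the network predicts log-variance (a common choice that keeps σ positive) instead of σ directly.

```python
import math
import random

def encode(x, w_mu, w_logvar):
    """Toy linear encoder: returns one mean and one standard deviation
    per latent dimension. Predicting log-variance keeps sigma positive."""
    mu = [sum(w * xi for w, xi in zip(row, x)) for row in w_mu]
    logvar = [sum(w * xi for w, xi in zip(row, x)) for row in w_logvar]
    sigma = [math.exp(0.5 * lv) for lv in logvar]
    return mu, sigma

rng = random.Random(0)
x = [0.2, 0.7, 0.1, 0.9]          # a 4-feature input (hypothetical data)
latent_dim = 2

# Random placeholder weights; a real VAE would learn these during training.
w_mu = [[rng.uniform(-1, 1) for _ in x] for _ in range(latent_dim)]
w_logvar = [[rng.uniform(-1, 1) for _ in x] for _ in range(latent_dim)]

mu, sigma = encode(x, w_mu, w_logvar)
print("mu:", mu)
print("sigma:", sigma)
```

Each latent dimension gets its own (μ, σ) pair, so a 2-dimensional latent space yields two means and two standard deviations for every input.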

2. Latent Space (Adding Some Randomness)

  • Instead of encoding the input as one fixed point, the VAE picks a random point within the range given by the mean and standard deviation.
  • This randomness lets the model create slightly different versions of the data, which is useful for generating new, realistic samples.

z ∼ N(μ, σ²)
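
One common way to draw this sample is to take standard Gaussian noise ε and compute z = μ + σ·ε, often called the reparameterization trick. The slides do not name this step, so treat the sketch below as one possible realization; the values of `mu` and `sigma` are made up for the example.

```python
import random
import statistics

def sample_latent(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(42)
mu, sigma = [1.0, -2.0], [0.5, 0.1]   # hypothetical encoder outputs

# Many draws from the same (mu, sigma) scatter around mu with spread sigma.
samples = [sample_latent(mu, sigma, rng) for _ in range(5000)]
first_dim = [s[0] for s in samples]
mean_0 = statistics.fmean(first_dim)
std_0 = statistics.pstdev(first_dim)
print(round(mean_0, 2), round(std_0, 2))
```

Averaged over many draws, the empirical mean and spread of the first latent dimension should land close to the chosen μ = 1.0 and σ = 0.5.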

3. Decoder (Reconstructing or Creating New Data)

  • The decoder takes the random sample from the latent space and tries to reconstruct the original input.
  • Since the encoder gives a range, the decoder can produce new data that is similar but not identical to what it has seen.
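
The "similar but not identical" behaviour can be seen in a small sketch. It assumes a toy linear decoder with a sigmoid output and random placeholder weights (`w_dec`); a real decoder would be a trained neural network.

```python
import math
import random

def decode(z, w_dec):
    """Toy linear decoder; the sigmoid keeps every output in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-sum(w * zi for w, zi in zip(row, z))))
            for row in w_dec]

rng = random.Random(1)
# Placeholder weights mapping 2 latent dimensions to 4 output features.
w_dec = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]

mu, sigma = [0.3, -0.8], [0.2, 0.2]   # hypothetical encoder outputs

# Two different samples drawn from the same latent distribution.
z1 = [m + s * rng.gauss(0, 1) for m, s in zip(mu, sigma)]
z2 = [m + s * rng.gauss(0, 1) for m, s in zip(mu, sigma)]

out1, out2 = decode(z1, w_dec), decode(z2, w_dec)
print(out1)
print(out2)
```

Because `z1` and `z2` come from the same distribution, the two decoded outputs are close to each other without being equal.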

Transformers

  • A Transformer is a deep learning architecture designed to handle sequential data such as text. It uses a mechanism called Self-Attention to understand relationships between words in a sentence.
  • Introduced in the 2017 paper "Attention Is All You Need", Transformers have become the backbone of state-of-the-art models like BERT, GPT, and Vision Transformers.
  • They excel at capturing long-range dependencies and processing sequences in parallel, making them faster to train and more scalable than RNNs or LSTMs.

Example

Components of Transformer

  • Self-Attention Mechanism
  • Feed-Forward Neural Network
  • Multi-Head Attention
  • Positional Encoding

Architecture of Transformer

The Transformer model consists of two main components:

1. Encoder

  • Takes input sequence (sentence)
  • Converts words into vector representations (embeddings)
  • Uses self-attention and feed-forward layers
  • Captures context of each word

The encoder consists of multiple layers, and each layer is composed of two main sub-layers:

  1. Self-Attention Mechanism
  2. Feed-Forward Neural Network

2. Decoder

  • Generates output sequence
  • Uses the encoder output together with its own (masked) self-attention
  • Used in tasks like translation and text generation

Each decoder layer consists of three main sub-layers:

  1. Masked Self-Attention Mechanism
  2. Encoder-Decoder Attention Mechanism
  3. Feed-Forward Neural Network

How Transformers Work

1. Input Representation

The first step in processing input data involves converting raw text into a format that the transformer model can understand. This involves tokenization and embedding.

  • Tokenization: The input text is split into smaller units called tokens, which can be words, subwords or characters. Tokenization ensures that the text is broken down into manageable pieces.
  • Embedding: Each token is then converted into a fixed-size vector using an embedding layer. This layer maps each token to a dense vector representation that captures its semantic meaning.
  • Positional Encoding: Positional encodings are added to these embeddings to provide information about each token's position within the sequence.
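
The three steps above can be sketched end to end. The assumptions are: word-level tokenization by whitespace (the simplest of the options listed), a randomly initialised embedding table standing in for a learned one, and the sinusoidal positional encoding from the original Transformer paper. The sentence, `d_model`, and all variable names are illustrative.

```python
import math
import random

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# 1. Tokenization: split on whitespace and map each token to an integer id.
text = "the cat sat on the mat"
tokens = text.split()
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# 2. Embedding: look up one fixed-size vector per token id.
d_model = 8
rng = random.Random(0)
embedding_table = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in vocab]
embedded = [embedding_table[i] for i in token_ids]

# 3. Positional encoding: add position information element-wise.
pe = positional_encoding(len(tokens), d_model)
model_input = [[e + p for e, p in zip(emb, pos)]
               for emb, pos in zip(embedded, pe)]
print(token_ids)
```

The word "the" appears at positions 0 and 4 with the same embedding, but after the positional encodings are added the two input vectors differ, which is how the model can tell the occurrences apart.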

2. Encoder Process in Transformers

  • Input Embedding: The input sequence is tokenized and converted into embeddings with positional encodings added.
  • Self-Attention Mechanism: Each token in the input sequence attends to every token in the sequence (including itself) to capture dependencies and contextual information.
  • Feed-Forward Network: The output from the self-attention mechanism is passed through a position-wise feed-forward network.
  • Layer Normalization and Residual Connections: A residual connection followed by layer normalization is applied around each sub-layer (self-attention and feed-forward).
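
A single encoder layer can be sketched as follows. This is a simplified illustration, not a full implementation: it uses one attention head instead of several, omits biases and the learned scale and shift in layer normalization, and fills the weight matrices with random placeholder values. All names (`encoder_layer`, `params`, `d_ff`) are chosen for the example.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])   # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per token."""
    return [layer_norm([a + b for a, b in zip(r, s)])
            for r, s in zip(x, sublayer_out)]

def encoder_layer(x, p):
    attn, weights = self_attention(x, p["wq"], p["wk"], p["wv"])
    x = add_and_norm(x, attn)
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, p["w1"])]  # ReLU
    x = add_and_norm(x, matmul(hidden, p["w2"]))
    return x, weights

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

d_model, d_ff, seq_len = 4, 8, 3
params = {"wq": rand(d_model, d_model), "wk": rand(d_model, d_model),
          "wv": rand(d_model, d_model), "w1": rand(d_model, d_ff),
          "w2": rand(d_ff, d_model)}
x = rand(seq_len, d_model)            # stands in for embeddings + positions
out, weights = encoder_layer(x, params)
```

Each row of `weights` is a probability distribution over the input tokens, and the output keeps the input's shape (one `d_model`-sized vector per token), which is what allows several such layers to be stacked.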

3. Decoder Process

  • Input Embedding and Positional Encoding: The partially generated output sequence is tokenized and embedded, with positional encodings added.
  • Masked Self-Attention Mechanism: The decoder uses masked self-attention to prevent attending to future tokens, ensuring that the model generates the sequence step-by-step.
  • Encoder-Decoder Attention Mechanism: The decoder attends to the encoder's output, allowing it to focus on relevant parts of the input sequence.
  • Feed-Forward Network: As in the encoder, the output from the attention mechanisms is passed through a position-wise feed-forward network.
  • Layer Normalization and Residual Connections: As in the encoder, layer normalization and residual connections are applied around each sub-layer.
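
The masking step is the part that differs most from the encoder, so here is a minimal sketch of it. It assumes the raw attention scores have already been computed (the `scores` matrix below is made up) and shows only how future positions are blocked before the softmax.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Set scores for future positions (j > i) to -inf, then softmax each row."""
    masked = [[s if j <= i else float("-inf") for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    return [softmax(row) for row in masked]

# Hypothetical raw attention scores for a 3-token output sequence.
scores = [[0.5, 1.2, -0.3],
          [0.1, 0.4, 2.0],
          [1.0, 0.0, 0.7]]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

A score of negative infinity becomes a weight of exactly zero after the softmax, so token 0 can attend only to itself, token 1 to tokens 0 and 1, and so on.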

4. Training and Inference

Training

  • Uses Teacher Forcing (the correct previous tokens are given as decoder input)
  • Predicts the next token based on the actual previous tokens
  • Uses a loss function (e.g., cross-entropy) for learning
  • Faster training because all positions are processed in parallel
  • Based on the "Attention Is All You Need" architecture
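
Teacher forcing and the cross-entropy loss can be shown with a small worked example. The token ids, the 4-token vocabulary, and the `predicted` distributions are all invented for illustration; in a real model the distributions would come from the decoder's softmax output.

```python
import math

def cross_entropy(probs, target_id):
    """Negative log-probability the model assigned to the correct token."""
    return -math.log(probs[target_id])

BOS = 0                       # hypothetical start-of-sequence token id
target = [3, 1, 2]            # ground-truth output token ids

# Teacher forcing: the decoder input is the target shifted right by one,
# so each position sees the correct previous tokens, not model predictions.
decoder_input = [BOS] + target[:-1]

# Made-up model outputs: one distribution over the vocabulary per position.
predicted = [
    [0.10, 0.10, 0.10, 0.70],   # position 0, correct token is 3
    [0.20, 0.60, 0.10, 0.10],   # position 1, correct token is 1
    [0.25, 0.25, 0.25, 0.25],   # position 2, correct token is 2
]

losses = [cross_entropy(p, t) for p, t in zip(predicted, target)]
mean_loss = sum(losses) / len(losses)
print(decoder_input, round(mean_loss, 3))
```

Since every decoder input is known in advance, the loss at all positions can be computed in one pass, which is where the parallel training speed-up comes from.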

Inference

  • No teacher forcing (the model uses its own predictions)
  • Generates output token by token (step-by-step)
  • Uses decoding techniques such as Greedy Search and Beam Search
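
The difference between the two decoding strategies can be sketched with a toy model. As a simplifying assumption, the next-token probabilities in `NEXT` depend only on the last token; a real Transformer conditions on the whole generated prefix. The vocabulary and probabilities are invented for the example.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<bos>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.3, "b": 0.3, "<eos>": 0.4},
    "b": {"<eos>": 0.9, "a": 0.1},
}

def greedy_decode(max_len=5):
    """Always pick the single most likely next token."""
    seq = ["<bos>"]
    while seq[-1] != "<eos>" and len(seq) <= max_len:
        probs = NEXT[seq[-1]]
        seq.append(max(probs, key=probs.get))
    return seq[1:]

def beam_search(width=2, max_len=5):
    """Keep the `width` best partial sequences by total log-probability."""
    beams = [(0.0, ["<bos>"])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "<eos>":
                candidates.append((score, seq))   # finished beams carry over
                continue
            for tok, p in NEXT[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
        if all(seq[-1] == "<eos>" for _, seq in beams):
            break
    return beams[0][1][1:]

print(greedy_decode())   # ['a', '<eos>'], sequence probability 0.6 * 0.4 = 0.24
print(beam_search())     # ['b', '<eos>'], sequence probability 0.4 * 0.9 = 0.36
```

Greedy search commits to "a" because it is the best first step, while beam search keeps "b" alive long enough to find the sequence with the higher overall probability.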

Applications

  • Machine Translation
  • Chatbots (e.g., ChatGPT)
  • Text Summarization
  • Question Answering
  • Speech Recognition