1 of 32

Transformer Architectures

Marc Ratkovic

Chair of Social Data Science

Professor of Political Science and Data Science

University of Mannheim

2 of 32

Agenda

Introduction to Transformers
Self-Attention Mechanism
Mathematics Behind Self-Attention
Transformer Architecture
Transformer Blocks
Bidirectional vs. Causal Models
ADAM Optimizer
Low-Rank Adaptation (LoRA)
Fine-Tuning of Bidirectional Models

3 of 32

Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-Off

Reconciling classical theory with modern machine learning practice addresses the disconnect between the textbook bias-variance trade-off and the methods that work well today.

This research highlights the emergence of a 'double-descent' risk curve, which challenges the classical U-shaped bias-variance trade-off.

The findings show how powerful modern classifiers, which fit the training data exactly, can still generalize well to unseen data.

Understanding this relationship informs the design and analysis of learning algorithms and guides practitioners in model selection.

4 of 32

Classical vs. Modern Practice

Classical Machine Learning Practice

Focuses on the bias-variance trade-off, which calls for a balance between underfitting and overfitting.
Employs simpler models to avoid fitting spurious patterns; test risk follows a U-shaped curve as model capacity grows.
Selects models that keep empirical (training) risk low without letting true (test) risk rise.

Modern Machine Learning Practice

Uses rich models, such as neural networks, that are trained to fit the data exactly, often interpolating the training points.
Such models reach zero training error and would classically be deemed overfitted, yet they often achieve high accuracy on test data.
These practices challenge traditional views: large function classes can achieve lower test risk despite fitting the training data perfectly.
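For reference, the two risks contrasted on this slide can be written as follows. The notation is an assumption for illustration and does not come from the slides: h is a predictor, \ell a loss function, \mathcal{D} the data distribution, and (x_i, y_i), i = 1, ..., n, the training sample.

```latex
R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[ \ell(h(x), y) \right],
\qquad
\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)
```

In this notation, a predictor interpolates the training data when \hat{R}_n(h) = 0. The classical concern is that R(h) may then be large, while the modern observation is that it often is not.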

5 of 32

Double-Descent Risk Curve

Understanding the Double-Descent Phenomenon

The double-descent risk curve reconciles classical and modern machine learning practices, extending the traditional U-shaped bias-variance trade-off curve.
It illustrates how increasing model capacity beyond the interpolation threshold can lead to improved performance, showing a second descent in risk.
Classically, the bias-variance trade-off suggests that a model should balance between underfitting and overfitting, but modern practices challenge this notion by demonstrating high accuracy even with zero training error.
The double-descent curve captures this behavior, indicating that richer models, like neural networks, can perform well despite fitting training data perfectly.

8 of 32

Empirical Evidence

Double-descent behavior is observed across a variety of models and datasets, which suggests that it is widespread in machine learning applications.

Increasing model capacity beyond the interpolation threshold can improve generalization performance, contrary to classical expectations.

For instance, neural networks and kernel machines trained to interpolate the training data can yield near-optimal test results even when the training data contain high levels of noise.

9 of 32

Neural Networks and Double-Descent

Understanding Double-Descent in Neural Networks

When function class capacity is below the "interpolation threshold," learned predictors exhibit the classical U-shaped curve.
Predictors to the right of the interpolation threshold achieve zero training risk, yet they can have improved test performance due to increased function class capacity.
Increasing the function class capacity beyond the interpolation point leads to decreasing risk, often below that achieved at the sweet spot in the classical regime.
Neural networks trained to interpolate the training data can obtain near-optimal test results, even when training data is corrupted with high levels of noise.

10 of 32

Implications for Machine Learning Theory

The double-descent curve shows that increasing model capacity beyond the interpolation point can improve performance, challenging classical views of overfitting.

This suggests that conventional wisdom about the bias-variance trade-off needs revision, particularly for model selection.

Practitioners should consider the inductive biases of learning algorithms, since richer function classes can yield better test performance even when they fit the training data perfectly.

This understanding may guide future research on the interplay between model complexity and generalization.

12 of 32

Attention Mechanism

13 of 32

Introduction to Transformers

Transformers in Sequence-to-Sequence Tasks

Transformers utilize self-attention mechanisms to efficiently process sequences, allowing the model to consider the entire context at once.
This innovation enables parallel processing of data, significantly speeding up training times compared to sequential models.
Traditional models, like RNNs, must process data sequentially, which limits their ability to leverage context from distant positions in the sequence.
The self-attention mechanism allows transformers to weigh the importance of each part of the input sequence dynamically, enhancing their understanding of context.

14 of 32

Self-Attention Mechanism

Response Calculation

Self-attention computes the response at a given position in a sequence by attending to every position in the sequence, including the position itself. This is achieved by taking a weighted average over all positions, allowing the model to focus on relevant context.

Weighted Average

The response for each position is a weighted average of value vectors derived from the input representations. The weights reflect the importance of the other positions relative to the current position, enhancing the model's contextual understanding.

Weights: Q, K, V

Three learned weight matrices project each input representation into a Query (Q), a Key (K), and a Value (V). The projection matrices are learned during training. Queries and Keys together produce the attention scores that direct the model's focus, and Values are what gets averaged.

15 of 32

Mathematics Behind Self-Attention

Self-Attention Formula

The formula for self-attention is given by:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Definitions

- Q represents the Query matrix.
- K represents the Key matrix.
- V represents the Value matrix.
- d_k is the dimensionality of the key vectors. Dividing by \sqrt{d_k} keeps the scores from growing with the key dimension, which would otherwise push the softmax toward nearly one-hot weights.

Role of Keys

Keys act as a reference for the Queries. They help to identify which parts of the input should be focused on.

Role of Values

Values carry the actual information that is being attended to. The final output is a weighted combination of the Values.

Role of Queries

Queries are compared against the Keys. The similarity between a Query and each Key determines the attention weights.

Importance of Softmax

The softmax function converts each row of attention scores into non-negative weights that sum to one, so they can be read as a probability distribution over positions.
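A minimal sketch of the formula above in plain Python may help make the steps concrete. It is an illustration only: the helper names `softmax` and `attention` and the toy matrices are assumptions, and real implementations use batched tensor operations and learned projections to produce Q, K, and V.

```python
import math

def softmax(row):
    # Subtract the maximum for numerical stability; the weights are unchanged.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy example: 3 positions, d_k = 2 (numbers chosen for illustration only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = attention(Q, K, V)
```

Each row of `attn` sums to one, and each row of `out` is a convex combination of the rows of `V`, which is the weighted average described above.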

18 of 32

Transformer Architecture

Encoder and Decoder

The original Transformer consists of two main components: the encoder and the decoder. The encoder processes the input sequence, while the decoder generates the output sequence. Later models often keep only one of the two: BERT uses only the encoder, and GPT uses only the decoder.

Multi-Head Self-Attention

Each layer includes a multi-head self-attention mechanism. Several attention heads run in parallel, so the model can focus on different parts of the input sequence simultaneously and capture different contextual relationships.

Feed-Forward Networks

Following the self-attention mechanism, each layer contains a position-wise fully connected feed-forward network that applies the same transformation independently to each position.

Normalization and Residual Connections

Layer normalization is applied to stabilize the learning process, and residual connections allow gradients to flow more easily during training, which helps convergence.
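The feed-forward, residual, and normalization steps can be sketched for a single position as follows. This is a simplified illustration under stated assumptions: the learnable gain and bias of layer normalization are omitted, normalization is applied after the residual sum ("post-norm" ordering), and the function names and toy weights are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one position's vector to zero mean and (approximately) unit variance.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def feed_forward(x, W1, b1, W2, b2):
    # Two-layer network with ReLU: W2 * relu(W1 * x + b1) + b2.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def ffn_sublayer(x, W1, b1, W2, b2):
    # Residual connection (x + f(x)) followed by layer normalization.
    y = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Toy weights: model dimension 2, hidden dimension 3 (illustrative values only).
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
b2 = [0.0, 0.0]

# The same weights are applied independently to every position in the sequence.
sequence = [[1.0, -2.0], [0.5, 0.5]]
outputs = [ffn_sublayer(x, W1, b1, W2, b2) for x in sequence]
```

Because the same `W1` and `W2` are reused at every position, the list comprehension over `sequence` could run in parallel, which is the sense in which the network is "position-wise".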

19 of 32

Transformer Blocks

Definition of Transformer Blocks

Transformer blocks are the fundamental building units within the encoder and decoder of a transformer model. Each block contains layers that apply self-attention and feed-forward neural networks.

Role in Encoder and Decoder

In the encoder, blocks process input sequences to create a rich representation. In the decoder, blocks use these representations to generate outputs, incorporating attention to previously generated tokens.

Independent Processing

Within a block, all positions of the sequence are processed in parallel rather than one after another, so the model can learn complex dependencies across the entire sequence without the constraints of sequential processing. The blocks themselves are stacked: each block takes the output of the previous block as its input.

22 of 32

A rather amazing tutorial

https://poloclub.github.io/transformer-explainer/

23 of 32

Bidirectional vs. Causal Models

Bidirectional Models (e.g., BERT)

Process data from both directions, allowing context to be derived from surrounding words.
Utilize the entire sequence for understanding, making them effective for tasks like sentiment analysis.
Suitable for tasks where context from both ends is crucial, such as question answering or language inference.

Causal Models (e.g., GPT)

Process data in a unidirectional manner, from past to future, reflecting a natural progression of language.
Focus on generating text, making them ideal for applications like text completion and chatbots.
Limitations in understanding future context can affect performance in tasks that require full context.
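In attention terms, the difference between the two model families can be expressed as a mask over which positions each position may attend to. The sketch below is an illustration under simplifying assumptions: all raw scores are set equal so that only the mask shapes the weights, and the function names are hypothetical.

```python
import math

def masked_softmax(scores, allowed):
    # Positions that may not be attended to receive exactly zero weight.
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    n = len(scores)
    rows = []
    for i in range(n):
        # Causal: position i sees positions 0..i only.
        # Bidirectional: position i sees every position.
        allowed = [(j <= i) if causal else True for j in range(n)]
        rows.append(masked_softmax(scores[i], allowed))
    return rows

# Equal raw scores for a sequence of 4 tokens.
scores = [[0.0] * 4 for _ in range(4)]
bert_style = attention_weights(scores, causal=False)
gpt_style = attention_weights(scores, causal=True)
```

With equal scores, every row of `bert_style` spreads its weight over all four positions, while row i of `gpt_style` spreads its weight over positions 0 to i only and puts zero weight on later positions.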

26 of 32

Low-Rank Adaptation (LoRA)

Concept of LoRA

Low-Rank Adaptation (LoRA) is a technique for reducing the number of trainable parameters when fine-tuning large machine learning models. The pre-trained weight matrices are frozen, and the change to each adapted matrix is approximated by a product of two much smaller matrices.

Parameter Reduction

For a d × k weight matrix W, LoRA learns an update BA, where B is d × r, A is r × k, and the rank r is much smaller than d and k. Only r(d + k) parameters per adapted matrix need to be optimized and stored, instead of dk. This reduction helps in managing computational resources more effectively.

Benefits of LoRA

The primary benefits of LoRA are enhanced memory efficiency during training and small task-specific checkpoints. Fewer gradients and optimizer states are needed, which can shorten training and reduces the hardware requirements for fine-tuning large models.
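A short sketch can make the parameter savings concrete. It assumes the usual LoRA formulation, in which a frozen d × k matrix W0 is augmented by a trainable product BA of rank r. The dimensions, the `scale` argument, and the function names are illustrative assumptions, not values from the slides.

```python
def matvec(M, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(x, W0, A, B, scale=1.0):
    # Frozen path W0 x plus the trainable low-rank path B (A x).
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Parameter counts for one hypothetical 4096 x 4096 weight matrix with rank 8.
d, k, r = 4096, 4096, 8
full = d * k            # parameters updated by full fine-tuning
low_rank = r * (d + k)  # parameters updated by LoRA: B is d x r, A is r x k

# Tiny forward pass: B starts at zero, so the adapted layer initially
# reproduces the frozen layer exactly.
W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen, 2 x 3
A = [[0.1, 0.2, 0.3]]                    # trainable, 1 x 3
B = [[0.0], [0.0]]                       # trainable, 2 x 1
h = lora_forward([1.0, 1.0, 1.0], W0, A, B)
```

In this example `full` is 16,777,216 and `low_rank` is 65,536, a 256-fold reduction in trainable parameters for that matrix.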

29 of 32

ADAM Optimizer

Introduction to ADAM

ADAM (Adaptive Moment Estimation) is an optimization algorithm that computes an adaptive learning rate for each parameter, improving efficiency in training machine learning models.

Key Features

ADAM combines the advantages of two other methods: AdaGrad and RMSProp. It keeps running estimates of the first and second moments of the gradient, which gives each parameter its own effective step size and often leads to faster convergence.

Formulas for ADAM

The key equations for ADAM are as follows, where g_t is the gradient at time t and \beta_1, \beta_2 are decay rates:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Parameter Updates

The update rule for parameters is given by: \theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, where \eta is the learning rate and \epsilon is a small constant that prevents division by zero.
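The update rule can be sketched for a single scalar parameter as follows. The moment estimates use the standard ADAM definitions (exponential moving averages of the gradient and squared gradient, with bias correction). The function name `adam_step`, the quadratic objective, and the hyperparameter values are assumptions chosen for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update with a per-parameter effective step size.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
    history.append(theta)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude. Over many steps `theta` approaches the minimizer at 3.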

32 of 32

Fine-Tuning of Bidirectional Models

Introduction to Fine-Tuning

Fine-tuning involves adjusting pre-trained models to enhance performance on specific tasks. This process leverages the knowledge gained during pre-training.

Process Overview

The fine-tuning process typically includes additional training on a smaller, task-specific dataset. This allows the model to adapt its knowledge to the nuances of the new data.

Applications in NLP

Bidirectional models like BERT are particularly effective for tasks such as sentence classification, named entity recognition, and sentiment analysis.

Benefits of Fine-Tuning

Fine-tuning enables models to achieve high accuracy with fewer labeled examples than training from scratch would require. This is crucial in scenarios where labeled data is scarce.
