1 of 18

Transformers / LLMs

Ibrahim Alabdulmohsin and Mehdi Bennani

2 of 18

Outline

  1. Setting Things Up (5 mins)
  2. Basics (20 mins)
  3. Text Tokenization (20 mins)
  4. Encoder-Only: Text Classification (30 mins)
  5. Encoder-Decoder: Sequence Generation (40 mins)
  6. Wrap-up (5 mins)

3 of 18

Setting Things Up

1

4 of 18

Instructions

  1. You should have received a link:
    1. Use the “solved” version.
    2. If not: https://shorturl.at/B4y4t

  2. Click on the button shown on the slide.

  3. Select GPU runtime and connect.

5 of 18

General Guidelines

  1. Active Participation:
    1. This is a hands-on session designed for you to actively engage with the material. Don't just read through the Colab – run the code, experiment with it, and try to understand how it works.

  2. Challenge Yourself:
    1. Try to solve the embedded questions on your own before looking at the solutions.
    2. Remember, there's often more than one way to solve a problem.

  3. Embrace Experimentation:
    1. Play with the code, tweak parameters, change architectures, and see how your modifications affect the results.

  4. Consult the Documentation:
    1. We provide context and background explaining the code throughout the Colab.

  5. Don't Be Afraid to Ask:
    1. We have many volunteers here to help.

6 of 18

Basics

2

7 of 18

Background

  1. What is PyTorch?
    1. A widely used, open-source machine learning framework.
    2. Known for flexibility and ease of use.

  2. Why PyTorch? (see the sketch below)
    1. Dynamic computation graphs for intuitive model building and debugging.
    2. GPU acceleration for faster training and inference.
    3. Pythonic and easy to learn.
    4. Large and active community with ample resources and support.

  3. Keep in mind that other popular frameworks exist as well:
    1. e.g., JAX.
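
To ground these points, here is a minimal, illustrative PyTorch sketch of the features listed above: device selection for the GPU runtime, the dynamic computation graph, and automatic differentiation. The tensor shapes are arbitrary and not tied to the Colab.

```python
import torch

# Pick the GPU runtime if it is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tensors track gradients when requires_grad=True.
x = torch.randn(4, 3, device=device, requires_grad=True)
w = torch.randn(3, 2, device=device, requires_grad=True)

# Operations are recorded on a dynamic computation graph as they run.
y = (x @ w).relu().sum()

# Backpropagation populates .grad on every leaf tensor.
y.backward()
print(x.grad.shape, w.grad.shape)  # torch.Size([4, 3]) torch.Size([3, 2])
```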

8 of 18

Learning Objectives

  1. Implement a Multilayer Perceptron (MLP) architecture with NumPy and PyTorch.
  2. Compare performance and explore PyTorch's JIT compilation for speedups.
  3. Understand and utilize PyTorch's automatic differentiation capabilities.
  4. Implement a basic training loop to optimize the MLP model (see the sketch below).
  5. (Bonus) Experiment with weight decay, learning rates, and network depth.
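
As a rough preview of what the Colab builds, here is a minimal PyTorch sketch touching objectives 1–4 and the bonus knobs. The layer sizes, learning rate, and weight decay are placeholder values, and the random tensors stand in for the actual dataset.

```python
import torch
from torch import nn

# A small MLP of the kind built in the Colab; layer sizes here are illustrative.
class MLP(nn.Module):
    def __init__(self, in_dim=32, hidden_dim=64, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
# torch.compile (PyTorch >= 2.0) JIT-compiles the model for a potential speedup:
# model = torch.compile(model)

# Bonus knobs: learning rate and weight decay (values below are placeholders).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One step of a basic training loop on random stand-in data.
x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)   # forward pass
loss.backward()               # automatic differentiation
optimizer.step()              # gradient update
print(loss.item())
```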

9 of 18

Encoder-Only: Text Classification

3

10 of 18

Background

  1. What is a Transformer?
    1. A powerful neural network architecture for processing sequential data.

  2. Key components of a Transformer (a sketch of the first one follows below):
    1. Self-attention mechanism for capturing relationships between tokens.
    2. Positional embeddings for incorporating token order information.
    3. Layer normalization and MLP layers.
    4. Pooling or a CLS token for classification.
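
As a reference point, here is a minimal single-head scaled dot-product self-attention layer, roughly the building block the next exercise asks you to implement. The dimension and module layout are illustrative, not the Colab's exact interface.

```python
import math
import torch
from torch import nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sizes)."""

    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        attn = scores.softmax(dim=-1)           # relationships between tokens
        return attn @ v

tokens = torch.randn(2, 10, 64)                 # embedded tokens (stand-in data)
print(SelfAttention()(tokens).shape)            # torch.Size([2, 10, 64])
```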

11 of 18

Learning Objectives

  1. Implement a simple Transformer model with a single self-attention layer.
  2. Add MLP layers to the model.
  3. Incorporate positional embeddings to capture token order (see the sketch below).
  4. Experiment with normalization layers and deeper architectures.
  5. Evaluate the model on the "difference dataset" and analyze results.

Image Credit: Ogunfowora, O. and Najjaran, H., 2023. arXiv preprint arXiv:2308.09884.
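
Below is a minimal sketch of how token embeddings, learned positional embeddings, and pooling fit together for classification. The vocabulary size, dimensions, and random token ids are placeholders; the actual "difference dataset" and model are defined in the Colab.

```python
import torch
from torch import nn

# Placeholder sizes for this sketch only.
vocab_size, max_len, dim, num_classes = 100, 16, 64, 2

tok_emb = nn.Embedding(vocab_size, dim)   # token embeddings
pos_emb = nn.Embedding(max_len, dim)      # learned positional embeddings
classifier = nn.Linear(dim, num_classes)  # classification head

ids = torch.randint(0, vocab_size, (8, max_len))   # a batch of token ids
positions = torch.arange(max_len)
x = tok_emb(ids) + pos_emb(positions)              # adds token-order information
# ... x would pass through self-attention / MLP blocks here ...
logits = classifier(x.mean(dim=1))                 # mean pooling over tokens
print(logits.shape)                                # torch.Size([8, 2])
```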

12 of 18

Encoder-Decoder: Sequence Generation

4

13 of 18

Background

  1. How is a Transformer used for sequence generation?
    1. The encoder processes the input sequence.
    2. The decoder generates the output sequence autoregressively, one token at a time.
    3. Cross-attention allows the decoder to attend to the encoder output.
    4. Causal masking prevents the decoder from “peeking” at future tokens.

  2. Key concepts to remember (see the sketch below):
    1. Teacher forcing for training the decoder to predict the next token.
    2. Perplexity as a metric for evaluating sequence generation models.
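
Here is a minimal sketch of two of these ideas: a causal mask, and perplexity computed from the teacher-forced cross-entropy loss. Shapes and vocabulary size are illustrative. Cross-attention is the same attention computation as before, with queries taken from the decoder and keys/values taken from the encoder output.

```python
import torch
from torch import nn

seq_len = 5

# Causal mask: position i may only attend to positions <= i (no "peeking" ahead).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)

# Teacher forcing: the decoder sees the ground-truth prefix and predicts the next
# token, so training reduces to cross-entropy between logits and shifted targets.
vocab_size = 12
logits = torch.randn(2, seq_len, vocab_size)          # decoder outputs (stand-in)
targets = torch.randint(0, vocab_size, (2, seq_len))  # ground-truth next tokens

loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
perplexity = loss.exp()                               # perplexity = exp(mean NLL)
print(loss.item(), perplexity.item())
```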

14 of 18

Learning Objectives

  1. Create an “addition dataset” for sequence generation practice (see the sketch after this list).
    1. Numbers are represented in reverse order; e.g. 123 is represented as “321”.
    2. The end-of-sequence (EOS) token matters here!

  2. Implement cross-attention and causal masking in the decoder.
  3. Build an encoder-decoder Transformer model for autoregressive generation.
  4. Evaluate the model on the “addition dataset” and analyze results.
  5. (Bonus) Experiment with different positional encoding techniques (e.g., sinusoidal).
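
For concreteness, here is one way such examples might be generated, with reversed digits and an explicit EOS marker. The token format, value range, and EOS string are placeholders; the Colab defines its own version.

```python
import random

EOS = "<eos>"  # placeholder end-of-sequence marker

def rev(n: int) -> str:
    # Reversed-digit representation, e.g. 123 -> "321".
    return str(n)[::-1]

def make_example(max_value: int = 999) -> tuple[str, str]:
    a, b = random.randint(0, max_value), random.randint(0, max_value)
    source = f"{rev(a)}+{rev(b)}"   # encoder input
    target = rev(a + b) + EOS       # decoder target; EOS tells it when to stop
    return source, target

print(make_example())  # e.g. ('321+54', '861<eos>') for 123 + 45 = 168
```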

15 of 18

Wrap-up

5

16 of 18

Summary

Key takeaways:

    • Tokenization is a crucial step for preparing text data for machine learning.
    • Transformers are versatile architectures for various tasks, including classification and sequence generation.
    • Self-attention, cross-attention, and causal masking are key mechanisms in Transformers.
    • Positional embeddings provide crucial information about token order.
    • Experimentation and understanding the impact of different components are essential for building effective Transformer models.

17 of 18

Summary

Future steps:

    • Explore more advanced Transformer concepts like multi-headed attention and different positional encoding techniques.
    • Apply Transformers to real-world text datasets and tasks.
    • Implement vision transformers (ViT).

    • And of course, … continue learning and practicing to master the art of building and deploying capable Transformer models.

18 of 18

Thank you.
