1 of 18

Transformers / LLMs

Ibrahim Alabdulmohsin and Mehdi Bennani

2 of 18

Outline

  1. Setting Things Up (5 mins)
  2. Basics (20 mins)
  3. Text Tokenization (20 mins)
  4. Encoder-Only: Text Classification (30 mins)
  5. Encoder-Decoder: Sequence Generation (40 mins)
  6. Wrap-up (5 mins)

3 of 18

Setting Things Up

1

4 of 18

Instructions

  1. You should have received a link:
    1. Use the “solved” version.
    2. If not: https://shorturl.at/B4y4t

  2. Click on the button shown on the slide.

  3. Select GPU runtime and connect.

5 of 18

General Guidelines

  1. Active Participation:
    1. This is a hands-on session designed for you to actively engage with the material. Don't just read through the Colab – run the code, experiment with it, and try to understand how it works.

  2. Challenge Yourself:
    1. Try to solve the embedded questions on your own before looking at the solutions.
    2. Remember, there's often more than one way to solve a problem.

  3. Embrace Experimentation:
    1. Play with the code, tweak parameters, change architectures, and see how your modifications affect the results.

  4. Consult the Documentation:
    1. We provide context and background explaining the code throughout the Colab.

  5. Don't Be Afraid to Ask:
    1. We have many volunteers here to help.

6 of 18

Basics

2

7 of 18

Background

  1. What is PyTorch?
    1. A widely used, open-source machine learning framework.
    2. Known for flexibility and ease of use.

  2. Why PyTorch? (see the sketch below)
    1. Dynamic computation graphs for intuitive model building and debugging.
    2. GPU acceleration for faster training and inference.
    3. Pythonic and easy to learn.
    4. Large and active community with ample resources and support.

  3. Keep in mind that other popular frameworks exist as well:
    1. e.g., JAX.
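
To ground these points, here is a minimal, illustrative PyTorch sketch of the features listed above: device selection for the GPU runtime, the dynamic computation graph, and automatic differentiation. The tensor shapes are arbitrary and not tied to the Colab.

```python
import torch

# Pick the GPU runtime if it is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tensors track gradients when requires_grad=True.
x = torch.randn(4, 3, device=device, requires_grad=True)
w = torch.randn(3, 2, device=device, requires_grad=True)

# Operations are recorded on a dynamic computation graph as they run.
y = (x @ w).relu().sum()

# Backpropagation populates .grad on every leaf tensor.
y.backward()
print(x.grad.shape, w.grad.shape)  # torch.Size([4, 3]) torch.Size([3, 2])
```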

8 of 18

Learning Objectives

  1. Implement a Multilayer Perceptron (MLP) architecture with NumPy and PyTorch.
  2. Compare performance and explore PyTorch's JIT compilation for speedups.
  3. Understand and utilize PyTorch's automatic differentiation capabilities.
  4. Implement a basic training loop to optimize the MLP model (see the sketch below).
  5. (Bonus) Experiment with weight decay, learning rates, and network depth.
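
As a rough preview of what the Colab builds, here is a minimal PyTorch sketch touching objectives 1–4 and the bonus knobs. The layer sizes, learning rate, and weight decay are placeholder values, and the random tensors stand in for the actual dataset.

```python
import torch
from torch import nn

# A small MLP of the kind built in the Colab; layer sizes here are illustrative.
class MLP(nn.Module):
    def __init__(self, in_dim=32, hidden_dim=64, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
# torch.compile (PyTorch >= 2.0) JIT-compiles the model for a potential speedup:
# model = torch.compile(model)

# Bonus knobs: learning rate and weight decay (values below are placeholders).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One step of a basic training loop on random stand-in data.
x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)   # forward pass
loss.backward()               # automatic differentiation
optimizer.step()              # gradient update
print(loss.item())
```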

9 of 18

Encoder-Only: Text Classification

3

10 of 18

Background

  1. What is a Transformer?
    1. A powerful neural network architecture for processing sequential data.

  2. Key components of a Transformer (a sketch of the first one follows below):
    1. Self-attention mechanism for capturing relationships between tokens.
    2. Positional embeddings for incorporating token order information.
    3. Layer normalization and MLP layers.
    4. Pooling or a CLS token for classification.
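
As a reference point, here is a minimal single-head scaled dot-product self-attention layer, roughly the building block the next exercise asks you to implement. The dimension and module layout are illustrative, not the Colab's exact interface.

```python
import math
import torch
from torch import nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sizes)."""

    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        attn = scores.softmax(dim=-1)           # relationships between tokens
        return attn @ v

tokens = torch.randn(2, 10, 64)                 # embedded tokens (stand-in data)
print(SelfAttention()(tokens).shape)            # torch.Size([2, 10, 64])
```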

11 of 18

Learning Objectives

  1. Implement a simple Transformer model with a single self-attention layer.
  2. Add MLP layers to the model.
  3. Incorporate positional embeddings to capture token order (see the sketch below).
  4. Experiment with normalization layers and deeper architectures.
  5. Evaluate the model on the "difference dataset" and analyze results.

Image Credit: Ogunfowora, O. and Najjaran, H., 2023. arXiv preprint arXiv:2308.09884.
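
Below is a minimal sketch of how token embeddings, learned positional embeddings, and pooling fit together for classification. The vocabulary size, dimensions, and random token ids are placeholders; the actual "difference dataset" and model are defined in the Colab.

```python
import torch
from torch import nn

# Placeholder sizes for this sketch only.
vocab_size, max_len, dim, num_classes = 100, 16, 64, 2

tok_emb = nn.Embedding(vocab_size, dim)   # token embeddings
pos_emb = nn.Embedding(max_len, dim)      # learned positional embeddings
classifier = nn.Linear(dim, num_classes)  # classification head

ids = torch.randint(0, vocab_size, (8, max_len))   # a batch of token ids
positions = torch.arange(max_len)
x = tok_emb(ids) + pos_emb(positions)              # adds token-order information
# ... x would pass through self-attention / MLP blocks here ...
logits = classifier(x.mean(dim=1))                 # mean pooling over tokens
print(logits.shape)                                # torch.Size([8, 2])
```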

12 of 18

Encoder-Decoder: Sequence Generation

4

13 of 18

Background

  1. How is a Transformer used for sequence generation?
    1. The encoder processes the input sequence.
    2. The decoder generates the output sequence autoregressively, one token at a time.
    3. Cross-attention allows the decoder to attend to the encoder output.
    4. Causal masking prevents the decoder from “peeking” at future tokens.

  2. Key concepts to remember (see the sketch below):
    1. Teacher forcing for training the decoder to predict the next token.
    2. Perplexity as a metric for evaluating sequence generation models.
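
Here is a minimal sketch of two of these ideas: a causal mask, and perplexity computed from the teacher-forced cross-entropy loss. Shapes and vocabulary size are illustrative. Cross-attention is the same attention computation as before, with queries taken from the decoder and keys/values taken from the encoder output.

```python
import torch
from torch import nn

seq_len = 5

# Causal mask: position i may only attend to positions <= i (no "peeking" ahead).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)

# Teacher forcing: the decoder sees the ground-truth prefix and predicts the next
# token, so training reduces to cross-entropy between logits and shifted targets.
vocab_size = 12
logits = torch.randn(2, seq_len, vocab_size)          # decoder outputs (stand-in)
targets = torch.randint(0, vocab_size, (2, seq_len))  # ground-truth next tokens

loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
perplexity = loss.exp()                               # perplexity = exp(mean NLL)
print(loss.item(), perplexity.item())
```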

14 of 18

Learning Objectives

  1. Create an “addition dataset” for sequence generation practice (see the sketch after this list).
    1. Numbers are represented in reverse order; e.g. 123 is represented as “321”.
    2. The end-of-sequence (EOS) token matters here!

  2. Implement cross-attention and causal masking in the decoder.
  3. Build an encoder-decoder Transformer model for autoregressive generation.
  4. Evaluate the model on the “addition dataset” and analyze results.
  5. (Bonus) Experiment with different positional encoding techniques (e.g., sinusoidal).
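
For concreteness, here is one way such examples might be generated, with reversed digits and an explicit EOS marker. The token format, value range, and EOS string are placeholders; the Colab defines its own version.

```python
import random

EOS = "<eos>"  # placeholder end-of-sequence marker

def rev(n: int) -> str:
    # Reversed-digit representation, e.g. 123 -> "321".
    return str(n)[::-1]

def make_example(max_value: int = 999) -> tuple[str, str]:
    a, b = random.randint(0, max_value), random.randint(0, max_value)
    source = f"{rev(a)}+{rev(b)}"   # encoder input
    target = rev(a + b) + EOS       # decoder target; EOS tells it when to stop
    return source, target

print(make_example())  # e.g. ('321+54', '861<eos>') for 123 + 45 = 168
```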

15 of 18

Wrap-up

5

16 of 18

Summary

Key takeaways:

    • Tokenization is a crucial step for preparing text data for machine learning.
    • Transformers are versatile architectures for various tasks, including classification and sequence generation.
    • Self-attention, cross-attention, and causal masking are key mechanisms in Transformers.
    • Positional embeddings provide crucial information about token order.
    • Experimentation and understanding the impact of different components are essential for building effective Transformer models.

17 of 18

Summary

Future steps:

    • Explore more advanced Transformer concepts like multi-headed attention and different positional encoding techniques.
    • Apply Transformers to real-world text datasets and tasks.
    • Implement vision transformers (ViT).

    • And of course, … continue learning and practicing to master the art of building and deploying capable Transformer models.

18 of 18

Thank you.
