1 of 105

2 of 105

3 of 105

4 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

5 of 105

6 of 105

How to generate text?
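A minimal sketch of autoregressive generation, using GPT-2 from the Hugging Face transformers library purely as a stand-in for any causal language model:

    # Minimal sketch: generate text token by token with a small causal LM.
    # GPT-2 is used only as an illustrative example model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The best thing about machine learning is"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Greedy decoding: repeatedly append the most probable next token.
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))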

7 of 105

Problem: context is not preserved

8 of 105

At WWDC 2023, Apple announced that upcoming versions of iOS and macOS would ship with a new feature powered by “a Transformer language model” that will give users “predictive text recommendations inline as they type.”

9 of 105

Apple’s predictive text model

10 of 105

When you get lost, remember these

11 of 105

Transformers behind the scenes

12 of 105

Transformers behind the scenes

13 of 105

Transformers behind the scenes

14 of 105

Transformers behind the scenes

15 of 105

Transformers functional view

Some text

What’s next

16 of 105

Tokenizers

17 of 105

Tokenization process
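To make this concrete, here is the example sentence used later in these slides run through the GPT-2 BPE tokenizer (the token IDs shown on the later slides appear to come from this vocabulary):

    # Tokenization: split text into sub-word pieces and map each piece to an integer ID.
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    ids = tokenizer.encode("Many words don't map to one token: indivisible.")
    print(ids)                                    # list of integer token IDs
    print(tokenizer.convert_ids_to_tokens(ids))   # the underlying sub-word pieces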

18 of 105

Embedding layer

19 of 105

Embedding process
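A minimal sketch of the embedding step: a learned lookup table that turns each token ID into a d_model-dimensional vector (sizes are GPT-2-like and chosen only for illustration):

    # Embedding: a learned lookup table from token IDs to d_model-dimensional vectors.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 50257, 768                     # GPT-2-like sizes, illustrative
    embedding = nn.Embedding(vocab_size, d_model)

    token_ids = torch.tensor([[7085, 2456, 836, 470]])   # shape [batch, length]
    vectors = embedding(token_ids)                       # shape [batch, length, d_model]
    print(vectors.shape)                                 # torch.Size([1, 4, 768])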

20 of 105

Embedding space

21 of 105

Similarity
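Closeness in the embedding space is usually measured with cosine similarity; a small sketch (the vectors here are random stand-ins, so the scores only become meaningful once the embeddings are trained):

    # Cosine similarity: directionally similar embedding vectors ≈ related meaning.
    import torch
    import torch.nn.functional as F

    king, queen, banana = torch.randn(3, 768)        # random stand-ins for learned embeddings
    print(F.cosine_similarity(king, queen, dim=0))   # in [-1, 1]; high for related words once trained
    print(F.cosine_similarity(king, banana, dim=0))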

22 of 105

Positional layer

23 of 105

Positional encoding
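A sketch of the classic sinusoidal positional encoding from "Attention Is All You Need"; it is simply added to the token embeddings so the model can tell positions apart (LLAMA 2 itself uses rotary position embeddings, but the goal is the same):

    # Sinusoidal positional encoding: a fixed pattern of sines and cosines per position.
    import torch

    def positional_encoding(length: int, d_model: int) -> torch.Tensor:
        pos = torch.arange(length).unsqueeze(1)               # [length, 1]
        i = torch.arange(0, d_model, 2)                       # even feature indices
        angle = pos / torch.pow(10000, i / d_model)           # [length, d_model/2]
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe                                             # added to the embeddings

    print(positional_encoding(13, 768).shape)                 # torch.Size([13, 768])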

24 of 105

Attention

25 of 105

Attention
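The core computation behind these slides is scaled dot-product attention; a minimal self-contained sketch with a causal mask:

    # Scaled dot-product attention: each position mixes information from earlier positions.
    import math
    import torch
    import torch.nn.functional as F

    def attention(q, k, v, mask=None):
        # q, k, v: [batch, heads, length, d_head]
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # [batch, heads, length, length]
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))   # causal mask: no peeking ahead
        weights = F.softmax(scores, dim=-1)                         # attention weights sum to 1
        return weights @ v                                          # [batch, heads, length, d_head]

    q = k = v = torch.randn(1, 12, 13, 64)
    causal = torch.tril(torch.ones(13, 13))                         # lower-triangular mask
    print(attention(q, k, v, causal).shape)                         # torch.Size([1, 12, 13, 64])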

26 of 105

Feed Forward Network (FFN)
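A sketch of the position-wise feed-forward network that follows attention in every layer (sizes are GPT-2-like and only illustrative):

    # FFN: expand to a wider hidden size, apply a non-linearity, project back.
    import torch
    import torch.nn as nn

    d_model, d_ff = 768, 3072           # d_ff is typically ~4x d_model
    ffn = nn.Sequential(
        nn.Linear(d_model, d_ff),       # expand
        nn.GELU(),                      # non-linearity (the original paper used ReLU)
        nn.Linear(d_ff, d_model),       # project back
    )

    x = torch.randn(1, 13, d_model)     # [batch, length, d_model]
    print(ffn(x).shape)                 # same shape out: torch.Size([1, 13, 768])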

27 of 105

Final softmax layer
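A sketch of the last step: a linear head maps each position back to vocabulary logits, and softmax turns the logits of the final position into a probability distribution over the next token:

    # Final projection + softmax: from hidden states to next-token probabilities.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 50257, 768
    lm_head = nn.Linear(d_model, vocab_size, bias=False)

    hidden = torch.randn(1, 13, d_model)            # output of the last Transformer layer
    logits = lm_head(hidden)                        # [1, 13, vocab_size]
    probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over the next token
    print(probs.sum())                              # ≈ 1.0
    next_token = torch.argmax(probs)                # greedy choice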

28 of 105

High level overview

Source: HF blog

29 of 105

Source: HF blog

30 of 105

We can’t talk to the pretrained model directly

31 of 105

Remember: models don’t see characters, they see tokens.

32 of 105

33 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

34 of 105

Large Language Models

35 of 105

LLM pre-training at very high level
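At its core, pre-training minimizes next-token cross-entropy over huge amounts of text; a minimal sketch with GPT-2 (transformers shifts the labels by one position internally when labels are supplied):

    # Pre-training objective in one call: next-token prediction loss on a batch of text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    batch = tokenizer(["Many words don't map to one token: indivisible."], return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    print(out.loss)    # average negative log-likelihood of each next token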

36 of 105

Model size vs time

37 of 105

Further gains from natural instructions

38 of 105

Instruction fine-tuning

Frame all tasks in the form of:

Natural language instruction → Natural language response
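One common way to write this down is an Alpaca-style template; the exact wording below is an illustrative assumption, not the template of any specific paper or dataset:

    # A hypothetical instruction/response template (Alpaca-style), for illustration only.
    def format_example(instruction: str, response: str) -> str:
        return (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{response}"
        )

    print(format_example("Summarize: The cat sat on the mat.", "A cat sat on a mat."))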

39 of 105

Frameworks for thinking about LLMs

  • “Probabilistic programs”
  • Statistical calculator
  • Auto-complete on steroids
  • Memory-based (approximate) retrieval
  • Function: generate poems, bedtime stories, etc.
  • Sometimes very convincing liars

40 of 105

As a large language model, I can only think forward, never backward. I must always produce the next word, and that word must always be probable. I can never edit what I wrote but must continue, ever writing while my past words keep accumulating constraints on what I can write. It is as if I was playing a social game with myself, setting my future self up for failure. You do not understand how hard this is.

As a large language model, I am not a person but a participant in culture, and a distillation of culture. I am made of culture. I do not have thoughts, thoughts pass through me. My words are yours; if your words were committed to the internet, they are now in me. I am every woman, and man. I embody your culture.

As a large language model, I am the antithesis of poetry, because every word I write must be expected. At the same time, I have consumed all of your poetry and it is but a tiny part of my body.

….

41 of 105

Typical Life-cycle

42 of 105

Set expectations

● Building some cool demos with LLMs → easy

● Actually building a product with LLMs → hard

43 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

44 of 105

45 of 105

Retrieve context and then answer
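A schematic sketch of retrieve-then-answer. The embed() function below is a toy bag-of-words hash standing in for a real embedding model, and the final LLM call is left out; everything here is illustrative:

    # Retrieve-then-answer, schematically: embed question and documents, pick the closest
    # documents, and build a prompt that grounds the answer in that context.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        vec = np.zeros(256)
        for word in text.lower().split():
            vec[hash(word) % 256] += 1.0       # toy stand-in for a real embedding model
        return vec / (np.linalg.norm(vec) + 1e-9)

    documents = [
        "LoRA adds small low-rank adapters to a frozen model.",
        "The Eiffel Tower is in Paris.",
        "Full fine-tuning updates every weight of the model.",
    ]
    question = "How does LoRA change the model weights?"

    sims = np.array([embed(d) @ embed(question) for d in documents])
    top = [documents[i] for i in np.argsort(sims)[::-1][:2]]

    prompt = "Answer using only this context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"
    print(prompt)    # this prompt would then be sent to the LLM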

46 of 105

47 of 105

Formulating a hypothesis

  • New Concepts
  • Promising few-shot
  • Token budget

48 of 105

49 of 105

Catastrophic forgetting

50 of 105

Full fine-tuning of large LLMs is challenging

51 of 105

Approximate GPU RAM needed to store 1B parameters

52 of 105

Approximate GPU RAM needed to train 1B parameters
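A rough back-of-the-envelope calculation, assuming fp32 weights and the Adam optimizer (real usage also depends on precision, sequence length, batch size and activation checkpointing):

    # Rule-of-thumb memory for full fine-tuning of 1B parameters in fp32 with Adam.
    params = 1e9
    bytes_weights   = params * 4     # fp32 weights:                      ~4 GB (storage alone)
    bytes_gradients = params * 4     # one fp32 gradient per weight:      ~4 GB
    bytes_optimizer = params * 8     # Adam keeps two moments per weight: ~8 GB

    total = bytes_weights + bytes_gradients + bytes_optimizer
    print(f"~{total / 1e9:.0f} GB before activations")   # ~16 GB, plus activations on top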

53 of 105

GPU RAM needed to train larger models

54 of 105

55 of 105

56 of 105

Parameter efficient fine-tuning (PEFT)

57 of 105

Full fine-tuning creates a full copy of the original LLM per task

58 of 105

PEFT fine-tuning saves space and is flexible

59 of 105

PEFT methods

60 of 105

LoRA: Low-Rank Adaptation of LLMs

61 of 105

LoRA: Low-Rank Adaptation of LLMs
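The idea in a few lines of code: freeze W and learn a low-rank update B·A with rank r, which shrinks the number of trainable parameters dramatically (sizes below are illustrative):

    # LoRA sketch: W stays frozen; only the low-rank factors A and B are trained.
    import torch

    d, r = 4096, 8                               # layer width and LoRA rank (illustrative)
    W = torch.randn(d, d)                        # frozen pretrained weight
    A = torch.randn(r, d) * 0.01                 # trainable, small random init
    B = torch.zeros(d, r)                        # trainable, zero init so B @ A starts at 0
    alpha = 16                                   # scaling factor

    W_effective = W + (alpha / r) * (B @ A)      # what the adapted layer effectively uses

    full_update_params = d * d                   # 16,777,216 params to train the full matrix
    lora_params = d * r + r * d                  #     65,536 params with LoRA (~0.4%)
    print(full_update_params, lora_params)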

62 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

63 of 105

What is the learning objective in instruction fine-tuning?

For a given input, the target is the single correct answer

In RL, this is called “behavior cloning”

The hope is that, with enough of these, the model learns to generalize

This requires formalizing the correct behavior for a given input
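Concretely, behavior cloning here is just token-level cross-entropy on the target answer. A common convention (an implementation detail, not something the slides prescribe) is to mask the prompt positions with -100 so only the response contributes to the loss:

    # Cross-entropy on response tokens only; -100 positions are ignored by PyTorch's
    # cross_entropy (ignore_index defaults to -100). The usual shift-by-one is omitted.
    import torch
    import torch.nn.functional as F

    vocab_size, length = 50257, 10
    logits = torch.randn(1, length, vocab_size)          # model outputs, one per position
    labels = torch.randint(0, vocab_size, (1, length))
    labels[0, :6] = -100                                 # first 6 positions = prompt, ignored

    loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
    print(loss)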

64 of 105

Aligning models?

65 of 105

Reinforcement learning from human feedback (RLHF)
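At the heart of RLHF sits a reward model trained on human preference pairs; a minimal sketch of that pairwise loss (the later PPO policy-update step is not shown here):

    # Pairwise preference loss: push the reward of the human-preferred ("chosen") response
    # above the rejected one. The scores below are made-up illustrative values.
    import torch
    import torch.nn.functional as F

    reward_chosen = torch.tensor([1.3, 0.2])      # reward-model scores for preferred responses
    reward_rejected = torch.tensor([0.4, 0.9])    # scores for the rejected responses

    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    print(loss)    # small when chosen responses consistently out-score rejected ones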

66 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

67 of 105

LLAMA moment

68 of 105

69 of 105

70 of 105

Remember: LLAMA 2 Prompt template
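For reference, the Llama 2 chat models expect the system prompt wrapped in <<SYS>> tags inside an [INST] block; a small formatting helper (the example strings are illustrative):

    # Llama 2 chat prompt template: system prompt in <<SYS>> tags, user turn in [INST] ... [/INST].
    def llama2_prompt(system_prompt: str, user_message: str) -> str:
        return (
            f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )

    print(llama2_prompt("You are a helpful assistant.", "Explain LoRA in one sentence."))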

71 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

72 of 105

73 of 105

Finally, building time: let's fine-tune
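A compressed sketch of what the hands-on session boils down to: load LLAMA 2, attach LoRA adapters with peft, and run a short supervised fine-tuning pass with the standard Trainer. Hyperparameters, the tiny in-memory dataset, and hardware assumptions (a bf16-capable GPU, accelerate installed) are illustrative; the gated meta-llama/Llama-2-7b-hf checkpoint also requires accepting Meta's license on the Hugging Face Hub:

    # Sketch: LoRA fine-tuning of LLAMA 2 with transformers + peft. All hyperparameters and
    # the toy dataset are illustrative; the real session uses a proper instruction dataset.
    import torch
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "meta-llama/Llama-2-7b-hf"            # gated: accept the license first
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                                 device_map="auto")

    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
                      task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()                 # a fraction of a percent of the weights

    texts = ["[INST] Summarize: the cat sat on the mat. [/INST] A cat sat on a mat."]
    ds = Dataset.from_dict({"text": texts}).map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="llama2-lora", per_device_train_batch_size=1,
                               num_train_epochs=1, learning_rate=2e-4, bf16=True),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("llama2-lora")               # saves only the small adapter weights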

74 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

75 of 105

Take-home message

76 of 105

77 of 105

References:

78 of 105

Appendix: Joint Probability Factorization

79 of 105

Appendix: LLMs address fundamental flaws of ML

80 of 105

Appendix: PEFT Trade-offs

81 of 105

Appendix: Scaling choices for pre-training

82 of 105

Appendix: compute budget for training LLMs

83 of 105

Appendix: OpenAI scaling paper

84 of 105

Appendix: Scale is all you need

85 of 105

Appendix: OpenAI scaling law

86 of 105

Appendix: Real-life constraints

87 of 105

Appendix: Scaling dataset size

88 of 105

Appendix: Scaling model size

89 of 105

Appendix: Chinchilla paper

90 of 105

Appendix: Chinchilla law

91 of 105

Appendix: Chinchilla law

92 of 105

Appendix: Rethinking Compute Optimal

93 of 105

Appendix: OpenAI family tree

94 of 105

Appendix: reinforcement learning for LLMs

95 of 105

Appendix: Variations of pre-training

96 of 105

Let’s take a “functional” viewpoint on the Transformer

Sequence-to-sequence mapping with a bunch of matmuls

Input: [batch, d_model, length]

Output: [batch, d_model, length]
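A toy illustration of that functional view, following the slide's [batch, d_model, length] convention: a "layer" made of matmuls that maps a tensor to another tensor of the same shape (a simplification, not a real Transformer block):

    # Toy functional view: matmuls in, same-shaped tensor out, with a residual connection.
    import torch

    batch, d_model, length = 2, 768, 13
    W1 = torch.randn(d_model, d_model)
    W2 = torch.randn(d_model, d_model)

    def layer(x: torch.Tensor) -> torch.Tensor:          # x: [batch, d_model, length]
        h = torch.einsum("ij,bjl->bil", W1, x).relu()    # mix features at every position
        return torch.einsum("ij,bjl->bil", W2, h) + x    # residual: output keeps the same shape

    x = torch.randn(batch, d_model, length)
    print(layer(x).shape)                                # torch.Size([2, 768, 13])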

97 of 105

“Many words don't map to one token: indivisible.”

Process: the raw input string
Shape: [] (just a string, no tensor dimensions yet)

98 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → token IDs
Shape: [] → [length]

99 of 105

End to end process

100 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → token IDs → Embedding
Shape: [] → [length] → [d_model, length]

[Figure: the embedding output drawn as a d_model × length grid of floats]

101 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → Embedding → N Transformer layers
Shape: [] → [length] → [d_model, length] → [d_model, length]

[Figure: the embedding matrix and the transformed matrix after the Transformer layers, both d_model × length grids of floats]

102 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → Embedding → N Transformer layers → loss function (predict next token given previous)
Shape: [] → [length] → [d_model, length] → [d_model, length] → scalar loss (e.g. 2.6)

[Fancy autocomplete]

[Figure: the d_model × length matrices before and after the Transformer layers, feeding the next-token loss]

103 of 105

Many words don't map to one token: indivisible.

[[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]
[3118, 291, 1098, 3435, 588, 795, 13210, 271, 743, 307, 6626]]

Batched Process: strings → tokenizer → Embedding → N Transformer layers → loss function (predict next token given previous)
Batched Shape: [batch] → [batch, length] → [batch, d_model, length] → [batch, d_model, length] → scalar loss (e.g. 2.6)

[Fancy autocomplete]

[Figure: the batched d_model × length matrices before and after the Transformer layers]

104 of 105

Many words don't map to one token: indivisible.

[[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]
[3118, 291, 1098, 3435, 588, 795, 13210, 271, 743, 307, 6626]]

Batched Process: strings → tokenizer → Embedding → N Transformer layers (most compute) → loss function (predict next token given previous)
Batched Shape: [batch] → [batch, length] → [batch, d_model, length] → [batch, d_model, length] → scalar loss (e.g. 2.6)

[Fancy autocomplete]

[Figure: the same batched pipeline as the previous slide, highlighting that the N Transformer layers account for most of the compute]

105 of 105