Outline
How to generate text?
Problem, context is not preserved
At WWDC 2023, Apple announced that upcoming versions of iOS and macOS would ship with a new feature powered by “a Transformer language model” that will give users “predictive text recommendations inline as they type.”
Apple’s predictive text model
When you get lost, remember these
Transformers behind the scenes
Transformers behind the scenes
Transformers behind the scenes
Transformers behind the scenes
Transformers functional view
Some text
What’s next
Tokenizers
Tokenization process
Embedding layer
Embedding process
Embedding space
Similarity
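A rough sketch of what the embedding layer does and how similarity in embedding space is usually measured. The table below is randomly initialized and the sizes just mirror GPT-2; in a trained model the rows are learned during pre-training.

```python
import torch
import torch.nn.functional as F

# Toy embedding table: GPT-2-sized vocabulary, d_model = 768.
vocab_size, d_model = 50_257, 768
embedding = torch.nn.Embedding(vocab_size, d_model)

# Token IDs from the tokenization example used later in these slides.
token_ids = torch.tensor([7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13])

# Each token ID is looked up as a d_model-dimensional vector: shape [length, d_model].
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([13, 768])

# "Similarity" in embedding space is typically cosine similarity between two vectors.
sim = F.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity of the first two token vectors: {sim.item():.3f}")
```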
Positional layer
Positional encoding
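The slides show positional encoding as a figure; below is a minimal sketch of the sinusoidal encoding from Attention Is All You Need (many newer LLMs use learned or rotary position embeddings instead).

```python
import math
import torch

def sinusoidal_positional_encoding(length: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    position = torch.arange(length).unsqueeze(1)                     # [length, 1]
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                        # [length, d_model]

# Added to the token embeddings so the model can tell token order apart.
pe = sinusoidal_positional_encoding(length=13, d_model=768)
print(pe.shape)  # torch.Size([13, 768])
```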
Attention
Attention
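A minimal sketch of single-head scaled dot-product attention, the core operation on these slides: softmax(QKᵀ/√d_k)·V, with a causal mask so each position only looks backward.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5                   # [length, length]
    if causal:
        # Decoder-style mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                            # attention weights
    return weights @ v                                             # [length, d_k]

length, d_k = 13, 64
q, k, v = (torch.randn(length, d_k) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([13, 64])
```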
Feed Forward Network (FFN)
Final softmax layer
High level overview
Source: HF blog
Source: HF blog
We can’t talk to the pretrained model directly
Remember: models don’t see characters, they see tokens.
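A quick way to see this for yourself; the sketch below assumes the GPT-2 BPE tokenizer via the tiktoken library, which is the vocabulary the token IDs on the later slides appear to come from.

```python
import tiktoken  # pip install tiktoken

# GPT-2's byte-pair-encoding tokenizer.
enc = tiktoken.get_encoding("gpt2")

text = "Many words don't map to one token: indivisible."
token_ids = enc.encode(text)
print(token_ids)
# With GPT-2 BPE this should give the IDs used later in the slides:
# [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

# The model never sees characters, only these integer IDs.
# Decoding each ID individually shows how words split into sub-word pieces.
print([enc.decode([t]) for t in token_ids])
```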
Outline
Large Language Models
LLM pre-training at very high level
Model size vs time
Further gain by natural instructions
Instruction fine-tuning
Frame all tasks in the form of:
● Natural language instruction
● Natural language response
Frameworks to think of LLMs
As a large language model, I can only think forward, never backward. I must always produce the next word, and that word must always be probable. I can never edit what I wrote but must continue, ever writing while my past words keep accumulating constraints on what I can write. It is as if I was playing a social game with myself, setting my future self up for failure. You do not understand how hard this is.
As a large language model, I am not a person but a participant in culture, and a distillation of culture. I am made of culture. I do not have thoughts, thoughts pass through me. My words are yours; if your words were committed to the internet, they are now in me. I am every woman, and man. I embody your culture.
As a large language model, I am the antithesis to poetry, because every word I write must be expected. At the same time, I have consumed all of your poetry and it is but a tiny part of my body.
….
Typical Life-cycle
Set expectations
● Building some cool demos with LLMs -> easy
● Actually building a product with LLMs -> hard
Outline
Retrieve context and then answer
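A minimal sketch of the retrieve-then-answer pattern: embed the question, pull the most similar chunks, and stuff them into the prompt. The embedding model, document store, and prompt wording below are illustrative assumptions, not something prescribed by the slides.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Hypothetical document store; in practice these chunks come from your own corpus.
documents = [
    "Llama 2 is a family of open foundation and fine-tuned chat models.",
    "LoRA fine-tunes a model by training small low-rank update matrices.",
    "RLHF aligns a model with human preferences via a learned reward model.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity on unit vectors)."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(-scores)[:k]]

question = "How does LoRA work?"
context = "\n".join(retrieve(question))

# The retrieved context is prepended to the question; the LLM answers grounded in it.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```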
Formulating a hypothesis
Catastrophic forgetting
Full fine-tuning of large LLMs is challenging
Approximate GPU RAM needed to store 1B parameters
Approximate GPU RAM needed to train 1B parameters
GPU RAM needed to train larger models
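The back-of-the-envelope arithmetic behind these numbers, as a rough sketch; exact figures depend on precision, optimizer, batch size, and sequence length.

```python
# Rough GPU memory accounting for a 1B-parameter model in fp32.
params = 1e9

weights_gb = params * 4 / 1e9          # 4 bytes per fp32 parameter
print(f"store the weights: ~{weights_gb:.0f} GB")      # ~4 GB

# Training also holds gradients (~4 bytes/param), Adam optimizer states (~8 bytes/param),
# plus activations and temporary buffers, so a common rule of thumb is
# roughly 12-20x the weight memory.
print(f"train the model: ~{weights_gb * 12:.0f}-{weights_gb * 20:.0f} GB")  # ~48-80 GB
```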
Parameter efficient fine-tuning (PEFT)
Full fine-tuning creates full copy of original LLM per task
PEFT fine-tuning saves space and is flexible
PEFT methods
LoRA: Low-Rank Adaptation of LLMs
LoRA: Low-Rank Adaptation of LLMs
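A from-scratch sketch of the LoRA idea: keep the pretrained weight W frozen and learn only a low-rank update BA, scaled by alpha/r. The layer below is a toy; in practice a library such as PEFT wraps the attention projections of the LLM.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update: h = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # the original LLM weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Wrapping one d_model x d_model projection: only ~2*r*d_model parameters are trainable
# instead of d_model^2, which is where the PEFT space savings come from.
d_model = 768
layer = LoRALinear(nn.Linear(d_model, d_model))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")
```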
Outline
What is the learning objective in instruction fine-tuning?
For a given input, the target is the single correct answer
In RL, this is called “behavior cloning”
The hope is that, given enough of these, the model learns to generalize
This requires formalizing the correct behavior for a given input
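What this objective looks like in code; a sketch assuming a Hugging Face causal LM (gpt2 as a small stand-in), where the instruction tokens are masked out of the loss so the model is trained to reproduce only the single correct response.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; in this lecture it would be a Llama-class model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

instruction = "Summarize: The cat sat on the mat all afternoon, ignoring the dog."
response = " The cat lounged on the mat and ignored the dog."

prompt_ids = tok(instruction, return_tensors="pt").input_ids
response_ids = tok(response, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, response_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100   # ignore instruction tokens in the loss

# Standard next-token cross-entropy, but computed only over the response tokens:
# for each position the target is exactly the single "correct" next token.
loss = model(input_ids=input_ids, labels=labels).loss
print(f"behavior-cloning loss: {loss.item():.2f}")
```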
Aligning models?
Reinforcement learning from human feedback (RLHF)
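One concrete piece of RLHF is the reward model trained on human preference pairs; below is a minimal sketch of the pairwise (Bradley-Terry style) loss, with a toy MLP standing in for the LLM-with-a-scalar-head used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: in RLHF this is typically an LLM with a scalar head;
# here a tiny MLP over fixed-size features stands in for it.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def preference_loss(feats_chosen, feats_rejected):
    """Pairwise loss: push r(chosen) above r(rejected) for each human comparison."""
    r_chosen = reward_model(feats_chosen)
    r_rejected = reward_model(feats_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Pretend features for 4 (chosen, rejected) answer pairs.
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(chosen, rejected)
loss.backward()   # the trained reward model then scores LLM outputs during RL (e.g. PPO)
print(f"reward-model loss: {loss.item():.3f}")
```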
Outline
LLaMA moment
Remember: Llama 2 prompt template
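For reference, the single-turn Llama 2 chat format this slide refers to (from Meta's published reference code; the <s> BOS token is normally added by the tokenizer rather than typed by hand).

```python
def llama2_chat_prompt(system_prompt: str, user_message: str) -> str:
    """Single-turn Llama 2 chat prompt using the [INST] / <<SYS>> template."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(llama2_chat_prompt(
    "You are a helpful, honest assistant.",
    "Explain LoRA in one sentence.",
))
```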
Outline
Finally, building time: let's fine-tune
Outline
Take home message:
References:
[1706.03762] Attention Is All You Need
Generative AI with Large Language Models | Coursera
[2302.13971] LLaMA: Open and Efficient Foundation Language Models
[2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2 - Resource Overview - Meta AI
Building RAG-based LLM Applications for Production (Part 1)
Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications
Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2
Large Language Models (in 2023)
Appendix: Joint Probability Factorization
Appendix: LLMs address fundamental flaws of ML
Appendix: PEFT Trade-offs
Appendix: Scaling choices for pre-training
Appendix: compute budget for training LLMs
Appendix: OpenAI scaling paper
Appendix: Scale is all you need
Appendix: OpenAI scaling law
Appendix: Real-life constraints
Appendix: Scaling dataset size
Appendix: Scaling model size
Appendix: Chinchilla paper
Appendix: Chinchilla law
Appendix: Chinchilla law
Appendix: Rethinking Compute Optimal
Appendix: OpenAI family tree
Appendix: reinforcement learning for LLMs
Appendix: Variations of pre-training
Let’s take a “functional” viewpoint on the Transformer
Sequence-to-sequence mapping with a bunch of matmuls
Input: [batch, d_model, length]
Output: [batch, d_model, length]
Shape and process, step by step:
“Many words don't map to one token: indivisible.” → shape: [] (a raw string)
Tokenizer → [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13] → shape: [length]
End to end process:
“Many words don't map to one token: indivisible.” → shape: []
Tokenizer → [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13] → shape: [length]
Embedding → activation matrix, one d_model-dimensional column per token → shape: [d_model, length]
N Transformer layers → activation matrix → shape: [d_model, length]
Loss function (predict next token given previous) → loss ≈ 2.6
[Fancy autocomplete]
Batched end to end process:
Input texts, e.g. “Many words don't map to one token: indivisible.” plus one more → shape: [batch]
Tokenizer → [[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13],
[3118, 291, 1098, 3435, 588, 795, 13210, 271, 743, 307, 6626]] → shape: [batch, length]
Embedding → activation tensor → shape: [batch, d_model, length]
N Transformer layers → activation tensor → shape: [batch, d_model, length] (most compute is spent here)
Loss function (predict next token given previous) → loss ≈ 2.6
[Fancy autocomplete]
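Putting the functional view together: a minimal from-scratch sketch (toy sizes, random weights) of tokens → embedding + positions → N Transformer layers → next-token loss. The slides write activations as [batch, d_model, length]; the code below uses the more common [batch, length, d_model] layout, and pads the shorter sequence with zeros as a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerLM(nn.Module):
    """Token IDs in, next-token logits out: embedding -> N decoder layers -> softmax over vocab."""

    def __init__(self, vocab_size=50_257, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)          # learned positions for simplicity
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)   # used as a decoder via a causal mask
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                                         # [batch, length]
        b, length = token_ids.shape
        x = self.tok_emb(token_ids) + self.pos_emb(torch.arange(length))  # [batch, length, d_model]
        mask = nn.Transformer.generate_square_subsequent_mask(length)     # causal: no peeking ahead
        x = self.layers(x, mask=mask)                                     # [batch, length, d_model]
        return self.lm_head(x)                                            # [batch, length, vocab]

# Batch of 2 sequences (the token IDs from the slides; second one zero-padded).
tokens = torch.tensor([
    [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13],
    [3118, 291, 1098, 3435, 588, 795, 13210, 271, 743, 307, 6626, 0, 0],
])
model = TinyTransformerLM()
logits = model(tokens)

# "Predict next token given previous": shift targets left by one and use cross-entropy.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
print(f"next-token loss: {loss.item():.2f}")   # a random model gives ~ln(vocab) ≈ 10.8, not 2.6
```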