LLM Training and Applications
Lecture 22
Pre-training, post-training, and using LLMs.
EECS 189/289, Fall 2025 @ UC Berkeley
Joseph E. Gonzalez and Narges Norouzi
Recap – Causal Language Modeling
Predicting the Next Word is Knowledge
Predicting the next word allows you to: … Generative AI!
How do we model the next word (token)?
Example: The capital of California is ____________________
Candidate completions: Sacramento; San Francisco (1862); Benicia (1853); Vallejo (1852)
Quick Recap
Model: Pr(“EECS-189” | “The best class at UC Berkeley is”)
(the probability of the next token given the context of ordered tokens)
The best class at UC Berkeley is
The best class at UC Berkeley is EECS-189
The best class at UC Berkeley is EECS-189!
The best class at UC Berkeley is EECS-189!<stop>
Predicting the Next Token with a Transformer
Modeling the next token
Step 1: Tokenization of the Context
Model: Pr(next token | context of ordered tokens)
[Figure: the text “the cat in the hat” is split into the ordered tokens: the, cat, in, the, hat.]
[Figure continued: the context “the cat in the” is mapped to Token Ids 142, 307, 153, 142; the model must predict the next token, “hat”.]
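A toy sketch of this step with a hypothetical word-level vocabulary (real systems learn subword vocabularies, e.g. BPE; the ids for “the”, “cat”, “in” mirror the slide, and the id for “hat” is made up for illustration):

    # Hypothetical toy vocabulary mapping words to integer ids.
    vocab = {"the": 142, "cat": 307, "in": 153, "hat": 901}

    def tokenize(text):
        # Real tokenizers split into learned subwords; whitespace is enough here.
        return [vocab[w] for w in text.lower().split()]

    print(tokenize("the cat in the"))  # [142, 307, 153, 142]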
Embedding Table Lookup
Look up each token’s embedding in a learned embedding table.
[Figure: a Learned Embedding Table of shape Vocabulary × D; token ids 142, 307, 153, 142 (“the cat in the”) each select one row.]
Positional Encoding (optional)
Add fixed positional encoding to embeddings based on position
[Figure: a fixed positional-encoding table of shape Max Context Len. × D; the encoding for each position is added (+) to the corresponding token embedding before being passed into the transformer architecture.]
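A sketch of one common fixed choice, the sinusoidal encoding from the original Transformer paper (assumes D is even):

    import numpy as np

    def sinusoidal_positions(max_len, d):
        pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
        i = np.arange(0, d, 2)[None, :]          # (1, d/2) even dimension indices
        angles = pos / (10000.0 ** (i / d))
        pe = np.zeros((max_len, d))
        pe[:, 0::2] = np.sin(angles)             # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)             # cosine on odd dimensions
        return pe

    # x = embedding_table[token_ids] + sinusoidal_positions(max_len, D)[:len(token_ids)]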
Predicting the Next Token (Transformer)
[Figure: each input position (the/142, cat/307, in/153, the/142) is embedded via the Emb. Tbl., a Pos. encoding is added (+), and the result flows through a stack repeated L times (Layer Norm, Masked Attention, residual (+), Layer Norm, MLP, residual (+)), producing a D × 1 output vector per position.]
[Figure continued: only the last position’s D × 1 output is kept (“Ignore these” outputs at earlier positions); it is multiplied by the Learned Embedding Table (Vocabulary × D).]
[Figure continued: multiplying the last position’s D × 1 output x by the embedding table gives a Vocabulary × 1 vector of pre-activations (logits); a Softmax turns the logits into the probability that each token in the vocabulary is the next token.]
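A minimal NumPy sketch of this output step, tying the output projection to the embedding table as in the figure (h and E are assumed to come from the model):

    import numpy as np

    def next_token_probs(h, E):
        # h: (D,) hidden state of the last position; E: (V, D) embedding table.
        logits = E @ h                  # (V,) pre-activations (logits), one per token
        logits = logits - logits.max()  # subtract max for numerical stability
        p = np.exp(logits)
        return p / p.sum()              # probability of each candidate next token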
Training the Model
What is Chat GPT?
Chat: natural language system
G: Generatively – Designed to model the creation of text
P: Pretrained – Trained on lots of naturally occurring data
T: Transformer – A kind of neural network architecture
Chat GPT is just one example of a Large Language Model (LLM)
✅
✅
✅
Training the Model
[Figure: during training, every position’s D × 1 output (not just the last) is multiplied by the embedding table (Embeddingᵀ) and passed through a Softmax, giving a next-token distribution at every position.]
[Figure continued: the labels are the input sequence shifted by one: for input “the cat in the”, the labels are “cat”, “in”, “the”, “hat”. Minimize the cross-entropy loss (MLE) between each predicted distribution and its label.]
[Figure continued: a <start> token (id 0) is prepended so the model also predicts the first word, “the”. Minimize the cross-entropy loss (MLE) over all positions.]
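A minimal PyTorch sketch of this objective, assuming the model returns logits of shape (batch, length, vocab) and each sequence already starts with the <start> id:

    import torch.nn.functional as F

    def causal_lm_loss(logits, tokens):
        # Position t predicts token t+1, so align by shifting:
        # predictions at positions 0..T-2 vs. labels at positions 1..T-1.
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # (B*(T-1), V) predictions
            tokens[:, 1:].reshape(-1),                    # (B*(T-1),) next-token labels
        )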
For a sentence with N tokens, how many MLE prediction tasks are there?
Training the Model
[Figure as above: predictions and labels at every position.] Problem: without masking, each position could attend to future tokens, the very answers it is being trained to predict.
Masked Attention
[Figure: the attention pattern over the tokens “I swam to the bank”; each token may attend only to itself and earlier tokens.]
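A sketch of the mask itself, applied to raw attention scores before the softmax (lower-triangular: position t sees positions up to t):

    import torch

    T = 5  # "I", "swam", "to", "the", "bank"
    scores = torch.randn(T, T)                             # illustrative raw scores
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # True where attention is allowed
    scores = scores.masked_fill(~mask, float("-inf"))      # block future positions
    weights = scores.softmax(dim=-1)                       # row t attends only to 0..t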
Unrolling the Model
[Figure: the token “Can” is tokenized (BPE), embedded, and passed through the stacked blocks (Attn., FFN, Norms), producing Pr(you|Can).]
[Figure continued: the sampled token “you” is fed back in alongside “Can”, producing Pr(predict|Can you).]
[Figure continued: feeding “predict” back in produces Pr(the|Can you predict).]
[Figure continued: feeding “the” back in produces Pr(next|Can you predict the). Generation proceeds by repeatedly re-running the model on the growing context.]
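This unrolled decoding loop can be sketched in a few lines of PyTorch, assuming model maps a (1, T) id tensor to (1, T, V) logits:

    import torch

    def generate(model, ids, n_new, stop_id=None):
        for _ in range(n_new):
            logits = model(ids)               # (1, T, V) per-position predictions
            next_id = logits[0, -1].argmax()  # greedy: most probable next token
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # grow the context
            if stop_id is not None and next_id.item() == stop_id:
                break
        return ids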
Masked Attention
Auto-regressive: attend only to previous tokens.
This is the decoder-only Transformer.
[Figure: one pass over “Can you predict the next” with masked attention produces Pr(you|Can), Pr(predict|Can you), Pr(the|Can you predict), and Pr(next|Can you predict the) simultaneously.]
The Encoder-Decoder Transformer
The original transformer paper used both an encoder and a decoder. Why?
[Figure: the Encoder (Attention + FFN blocks) reads the input “Hola, cómo estás?”; the Decoder (Masked Attention + Cross Attention + FFN blocks) generates the output “<s> Hello, how are …”, emitting Hello, How, are, You.]
The encoder was used for the input and the decoder was used for the output.
Visual Language Models
Use a visual encoder to convert image patches into token embeddings that are fed to the LLM as tokens.
Describe this image:
[Figure: the prompt “Describe this image:” is tokenized, while the image is cut into patches; each patch passes through a linear (Lin.) projection, and the combined sequence flows through the transformer’s Attention and MLP blocks.]
Learn an “adapter” to transform image embeddings into token embeddings.
[Figure continued: image patches → Vision Encoder → Learned Adapter → token embeddings.]
[Figure continued: the adapted image tokens are concatenated with the text tokens from the Token Emb. Table and fed to the LLM, which carries out next-token prediction: “This …”.]
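A minimal sketch of the adapter idea (a single linear projection here; real systems may use an MLP or cross-attention; the dimensions are assumptions):

    import torch
    import torch.nn as nn

    class VisionAdapter(nn.Module):
        def __init__(self, d_vision, d_model):
            super().__init__()
            self.proj = nn.Linear(d_vision, d_model)  # image space -> token space

        def forward(self, patch_emb, text_emb):
            # patch_emb: (B, P, d_vision) from a frozen vision encoder
            # text_emb:  (B, T, d_model) from the LLM's token embedding table
            image_tokens = self.proj(patch_emb)       # (B, P, d_model)
            return torch.cat([image_tokens, text_emb], dim=1)  # LLM sees both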
Llama-3 Architecture
Input Tokens → BPE Token Embeddings
Repeated N times:
  RMSNorm → Grouped Query Attention (Q+RoPE, K+RoPE, V) → residual (+)
  RMSNorm → FFN with SwiGLU → residual (+)
RMSNorm → Linear → SoftMax
See the actual code (it’s just one Python script!)
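Sketches of two of these components, following their published definitions (hyperparameter defaults are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        # Rescale by the root-mean-square only: no mean subtraction, no bias.
        def __init__(self, d, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(d))
            self.eps = eps

        def forward(self, x):
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return self.weight * x * rms

    class SwiGLU(nn.Module):
        # Gated FFN: silu(x W1) elementwise-times (x W3), projected back by W2.
        def __init__(self, d, d_hidden):
            super().__init__()
            self.w1 = nn.Linear(d, d_hidden, bias=False)
            self.w3 = nn.Linear(d, d_hidden, bias=False)
            self.w2 = nn.Linear(d_hidden, d, bias=False)

        def forward(self, x):
            return self.w2(F.silu(self.w1(x)) * self.w3(x))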
Post-Norm vs. Pre-Norm
[Figure: the original block applies Layer Norm after each residual add (Post-Norm: Multi-head Self Attention, +, Layer Norm, MLP, +, Layer Norm); Llama-3 applies RMSNorm before each sublayer (Pre-Norm). Pre-Norm is better.]
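The difference in pseudocode (attn, mlp, norm1, norm2 stand in for the sublayers):

    def post_norm_block(x, attn, mlp, norm1, norm2):   # original Transformer
        x = norm1(x + attn(x))        # normalize AFTER the residual add
        return norm2(x + mlp(x))

    def pre_norm_block(x, attn, mlp, norm1, norm2):    # Llama-style
        x = x + attn(norm1(x))        # normalize BEFORE the sublayer;
        return x + mlp(norm2(x))      # the residual path stays unnormalized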
Going Big!
[Figure: the same Llama-3 architecture scaled up, alongside the Llama 2 model family.]
Llama-3 70B Instruct: 8192 hidden size, 80 layers, 64 query heads, 8 KV heads.
Pre-training on Everything*
Train the model on a large collection of data to learn generalizable patterns
Llama-3 “open-source” models were trained on 15.6T tokens from an unknown data mix.
*Everything that is legal to use for training, hopefully…
[Figure: the GPT-3 data mix, from OpenAI.]
[Figure: an evolutionary tree of language models, 2018–2023, rooted in Word2Vec, GloVe, FastText, ELMo, and ULMFiT, with three branches: Encoder-Only (BERT, DistilBERT, RoBERTa, ALBERT, ERNIE, XLNet, ELECTRA, DeBERTa, …), Encoder-Decoder (T5, mT5, T0, BART, Switch, GLM, Flan-T5, Tk, UL2, Flan-UL2, ST-MoE, ChatGLM, …), and Decoder-Only (GPT-1/2/3, GPT-Neo, GPT-J, GPT-NeoX, CodeX, Jurassic-1/2, LaMDA, Anthropic LM, Gopher, GLaM, MT-NLG, InstructGPT, Chinchilla, PaLM, Minerva, OPT, OPT-IML, BLOOM, BLOOMZ, YaLM, Galactica, Sparrow, ChatGPT, GPT-4, LLaMA, LLaMA-2-Chat, Claude, Claude-2, Bard, PaLM 2, Cohere, …).]
Source: https://github.com/Mooler0410/LLMsPracticalGuide/tree/main
Demo
Decoder Transformer Notebook
What is Chat GPT?
Chat: natural language system
G: Generatively – Designed to model the creation of text
P: Pretrained – Trained on lots of naturally occurring data
T: Transformer – A kind of neural network architecture
Now you know!
But there is still one more thing: GPT alone can’t chat!
✅
✅
✅
Post-training
Tuning language models to be useful.
Need to Teach the Model to Follow Instructions (and be helpful!)
Supervised Fine-tuning (SFT)
Run additional training iterations on a specific task.
The task differs from the original pre-training task. Example: chat conversations.
Use a smaller learning rate (as you get older you learn slower?).
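A sketch of the idea: the objective is unchanged from pre-training (next-token cross-entropy, as in the causal_lm_loss sketched earlier); only the data and the learning rate change. The names model and chat_dataloader, and the learning-rate value, are illustrative assumptions:

    import torch

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # much smaller than pre-training

    for batch in chat_dataloader:          # tokenized chat conversations (assumed)
        loss = causal_lm_loss(model(batch), batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()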
Open-source Instruction Fine-Tuned LLMs and LLM-as-a-Judge
Vicuña: the first open-source model that was “comparable” to ChatGPT.
Fine-tuned LLaMA-13B on ShareGPT data: tiny but high-quality (70K conversations, ~800 MB).
Helped launch academic open-source GenAI research.
LoRA Fine-Tuning
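LoRA freezes the pretrained weight matrix W and learns a low-rank update ΔW = BA, so only a small fraction of parameters are trained. A minimal sketch (the rank and scaling defaults are illustrative):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)  # freeze the pretrained weight
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero: start with no update
            self.scale = alpha / r

        def forward(self, x):
            # y = x W^T + scale * x (BA)^T, with only A and B trainable
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale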
Synthetic Data Augmentation �(Programming GenAI with GenAI)
Using an LLM to augment the data used to fine-tune an LLM.
Enables us to: …
[Figure: Original Data + Prompts → LLM (Augment) → Synthetically Enriched Data → Fine-tune → LLM′.]
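A sketch of the loop, where llm(prompt) is a hypothetical helper standing in for any chat-completion API and original_data is the seed dataset:

    def augment(example, n=3):
        # llm() is a hypothetical chat-completion helper, not a real API.
        prompt = ("Rewrite the following training example in a different style, "
                  f"keeping its meaning:\n{example}")
        return [llm(prompt) for _ in range(n)]   # n synthetic variants per example

    enriched = list(original_data)
    for ex in original_data:
        enriched.extend(augment(ex))             # fine-tune LLM' on `enriched`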
Reinforcement Learning from Human Feedback (RLHF)
Can further align the model with human preferences using human preference data: “this is better than that.”
Helps make models more robust and perform better in safety situations, but can be unstable and difficult to train.
[Figure: humans provide preference labels over pairs (Chat A > Chat B); these train a Learned Reward Model, whose Reward signal is used by Reinforcement Learning to tune the SFT LLM, e.g. toward responses like “That is such a great question! Let me help you break it down”.]
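The reward model is typically trained with a Bradley-Terry pairwise loss; a sketch, assuming r_chosen and r_rejected are the scalar rewards the model assigns to the preferred and dispreferred chats:

    import torch.nn.functional as F

    def reward_loss(r_chosen, r_rejected):
        # Maximize Pr(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
        return -F.logsigmoid(r_chosen - r_rejected).mean()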
Direct Preference Optimization (DPO)
Directly apply the Bradley-Terry model (HW2) to recast preference learning as maximum likelihood estimation, allowing direct optimization of the LLM.
From the paper (don’t need to memorize, just FYI):
$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$
[Figure: human preference labels (Chat A > Chat B) are used to fine-tune the SFT LLM directly.]
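A sketch of that objective as a loss function, where logp_* are the summed log-probabilities of the chosen (w) and rejected (l) responses under the policy being trained and the frozen SFT reference model:

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -F.logsigmoid(margin).mean()   # Bradley-Terry MLE on the margin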
Prompting and Test-time Compute
Using context to guide models and extract capabilities through chain-of-thought.
In-context Learning
Zero-shot relies on the model already “knowing” how to complete the task.
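For contrast, a zero-shot vs. few-shot prompt (the translation examples follow the style of the GPT-3 paper):

    zero_shot = "Translate English to French: cheese ->"

    few_shot = (
        "Translate English to French:\n"
        "sea otter -> loutre de mer\n"
        "plush giraffe -> girafe en peluche\n"
        "cheese ->"
    )
    # Few-shot demonstrations teach the task purely in-context: no weights change.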
Retrieval Augmented Generation (RAG)�Giving LLMs Specialized Knowledge
What happened in the stock market today?
[Figure: a Retriever searches News and Websites for documents relevant to the query; the Augmented Query (“Given the following set of documents: … What happened in the stock market today?”) is then sent to the LLM.]
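A minimal sketch of the retrieve-then-augment step, where embed(text) is a hypothetical helper returning a unit vector from any sentence-embedding model:

    import numpy as np

    def retrieve(query, docs, doc_vecs, k=3):
        # embed() is a hypothetical sentence-embedding helper (unit vectors),
        # so the dot product below is cosine similarity.
        sims = doc_vecs @ embed(query)
        return [docs[i] for i in np.argsort(-sims)[:k]]

    def augmented_query(query, docs, doc_vecs):
        context = "\n".join(retrieve(query, docs, doc_vecs))
        return f"Given the following set of documents:\n{context}\n\n{query}"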
Chain-of-thought Prompting
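A classic example in the style of the chain-of-thought paper (Wei et al.): instead of asking for the answer directly, the prompt demonstrates intermediate steps.

    cot_prompt = (
        "Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
        "How many apples do they have?\n"
        "A: They started with 23 apples, used 20, leaving 23 - 20 = 3. "
        "Buying 6 more gives 3 + 6 = 9. The answer is 9."
    )
    # Demonstrated (or elicited) step-by-step reasoning markedly improves multi-step problems.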
Reasoning in Large Language Models
Popularized by OpenAI o1 (in ChatGPT)
Behavior: leveraging long chain-of-thought with planning, reflection, self-refinement, and back-tracking.
How many of the numbers from the set $\{1,\ 2,\ 3,\ldots,\ 50\}$ have a perfect square factor other than one?
Prompt:
<|begin_of_thought|>
So I have this problem: "How many of the numbers from …
First, I need to understand what a perfect square is. A perfect square is …
One way to approach this is to count the numbers that are divisible by each of these perfect squares …
Let's list out the perfect squares less than or equal to 50: …
Next, I'll find the numbers in the set \(\{1, 2, 3, \ldots, 50\}\) that are divisible by each …
Starting with 4: …
…
If I add all these up: 12 + 5 + 3 + 2 + 1 + 1 = 24.
But wait, some numbers might be counted multiple times because they are divisible by more …
So, I need to use the principle of inclusion-exclusion to avoid double-counting. …
…
But wait, let's double-check this ….
…
Alternatively, maybe I should …
…
Wait, that's 20 numbers. But earlier I had 23. There must be a mistake in my earlier reasoning.
…
So, the final answer is 19.<|end_of_thought|> <|begin_of_solution|> …
450 lines (3000 words) of “Reasoning”
Agentic Systems: Enabling LLMs to Explore the World
What happened in the stock market today?
[Figure: the LLM is given Prompted Tools (Respond, Search Web, Calculator). Reasoning: “To answer this question, I need to check the stock market news for today. I will search for ‘Stock Market News’.” → Tool Call: Search Web(“Stock Market News”) → Tool Output.]
[Figure continued: each reasoning step, tool call, and tool output is appended to the History along with the original question, and the LLM is re-run on the growing history. It then reasons: “Given the news about tariffs, it would help to understand why the trade war was started before answering the question. I will search for ‘Trump Trade War’.” → Tool Call: Search Web(“Trump Trade War”).]
Agentic Systems: Enabling LLMs to Explore the World
[Figure: the human sets the goal; the LLM iteratively calls tools and reads their output. Actions can reveal information and modify the world.]
Reason-Act (ReAct) Agents: a lot of excitement about what these new systems can do.
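A sketch of the ReAct loop. Here llm is a hypothetical helper that returns either a tool call (step.tool, step.args) or a final answer (step.tool is None), and search_web / calculate are assumed tool implementations:

    # All names below (llm, search_web, calculate) are hypothetical helpers.
    TOOLS = {"Search Web": search_web, "Calculator": calculate}

    def react_agent(goal, max_steps=5):
        history = [goal]                    # the human sets the goal
        for _ in range(max_steps):
            step = llm("\n".join(history))  # model reasons, then picks an action
            if step.tool is None:
                return step.text            # model chose to respond
            history.append(step.text)                      # record the reasoning
            history.append(TOOLS[step.tool](step.args))    # record the tool output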
LLM Training and Applications
Lecture 22
Reading: Chapter 10 in Bishop Deep Learning Textbook