LLM Training and Applications
Lecture 22
Pre-training, post-training, and using LLMs.
EECS 189/289, Fall 2025 @ UC Berkeley
Joseph E. Gonzalez and Narges Norouzi
Recap – Causal Language Modeling
Predicting the Next Word is Knowledge
Predicting the next word allows you to: … Generative AI!
How do we model the next word (token)?
Example: The capital of California is ____________________
Candidate completions: Sacramento; San Francisco (1862); Benicia (1853); Vallejo (1852)
Quick Recap
Model: Pr(“EECS-189” | “The best class at UC Berkeley is”)
(the probability of the next token given the context of ordered tokens)
The best class at UC Berkeley is
The best class at UC Berkeley is EECS-189
The best class at UC Berkeley is EECS-189!
The best class at UC Berkeley is EECS-189!<stop>
Predicting the Next Token with a Transformer
Modeling the next token
Step 1: Tokenization of the Context
Model: Pr(next token | context of ordered tokens)
[Figure: the text “the cat in the hat” is split into the ordered tokens: the, cat, in, the, hat.]
[Figure continued: the context “the cat in the” is mapped to Token Ids 142, 307, 153, 142; the model must predict the next token, “hat”.]
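A toy sketch of this step with a hypothetical word-level vocabulary (real systems learn subword vocabularies, e.g. BPE; the ids for “the”, “cat”, “in” mirror the slide, and the id for “hat” is made up for illustration):

    # Hypothetical toy vocabulary mapping words to integer ids.
    vocab = {"the": 142, "cat": 307, "in": 153, "hat": 901}

    def tokenize(text):
        # Real tokenizers split into learned subwords; whitespace is enough here.
        return [vocab[w] for w in text.lower().split()]

    print(tokenize("the cat in the"))  # [142, 307, 153, 142]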
Embedding Table Lookup
Look up each token’s embedding in a learned embedding table.
[Figure: a Learned Embedding Table of shape Vocabulary × D; token ids 142, 307, 153, 142 (“the cat in the”) each select one row.]
Positional Encoding (optional)
Add fixed positional encoding to embeddings based on position
[Figure: a fixed positional-encoding table of shape Max Context Len. × D; the encoding for each position is added (+) to the corresponding token embedding before being passed into the transformer architecture.]
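A sketch of one common fixed choice, the sinusoidal encoding from the original Transformer paper (assumes D is even):

    import numpy as np

    def sinusoidal_positions(max_len, d):
        pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
        i = np.arange(0, d, 2)[None, :]          # (1, d/2) even dimension indices
        angles = pos / (10000.0 ** (i / d))
        pe = np.zeros((max_len, d))
        pe[:, 0::2] = np.sin(angles)             # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)             # cosine on odd dimensions
        return pe

    # x = embedding_table[token_ids] + sinusoidal_positions(max_len, D)[:len(token_ids)]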
Predicting the Next Token (Transformer)
[Figure: each input position (the/142, cat/307, in/153, the/142) is embedded via the Emb. Tbl., a Pos. encoding is added (+), and the result flows through a stack repeated L times (Layer Norm, Masked Attention, residual (+), Layer Norm, MLP, residual (+)), producing a D × 1 output vector per position.]
[Figure continued: only the last position’s D × 1 output is kept (“Ignore these” outputs at earlier positions); it is multiplied by the Learned Embedding Table (Vocabulary × D).]
[Figure continued: multiplying the last position’s D × 1 output x by the embedding table gives a Vocabulary × 1 vector of pre-activations (logits); a Softmax turns the logits into the probability that each token in the vocabulary is the next token.]
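A minimal NumPy sketch of this output step, tying the output projection to the embedding table as in the figure (h and E are assumed to come from the model):

    import numpy as np

    def next_token_probs(h, E):
        # h: (D,) hidden state of the last position; E: (V, D) embedding table.
        logits = E @ h                  # (V,) pre-activations (logits), one per token
        logits = logits - logits.max()  # subtract max for numerical stability
        p = np.exp(logits)
        return p / p.sum()              # probability of each candidate next token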
Training the Model
What is Chat GPT?
Chat: natural language system
G: Generatively – Designed to model the creation of text
P: Pretrained – Trained on lots of naturally occurring data
T: Transformer – A kind of neural network architecture
Chat GPT is just one example of a Large Language Model (LLM)
✅
✅
✅
Training the Model
[Figure: during training, every position’s D × 1 output (not just the last) is multiplied by the embedding table (Embeddingᵀ) and passed through a Softmax, giving a next-token distribution at every position.]
[Figure continued: the labels are the input sequence shifted by one: for input “the cat in the”, the labels are “cat”, “in”, “the”, “hat”. Minimize the cross-entropy loss (MLE) between each predicted distribution and its label.]
[Figure continued: a <start> token (id 0) is prepended so the model also predicts the first word, “the”. Minimize the cross-entropy loss (MLE) over all positions.]
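A minimal PyTorch sketch of this objective, assuming the model returns logits of shape (batch, length, vocab) and each sequence already starts with the <start> id:

    import torch.nn.functional as F

    def causal_lm_loss(logits, tokens):
        # Position t predicts token t+1, so align by shifting:
        # predictions at positions 0..T-2 vs. labels at positions 1..T-1.
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # (B*(T-1), V) predictions
            tokens[:, 1:].reshape(-1),                    # (B*(T-1),) next-token labels
        )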
For a sentence with N tokens, how many MLE prediction tasks are there?
Training the Model
[Figure as above: predictions and labels at every position.] Problem: without masking, each position could attend to future tokens, the very answers it is being trained to predict.
Masked Attention
[Figure: the attention pattern over the tokens “I swam to the bank”; each token may attend only to itself and earlier tokens.]
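A sketch of the mask itself, applied to raw attention scores before the softmax (lower-triangular: position t sees positions up to t):

    import torch

    T = 5  # "I", "swam", "to", "the", "bank"
    scores = torch.randn(T, T)                             # illustrative raw scores
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # True where attention is allowed
    scores = scores.masked_fill(~mask, float("-inf"))      # block future positions
    weights = scores.softmax(dim=-1)                       # row t attends only to 0..t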
Unrolling the Model
[Figure: the token “Can” is tokenized (BPE), embedded, and passed through the stacked blocks (Attn., FFN, Norms), producing Pr(you|Can).]
[Figure continued: the sampled token “you” is fed back in alongside “Can”, producing Pr(predict|Can you).]
[Figure continued: feeding “predict” back in produces Pr(the|Can you predict).]
[Figure continued: feeding “the” back in produces Pr(next|Can you predict the). Generation proceeds by repeatedly re-running the model on the growing context.]
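This unrolled decoding loop can be sketched in a few lines of PyTorch, assuming model maps a (1, T) id tensor to (1, T, V) logits:

    import torch

    def generate(model, ids, n_new, stop_id=None):
        for _ in range(n_new):
            logits = model(ids)               # (1, T, V) per-position predictions
            next_id = logits[0, -1].argmax()  # greedy: most probable next token
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # grow the context
            if stop_id is not None and next_id.item() == stop_id:
                break
        return ids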
Masked Attention
Auto-regressive: attend only to previous tokens.
This is the decoder-only Transformer.
[Figure: one pass over “Can you predict the next” with masked attention produces Pr(you|Can), Pr(predict|Can you), Pr(the|Can you predict), and Pr(next|Can you predict the) simultaneously.]
The Encoder-Decoder Transformer
The original transformer paper used both an encoder and a decoder. Why?
[Figure: the Encoder (Attention + FFN blocks) reads the input “Hola, cómo estás?”; the Decoder (Masked Attention + Cross Attention + FFN blocks) generates the output “<s> Hello, how are …”, emitting Hello, How, are, You.]
The encoder was used for the input and the decoder was used for the output.
Visual Language Models
Use a visual encoder to convert image patches into token embeddings that are fed to the LLM as tokens.
Describe this image:
[Figure: the prompt “Describe this image:” is tokenized, while the image is cut into patches; each patch passes through a linear (Lin.) projection, and the combined sequence flows through the transformer’s Attention and MLP blocks.]
Learn an “adapter” to transform image embeddings into token embeddings.
[Figure continued: image patches → Vision Encoder → Learned Adapter → token embeddings.]
[Figure continued: the adapted image tokens are concatenated with the text tokens from the Token Emb. Table and fed to the LLM, which carries out next-token prediction: “This …”.]
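A minimal sketch of the adapter idea (a single linear projection here; real systems may use an MLP or cross-attention; the dimensions are assumptions):

    import torch
    import torch.nn as nn

    class VisionAdapter(nn.Module):
        def __init__(self, d_vision, d_model):
            super().__init__()
            self.proj = nn.Linear(d_vision, d_model)  # image space -> token space

        def forward(self, patch_emb, text_emb):
            # patch_emb: (B, P, d_vision) from a frozen vision encoder
            # text_emb:  (B, T, d_model) from the LLM's token embedding table
            image_tokens = self.proj(patch_emb)       # (B, P, d_model)
            return torch.cat([image_tokens, text_emb], dim=1)  # LLM sees both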
Llama-3 Architecture
Input Tokens → BPE Token Embeddings
Repeated N times:
  RMSNorm → Grouped Query Attention (Q+RoPE, K+RoPE, V) → residual (+)
  RMSNorm → FFN with SwiGLU → residual (+)
RMSNorm → Linear → SoftMax
See the actual code (it’s just one Python script!)
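Sketches of two of these components, following their published definitions (hyperparameter defaults are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        # Rescale by the root-mean-square only: no mean subtraction, no bias.
        def __init__(self, d, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(d))
            self.eps = eps

        def forward(self, x):
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return self.weight * x * rms

    class SwiGLU(nn.Module):
        # Gated FFN: silu(x W1) elementwise-times (x W3), projected back by W2.
        def __init__(self, d, d_hidden):
            super().__init__()
            self.w1 = nn.Linear(d, d_hidden, bias=False)
            self.w3 = nn.Linear(d, d_hidden, bias=False)
            self.w2 = nn.Linear(d_hidden, d, bias=False)

        def forward(self, x):
            return self.w2(F.silu(self.w1(x)) * self.w3(x))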
Post-Norm vs. Pre-Norm
[Figure: the original block applies Layer Norm after each residual add (Post-Norm: Multi-head Self Attention, +, Layer Norm, MLP, +, Layer Norm); Llama-3 applies RMSNorm before each sublayer (Pre-Norm). Pre-Norm is better.]
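The difference in pseudocode (attn, mlp, norm1, norm2 stand in for the sublayers):

    def post_norm_block(x, attn, mlp, norm1, norm2):   # original Transformer
        x = norm1(x + attn(x))        # normalize AFTER the residual add
        return norm2(x + mlp(x))

    def pre_norm_block(x, attn, mlp, norm1, norm2):    # Llama-style
        x = x + attn(norm1(x))        # normalize BEFORE the sublayer;
        return x + mlp(norm2(x))      # the residual path stays unnormalized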
Going Big!
[Figure: the same Llama-3 architecture scaled up, alongside the Llama 2 model family.]
Llama-3 70B Instruct: 8192 hidden size, 80 layers, 64 query heads, 8 KV heads.
Pre-training on Everything*
Train the model on a large collection of data to learn generalizable patterns
Llama-3 “open-source” models were trained on 15.6T tokens from an unknown data mix.
*Everything that is legal to use for training, hopefully…
[Figure: the GPT-3 data mix, from OpenAI.]
[Figure: an evolutionary tree of language models, 2018–2023, rooted in Word2Vec, GloVe, FastText, ELMo, and ULMFiT, with three branches: Encoder-Only (BERT, DistilBERT, RoBERTa, ALBERT, ERNIE, XLNet, ELECTRA, DeBERTa, …), Encoder-Decoder (T5, mT5, T0, BART, Switch, GLM, Flan-T5, Tk, UL2, Flan-UL2, ST-MoE, ChatGLM, …), and Decoder-Only (GPT-1/2/3, GPT-Neo, GPT-J, GPT-NeoX, CodeX, Jurassic-1/2, LaMDA, Anthropic LM, Gopher, GLaM, MT-NLG, InstructGPT, Chinchilla, PaLM, Minerva, OPT, OPT-IML, BLOOM, BLOOMZ, YaLM, Galactica, Sparrow, ChatGPT, GPT-4, LLaMA, LLaMA-2-Chat, Claude, Claude-2, Bard, PaLM 2, Cohere, …).]
Source: https://github.com/Mooler0410/LLMsPracticalGuide/tree/main
Demo
Decoder Transformer Notebook
What is Chat GPT?
Chat: natural language system
G: Generatively – Designed to model the creation of text
P: Pretrained – Trained on lots of naturally occurring data
T: Transformer – A kind of neural network architecture
Now you know!
But there is still one more thing: GPT alone can’t chat!
✅
✅
✅
Post-training
Tuning language models to be useful.
Need to Teach the Model to Follow Instructions (and be helpful!)
Supervised Fine-tuning (SFT)
Run additional training iterations on a specific task.
The task differs from the original pre-training task. Example: chat conversations.
Use a smaller learning rate (as you get older you learn slower?).
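A sketch of the idea: the objective is unchanged from pre-training (next-token cross-entropy, as in the causal_lm_loss sketched earlier); only the data and the learning rate change. The names model and chat_dataloader, and the learning-rate value, are illustrative assumptions:

    import torch

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # much smaller than pre-training

    for batch in chat_dataloader:          # tokenized chat conversations (assumed)
        loss = causal_lm_loss(model(batch), batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()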
Open-source Instruction Fine-Tuned LLMs and LLM-as-a-Judge
Vicuña: the first open-source model that was “comparable” to ChatGPT.
Fine-tuned LLaMA-13B on ShareGPT data: tiny but high-quality (70K conversations, ~800 MB).
Helped launch academic open-source GenAI research.
LoRA Fine-Tuning
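LoRA freezes the pretrained weight matrix W and learns a low-rank update ΔW = BA, so only a small fraction of parameters are trained. A minimal sketch (the rank and scaling defaults are illustrative):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)  # freeze the pretrained weight
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero: start with no update
            self.scale = alpha / r

        def forward(self, x):
            # y = x W^T + scale * x (BA)^T, with only A and B trainable
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale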
Synthetic Data Augmentation �(Programming GenAI with GenAI)
Using an LLM to augment the data used to fine-tune an LLM.
Enables us to: …
[Figure: Original Data + Prompts → LLM (Augment) → Synthetically Enriched Data → Fine-tune → LLM′.]
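A sketch of the loop, where llm(prompt) is a hypothetical helper standing in for any chat-completion API and original_data is the seed dataset:

    def augment(example, n=3):
        # llm() is a hypothetical chat-completion helper, not a real API.
        prompt = ("Rewrite the following training example in a different style, "
                  f"keeping its meaning:\n{example}")
        return [llm(prompt) for _ in range(n)]   # n synthetic variants per example

    enriched = list(original_data)
    for ex in original_data:
        enriched.extend(augment(ex))             # fine-tune LLM' on `enriched`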
Reinforcement Learning from Human Feedback (RLHF)
Can further align the model with human preferences using human preference data: “this is better than that.”
Helps make models more robust and perform better in safety situations, but can be unstable and difficult to train.
[Figure: humans provide preference labels over pairs (Chat A > Chat B); these train a Learned Reward Model, whose Reward signal is used by Reinforcement Learning to tune the SFT LLM, e.g. toward responses like “That is such a great question! Let me help you break it down”.]
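The reward model is typically trained with a Bradley-Terry pairwise loss; a sketch, assuming r_chosen and r_rejected are the scalar rewards the model assigns to the preferred and dispreferred chats:

    import torch.nn.functional as F

    def reward_loss(r_chosen, r_rejected):
        # Maximize Pr(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
        return -F.logsigmoid(r_chosen - r_rejected).mean()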
Direct Preference Optimization (DPO)
Directly apply the Bradley-Terry model (HW2) to recast preference learning as maximum likelihood estimation, allowing direct optimization of the LLM.
From the paper (don’t need to memorize, just FYI):
$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$
[Figure: human preference labels (Chat A > Chat B) are used to fine-tune the SFT LLM directly.]
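A sketch of that objective as a loss function, where logp_* are the summed log-probabilities of the chosen (w) and rejected (l) responses under the policy being trained and the frozen SFT reference model:

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -F.logsigmoid(margin).mean()   # Bradley-Terry MLE on the margin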
Prompting and Test-time Compute
Using context to guide models and extract capabilities through chain-of-thought.
In-context Learning
Zero-shot relies on the model already “knowing” how to complete the task.
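For contrast, a zero-shot vs. few-shot prompt (the translation examples follow the style of the GPT-3 paper):

    zero_shot = "Translate English to French: cheese ->"

    few_shot = (
        "Translate English to French:\n"
        "sea otter -> loutre de mer\n"
        "plush giraffe -> girafe en peluche\n"
        "cheese ->"
    )
    # Few-shot demonstrations teach the task purely in-context: no weights change.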
Retrieval Augmented Generation (RAG)�Giving LLMs Specialized Knowledge
What happened in the stock market today?
[Figure: a Retriever searches News and Websites for documents relevant to the query; the Augmented Query (“Given the following set of documents: … What happened in the stock market today?”) is then sent to the LLM.]
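A minimal sketch of the retrieve-then-augment step, where embed(text) is a hypothetical helper returning a unit vector from any sentence-embedding model:

    import numpy as np

    def retrieve(query, docs, doc_vecs, k=3):
        # embed() is a hypothetical sentence-embedding helper (unit vectors),
        # so the dot product below is cosine similarity.
        sims = doc_vecs @ embed(query)
        return [docs[i] for i in np.argsort(-sims)[:k]]

    def augmented_query(query, docs, doc_vecs):
        context = "\n".join(retrieve(query, docs, doc_vecs))
        return f"Given the following set of documents:\n{context}\n\n{query}"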
Chain-of-thought Prompting
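A classic example in the style of the chain-of-thought paper (Wei et al.): instead of asking for the answer directly, the prompt demonstrates intermediate steps.

    cot_prompt = (
        "Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
        "How many apples do they have?\n"
        "A: They started with 23 apples, used 20, leaving 23 - 20 = 3. "
        "Buying 6 more gives 3 + 6 = 9. The answer is 9."
    )
    # Demonstrated (or elicited) step-by-step reasoning markedly improves multi-step problems.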
Reasoning in Large Language Models
Popularized by OpenAI o1 (in ChatGPT)
Behavior: leveraging long chain-of-thought with planning, reflection, self-refinement, and back-tracking.
How many of the numbers from the set $\{1,\ 2,\ 3,\ldots,\ 50\}$ have a perfect square factor other than one?
Prompt:
<|begin_of_thought|>
So I have this problem: "How many of the numbers from …
First, I need to understand what a perfect square is. A perfect square is …
One way to approach this is to count the numbers that are divisible by each of these perfect squares …
Let's list out the perfect squares less than or equal to 50: …
Next, I'll find the numbers in the set \(\{1, 2, 3, \ldots, 50\}\) that are divisible by each …
Starting with 4: …
…
If I add all these up: 12 + 5 + 3 + 2 + 1 + 1 = 24.
But wait, some numbers might be counted multiple times because they are divisible by more …
So, I need to use the principle of inclusion-exclusion to avoid double-counting. …
…
But wait, let's double-check this ….
…
Alternatively, maybe I should …
…
Wait, that's 20 numbers. But earlier I had 23. There must be a mistake in my earlier reasoning.
…
So, the final answer is 19.<|end_of_thought|> <|begin_of_solution|> …
450 lines (3000 words) of “Reasoning”
Agentic Systems: Enabling LLMs to Explore the World
What happened in the stock market today?
[Figure: the LLM is given Prompted Tools (Respond, Search Web, Calculator). Reasoning: “To answer this question, I need to check the stock market news for today. I will search for ‘Stock Market News’.” → Tool Call: Search Web(“Stock Market News”) → Tool Output.]
[Figure continued: each reasoning step, tool call, and tool output is appended to the History along with the original question, and the LLM is re-run on the growing history. It then reasons: “Given the news about tariffs, it would help to understand why the trade war was started before answering the question. I will search for ‘Trump Trade War’.” → Tool Call: Search Web(“Trump Trade War”).]
Agentic Systems: Enabling LLMs to Explore the World
[Figure: the human sets the goal; the LLM iteratively calls tools and reads their output. Actions can reveal information and modify the world.]
Reason-Act (ReAct) Agents: a lot of excitement about what these new systems can do.
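A sketch of the ReAct loop. Here llm is a hypothetical helper that returns either a tool call (step.tool, step.args) or a final answer (step.tool is None), and search_web / calculate are assumed tool implementations:

    # All names below (llm, search_web, calculate) are hypothetical helpers.
    TOOLS = {"Search Web": search_web, "Calculator": calculate}

    def react_agent(goal, max_steps=5):
        history = [goal]                    # the human sets the goal
        for _ in range(max_steps):
            step = llm("\n".join(history))  # model reasons, then picks an action
            if step.tool is None:
                return step.text            # model chose to respond
            history.append(step.text)                      # record the reasoning
            history.append(TOOLS[step.tool](step.args))    # record the tool output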
LLM Training and Applications
Lecture 22
Reading: Chapter 10 in Bishop Deep Learning Textbook