Machine Learning: Week 14 – Large Language Models
Deep Learning
Attention
Attention in Cognitive Science
Issues with Recurrent Models (1)
Short-term dependencies: RNNs handle these well enough
Long-term dependencies: NOT handled well by RNNs
O(sequence length) steps for distant tokens to interact
Issues with Recurrent Models (2)
[Diagram: an unrolled recurrent network; the numbers 0, 1, 2, …, n indicate the minimum # of steps before a state can be computed]
Issues with Recurrent Models (3)
BiLSTM Encoder
LSTM Decoder
Solution: Attention
Encoder-Decoder with Attention
Solution: Attention
In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
In attention, the query matches all keys softly, each to a weight between 0 and 1. The keys’ values are multiplied by their weights and summed.
[Diagram, lookup table: the query (d) exactly matches one of the keys (a, b, c, d, e) and returns that key’s value (v4)]
[Diagram, attention: the query q is softly matched against all keys k1–k5; the values v1–v5 are multiplied by the resulting weights and summed into the output]
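The soft lookup above can be written as scaled dot-product attention. A minimal NumPy sketch (the 1/√d scaling follows Attention Is All You Need; all names and values here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    """Soft lookup: match q against every key, return the weighted sum of values."""
    d = K.shape[-1]
    weights = softmax(q @ K.T / np.sqrt(d))  # one weight in (0, 1) per key
    return weights @ V                       # weighted sum of the value vectors

# Toy example: 5 key/value pairs of dimension 4
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
q = 10 * K[3]                # a query that strongly matches the 4th key
out = attention(q, K, V)     # the weight on key index 3 is the largest
```

Unlike the hard lookup table, every value contributes to the output, in proportion to how well its key matches the query.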
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Sequence to Sequence Learning with Neural Networks (https://arxiv.org/pdf/1409.3215)
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Effective Approaches to Attention-based Neural Machine Translation (https://arxiv.org/pdf/1508.04025)
Image generated by Gemini 3 Pro
Attention for Machine Translation
https://alex.smola.org/talks/ICML19-attention.pdf
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Attention for Machine Translation
Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)
Attention for Image Captioning
Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/pdf/1502.03044)
Self-Attention as a New Building Block
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Encoder-Decoder with Attention
All words attend to all words in the previous layer
Self-Attention as a New Building Block
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Limitations of Self-Attention (1)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
[Diagram: a query q is matched against the (k, v) pairs of every token in “went to HUFS LAI at 2025 and learned”]
No order information! There is just summation over the set.
Position Representation Vectors
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…
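The sinusoidal position representations from Attention Is All You Need can be sketched as follows; these vectors are added (not concatenated) to the word embeddings at the first layer:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# Added to the word embeddings so the model can distinguish positions
emb = np.random.randn(8, 16)                   # 8 tokens, model dim 16
emb_with_pos = emb + positional_encoding(8, 16)
```

Each position gets a distinct vector, so the otherwise order-blind summation can recover word order.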
Position Representation Vectors
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805)
Limitations of Self-Attention (2)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
[Diagram: query q, keys k1–k5, values v1–v5, output]
Even stacking multiple layers cannot introduce any non-linearity, because the output is still a summation of value vectors.
Adding nonlinearities in self-attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
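The usual fix is a position-wise feed-forward network applied after self-attention; the ReLU supplies the missing non-linearity. A sketch with illustrative dimensions:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: applied to every token vector independently.
    The ReLU is what breaks 'just a weighted sum of value vectors'."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32                  # the original paper uses d_ff = 4 * d_model
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

x = rng.normal(size=(5, d_model))      # 5 token vectors out of self-attention
y = ffn(x, W1, b1, W2, b2)             # same shape, now non-linearly transformed
```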
Limitations of Self-Attention (3)
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Masking the Future in Self-Attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
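In a decoder, masking the future just means setting the attention scores for positions j > i to −∞ before the softmax, so future tokens receive zero weight (sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores[mask] = -np.inf                            # future positions -> weight 0
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = causal_self_attention(Q, K, V)
# Position 0 can only attend to itself, so out[0] equals V[0] exactly
```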
“Transformer”
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Understanding Transformer Step-by-Step
Transformer Decoder
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Multi-Head Attention
Attention head 1 attends to entities
Attention head 2 attends to syntactically relevant words
[Diagram: two attention heads over the same sentence “went to HUFS LAI at 2025 and learned”; each head has its own query q and (k, v) pairs, so each head produces a different weighting over the tokens]
Multi-Head Attention
https://web.stanford.edu/class/cs224n/
Attention Is All You Need (https://arxiv.org/pdf/1706.03762)
Multi-head attention = running multiple self-attention heads in parallel
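A sketch of the idea: project the input, split the projection into per-head slices, attend within each head, then concatenate and project. Shapes are illustrative; real implementations use reshapes instead of an explicit loop over heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):                      # each head sees a d_head-dim slice
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])   # each head attends differently
    return np.concatenate(heads, axis=-1) @ Wo    # concatenate + output projection

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
X = rng.normal(size=(6, d_model))                 # 6 tokens
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```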
Transformer …
Analysis of Transformer Attention
What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)
Could We Trust Attention? (1)
Attention is not Explanation (https://arxiv.org/pdf/1902.10186)
Could We Trust Attention? (2)
Is Attention Interpretable? (https://arxiv.org/pdf/1906.03731)
Advanced Topics
Large Language Models
Today’s Topic
What is a Language Model?
Example: Smartphone Keyboard
What is a Language Model?
Autoregressive Language Models
Google Cloud Tech -Introduction to large language models (https://www.youtube.com/watch?v=zizonToFXDs)
Autoregressive Language Models
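An autoregressive LM factorizes P(w_1, …, w_n) = Π_t P(w_t | w_1, …, w_{t−1}) and generates left to right. A toy sketch with a hand-written bigram table (a real LLM conditions on the whole prefix, not just the last token, and its probabilities are learned):

```python
# Toy bigram "language model": P(next | current), purely illustrative
P = {
    "<s>":    {"I": 1.0},
    "I":      {"went": 1.0},
    "went":   {"to": 1.0},
    "to":     {"school": 1.0},
    "school": {"</s>": 1.0},
}

def generate(max_len=10):
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = P[tokens[-1]]              # P(next token | context)
        nxt = max(dist, key=dist.get)     # greedy decoding: pick the most likely token
        tokens.append(nxt)
    return tokens

sentence = generate()
# -> ['<s>', 'I', 'went', 'to', 'school', '</s>']
```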
How to Train Language Models?
Generated by Gemini 3 Pro (Nano Banana Pro)
How to Train Language Models?
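Training is next-token prediction: minimize the average cross-entropy of the correct next tokens. A sketch (a model that is uniform over a vocabulary of size V has loss log V):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of the correct next tokens.
    logits: (seq_len, vocab_size) model scores; targets: (seq_len,) token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Uniform logits over a vocabulary of 4 -> loss is log(4)
logits = np.zeros((3, 4))          # 3 positions, vocab size 4
targets = np.array([0, 1, 2])      # the "correct" next tokens
loss = next_token_loss(logits, targets)   # = log(4) ≈ 1.386
```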
Generated by Gemini 3 Pro (Nano Banana Pro)
Rethinking Next Token Prediction
Generated by Gemini 3 Pro (Nano Banana Pro)
Next Token Prediction = Learning to Solve Tasks
Multitask Learning Emerges from Raw Text
Next Token Prediction = Knowledge Injection
Pre-training with Next Token Prediction
From Language Model to “Large” Language Model
Generated by Gemini 3 Pro (Nano Banana Pro)
What Makes a Language Model “Large”?
Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)
What Makes a Language Model “Large”?
Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)
What Makes a Language Model “Large”?
Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)
What Makes a Language Model “Large”?
Many-Shot In-Context Learning (https://arxiv.org/pdf/2404.11018)
What Makes a Language Model “Large”?
From Pre-trained LLMs to Practical Systems
How to Train Large Language Models?
Pre-training
Post-training
General Pre-training
Domain �Pre-training
Mid-training
Instruction Tuning
Preference Tuning
Reinforcement Learning
Pre-training
Post-training
Scaling Instruction-Finetuned Language Models (https://arxiv.org/pdf/2210.11416)
Post-training
https://huggingface.co/datasets/KORMo-Team/preference-dataset-qwen3/viewer/lose8B_win80B-A3B
https://huggingface.co/docs/transformers/main/chat_templating
User view
Data view
Model view
Post-training
Iterative Reasoning Preference Optimization (https://arxiv.org/pdf/2404.19733)
Post-training
Training language models to follow instructions with human feedback (https://arxiv.org/pdf/2203.02155)
Post-training
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (https://arxiv.org/pdf/2309.00267)
Post-training
MIT 6.S191 (Liquid AI): Large Language Models (https://www.youtube.com/watch?v=_HfdncCbMOE)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/pdf/2305.18290)
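The DPO loss from the paper above is −log σ(β[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]), where y_w is the chosen and y_l the rejected answer. A numeric sketch (the log-probabilities are made-up numbers):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: push the policy's (chosen - rejected) log-ratio above the reference's."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid(beta * margin)

# Policy already prefers the chosen answer more than the reference does -> low loss
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
# Policy prefers the rejected answer -> high loss
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
```

No reward model and no RL loop are needed; the preference data is used directly.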
LLM-related Framework
Masked Language Models
Conditional Language Models
How to Use Large Language Models?
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://arxiv.org/pdf/2201.11903)
Large Language Models are Zero-Shot Reasoners (https://arxiv.org/pdf/2205.11916)
How to Use Large Language Models?
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (https://arxiv.org/pdf/2005.11401)
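The retrieval half of RAG can be sketched as nearest-neighbor search over document embeddings; the retrieved text is then prepended to the LLM prompt as context. The vectors below are toy stand-ins; a real system uses a trained encoder and a vector index:

```python
import numpy as np

def retrieve(query_emb, doc_embs, docs, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = D @ q                          # cosine similarity to each document
    top = np.argsort(-sims)[:k]           # indices of the most similar documents
    return [docs[i] for i in top]

docs = ["HUFS is in Seoul.",
        "Attention was introduced for NMT.",
        "KV caching speeds up decoding."]
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(3, 8))        # toy 8-dim "embeddings"
query_emb = doc_embs[1] + 0.01 * rng.normal(size=8)   # query near document 1
context = retrieve(query_emb, doc_embs, docs, k=1)
```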
How to Use Large Language Models?
Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.
How to Use Large Language Models?
Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.
How to Use Large Language Models?
Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.
How to Use Large Language Models?
Local LLM – Basics
https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct
→ MODEL_NAME = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"
Local LLM – Basics
KV Caching Explained: Optimizing Transformer Inference Efficiency (https://huggingface.co/blog/not-lain/kv-caching)
KV Caching
Standard Inference
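The difference can be sketched with a toy single-head decoder loop: standard inference recomputes K and V for the whole prefix at every step, while a KV cache computes them once per token and appends (illustrative weights and embeddings):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8
Wk, Wv, Wq = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for step in range(5):                      # decode 5 tokens one at a time
    x = rng.normal(size=d)                 # embedding of the newest token
    # Standard inference would recompute K and V for ALL prefix tokens here.
    # With a KV cache we compute them only for the new token and append:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    q = x @ Wq                             # only the new token needs a query
    out = softmax(q @ K_cache.T / np.sqrt(d)) @ V_cache
```

Per step, the cached version does O(1) K/V projections instead of O(prefix length), at the cost of keeping the cache in memory.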
Local LLM – transformers & generate()
By Me!
Next
LLMs
Today’s Topic
Further Topics
References