Applied Data Analysis (CS401)
Lecture 11
Handling text data
Part II
26 Nov 2025
Maria Brbic
Announcements
Happy Thanksgiving!
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec11-feedback
Recap
Let me open my bag of tricks for bags of words for you! But only if you were good children...
[Figure: bag-of-words matrix (rows = docs, columns = words)]
Reminder: bag-of-words matrix
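To recap concretely, here is a minimal sketch of building such a docs × words matrix with scikit-learn's CountVectorizer (the toy documents are made up for illustration):

```python
# Minimal sketch: build a bag-of-words (docs x words) matrix with scikit-learn.
# The toy corpus below is illustrative, not from the lecture.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix, shape (n_docs, n_words)

print(vectorizer.get_feature_names_out())   # the "words" axis (vocabulary)
print(X.toarray())                          # one row per document, one column per word
```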
Revisiting the 4 typical tasks
Typical task 1: document retrieval
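One common way to implement retrieval on top of this matrix, sketched here with TF-IDF weighting and cosine similarity (corpus and query are illustrative, not from the lecture):

```python
# Sketch: rank documents by cosine similarity between their TF-IDF vectors
# and the TF-IDF vector of a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stock markets fell sharply on Monday",
    "the football match ended in a draw",
    "central banks raised interest rates again",
]
query = "interest rates and the stock market"

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)            # (n_docs, n_words)
q = vectorizer.transform([query])             # (1, n_words), same vocabulary

scores = cosine_similarity(q, D).ravel()      # one score per document
for i in scores.argsort()[::-1]:              # best match first
    print(f"{scores[i]:.3f}  {docs[i]}")
```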
Typical task 2: document classification
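For classification, each (TF-IDF-weighted) row becomes a feature vector for an ordinary classifier; a minimal sketch with logistic regression (texts and labels are made up):

```python
# Sketch: document classification from bag-of-words features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["free money, click now", "meeting moved to 3pm",
         "you won a prize!!!", "see attached project report"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["claim your free prize", "agenda for tomorrow's meeting"]))
```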
Typical task 3: sentiment analysis
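Sentiment analysis can be set up exactly like the classification above, with positive/negative labels; a much cruder baseline, shown only for contrast, is a lexicon count (the tiny word lists below are invented for illustration):

```python
# Sketch: lexicon-based sentiment score (toy word lists, illustration only).
POSITIVE = {"good", "great", "excellent", "love", "nice"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

def sentiment_score(text: str) -> int:
    """+1 per positive word, -1 per negative word; the sign gives the polarity."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(sentiment_score("the food was great and the service excellent"))   # > 0
print(sentiment_score("terrible movie, I hate wasting money"))           # < 0
```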
Regularization
minimize over the model parameters w:   loss(w) + λ · penalty(w)
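A quick sketch of the effect of the penalty weight, using ridge regression from scikit-learn on synthetic data (alpha plays the role of λ):

```python
# Sketch: stronger L2 regularization shrinks the fitted coefficients.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.arange(1, 11) + rng.normal(scale=0.5, size=50)   # synthetic targets

for alpha in [0.01, 1.0, 100.0]:              # alpha is the lambda of the objective
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(np.abs(model.coef_).mean(), 2))   # coefficients shrink as alpha grows
```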
Regularization
Which curve resulted from adding a regularization term to the loss function?
POLLING TIME
Typical task 4: topic detection
What else can we do?
Typical task 4: topic detection
TF-IDF (docs × words)  ≈  A (docs × "topics") · B ("topics" × words)
You already know how to efficiently compute this from your linear algebra class: singular-value decomposition (SVD).
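A minimal sketch of this factorization with scikit-learn's TruncatedSVD, i.e. latent semantic analysis on a TF-IDF matrix (the corpus is illustrative):

```python
# Sketch: low-rank factorization of a TF-IDF matrix via truncated SVD (LSA).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the striker scored two goals in the final",
    "the goalkeeper saved a late penalty",
    "the parliament passed the new budget",
    "the minister defended the tax reform",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                     # docs x words

svd = TruncatedSVD(n_components=2, random_state=0)
A = svd.fit_transform(X)                          # docs x "topics"
B = svd.components_                               # "topics" x words

terms = tfidf.get_feature_names_out()
for k, topic in enumerate(B):
    top = topic.argsort()[::-1][:4]               # strongest words per "topic"
    print(f"topic {k}:", [terms[i] for i in top])
```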
Commercial break
LDA: probabilistic topic modeling
Topic inference in LDA
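A hedged sketch of fitting and inspecting an LDA model with scikit-learn's LatentDirichletAllocation (LDA is usually fit on raw counts rather than TF-IDF; the corpus is illustrative):

```python
# Sketch: LDA on a small count matrix; each doc gets a mixture over topics,
# each topic a distribution over words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the championship after a dramatic final",
    "the coach praised the players for their defence",
    "the central bank raised interest rates to curb inflation",
    "investors worried about inflation and the stock market",
]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                 # docs x topics (mixture weights)

terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
print(doc_topics.round(2))                        # topic mixture of each document
```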
Question:
[Figure: TF-IDF matrix (docs × words)]
Sparsity in TF-IDF matrix
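To see the sparsity concretely, one can measure the fraction of non-zero entries in the TF-IDF matrix (sketch on a toy corpus; on a realistic corpus the density is typically far smaller):

```python
# Sketch: measure how sparse a TF-IDF matrix is.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "markets rallied after the announcement"]

X = TfidfVectorizer().fit_transform(docs)         # stored as a sparse matrix
density = X.nnz / (X.shape[0] * X.shape[1])       # fraction of non-zero cells
print(f"shape={X.shape}, non-zeros={X.nnz}, density={density:.2%}")
```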
“Word vectors”
[Figure: word-context co-occurrence matrix M (rows = words, columns = contexts)]
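One classic way to turn such a word-context count matrix into dense word vectors is to factorize it; a self-contained sketch with a ±1-word window and a truncated SVD (corpus and window size are illustrative):

```python
# Sketch: build a word-context co-occurrence matrix M and factorize it
# to get dense word vectors. Toy corpus, window of +/-1 word.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]
tokens = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(tokens)}

M = np.zeros((len(tokens), len(tokens)))          # words x contexts
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):                  # neighbours within the window
            if 0 <= j < len(words):
                M[index[w], index[words[j]]] += 1

vectors = TruncatedSVD(n_components=3, random_state=0).fit_transform(M)
print(dict(zip(tokens, np.round(vectors, 2))))    # one dense vector per word
```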
“Word vectors”
Beyond bags of words
From words to texts
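One simple baseline for going from word vectors to a text vector, not necessarily the method shown on the slide, is to average the vectors of the words in the document:

```python
# Sketch: document vector = average of its word vectors.
# `word_vectors` is a hypothetical dict from word to vector (e.g. from the SVD above).
import numpy as np

def doc_vector(text: str, word_vectors: dict) -> np.ndarray:
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:                                   # no known word: return a zero vector
        return np.zeros(next(iter(word_vectors.values())).shape)
    return np.mean(vecs, axis=0)

# Usage (illustrative vectors):
word_vectors = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.8, 0.2]), "milk": np.array([0.1, 0.9])}
print(doc_vector("the cat drinks milk", word_vectors))
```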
Contextualized word vectors
BERT in a nutshell
[Figure: BERT as a black box ("inside: some nasty neural network"): input tokens go in, contextualized word vectors come out, and these can be pooled into a doc vector. The two inputs "<START> the bat flew …" and "<START> he swung the bat …" yield different vectors for the same word "bat", because its context differs.]
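A sketch of reproducing this with the Hugging Face transformers library (the bert-base-uncased checkpoint and the example sentences are assumptions for illustration, not necessarily the lecture's setup):

```python
# Sketch: contextualized vectors for the same word in two different contexts.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["the bat flew out of the cave", "he swung the bat and missed"]

vectors = []
with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        out = model(**enc).last_hidden_state[0]            # (n_tokens, 768)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        vectors.append(out[tokens.index("bat")])           # vector for "bat" in this context

# Same word, different contexts -> different (though related) vectors.
cos = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'bat' vectors: {cos.item():.2f}")
```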
NLP pipeline
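A typical pipeline (tokenization, lemmatization, part-of-speech tagging, named-entity recognition) can be run with spaCy; a minimal sketch assuming the en_core_web_sm model is installed:

```python
# Sketch: a standard NLP preprocessing pipeline with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")                # tokenizer + tagger + parser + NER
doc = nlp("Maria gave a lecture on text data at EPFL on 26 November.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)
for ent in doc.ents:
    print(ent.text, ent.label_)                   # named entities found in the text
```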
Today’s trend: generative language models
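As a small illustration of the generative approach (the gpt2 checkpoint and the prompt are illustrative choices, not from the lecture):

```python
# Sketch: prompting a small generative language model with Hugging Face transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Text data can be analysed by", max_new_tokens=20)
print(out[0]["generated_text"])
```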
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec11-feedback