1 of 32


Applied Data Analysis (CS401)

Lecture 11

Handling text data

Part II

26 Nov 2025

Maria Brbic

2 of 32

Announcements

  • To all ADAmericans:

Happy Thanksgiving!

  • Homework H2 is due today EOD
    • Reminder: We won’t answer questions asked during the final 24h
  • Projects:
    • Milestone P2 grades have been released
    • Project Milestone P3 released today!
  • Friday’s lab session:
    • Exercises on handling text (Exercise 10)


3 of 32

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec11-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 32

Recap


Let me open my bag of tricks for bags of words for you! But only if you were good children...

Reminder: bag-of-words matrix

[Figure: matrix with rows = docs, columns = words]

5 of 32

Revisiting the 4 typical tasks

  • Document retrieval
  • Document classification
  • Sentiment analysis
  • Topic detection

  • TF-IDF matrix to the rescue
    • Entry for doc d, word w: tf(d, w) * idf(w)
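As a concrete illustration, here is a minimal sketch using scikit-learn's TfidfVectorizer on a made-up three-document corpus (note: scikit-learn uses a smoothed, L2-normalized variant of TF-IDF, so the entries are not literally tf * idf):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made-up documents)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
]

# Rows = docs, columns = words; each entry is a (normalized) variant of tf(d, w) * idf(w)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # sparse (n_docs x n_words) matrix

print(vectorizer.get_feature_names_out())   # vocabulary = column labels
print(tfidf.toarray().round(2))             # dense view of the TF-IDF matrix
```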


6 of 32

Typical task 1: document retrieval

  • Nearest-neighbor method in spirit of kNN
  • Compare query doc q to all documents in the collection (i.e., rows of the TF-IDF matrix)
  • Rank docs from collection in increasing order of distance
  • Distance metrics
    • Typically cosine distance (= 1 – cosine similarity)
    • Recall: cosine similarity of q and v = <q/|q|, v/|v|>
    • If rows are L2-normalized, may simply take dot products <q, v>
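A minimal sketch of this nearest-neighbor retrieval, assuming the toy corpus below and scikit-learn; the query is embedded with the same vectorizer that was fitted on the collection:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks fell sharply on monday"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)       # rows = docs of the collection

# Embed the query with the same vocabulary, then rank docs by cosine similarity
# (equivalently, by increasing cosine distance)
q = vectorizer.transform(["cat on a mat"])
sims = cosine_similarity(q, tfidf).ravel()
ranking = np.argsort(-sims)                  # most similar docs first
print(list(zip(ranking.tolist(), sims[ranking].round(2))))
```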


7 of 32

Typical task 1: document retrieval

  • This is just the most basic approach
  • For efficiency
    • Start by filtering documents by presence of query terms (use efficient full-text index)
    • Hugely narrows down set of documents to be ranked
  • Google et al. do much more…
    • Query-independent relevance: PageRank
    • Boost recent results
    • Personalization, contextualization


8 of 32

Typical task 2: document classification

  • Use TF-IDF matrix as feature matrix for supervised methods (cf. lecture 7)
  • Often more features (words) than documents
  • What’s the danger with this?
  • High model capacity can lead to overfitting (high variance)
  • Potential solutions:
    • Use more data (i.e., more labeled training docs)
    • Decrease model capacity:
      • Feature selection
      • Regularization (two slides from now)
      • Dimensionality reduction (a few slides from now)
    • Use ensemble methods such as random forests
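For illustration, a hedged sketch of document classification on TF-IDF features with an L2-regularized classifier (made-up labels; a real dataset would have far more docs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up labeled docs (1 = positive, 0 = negative)
docs   = ["great plot and acting", "boring and far too long",
          "loved every minute", "a complete waste of time"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels)

# L2-regularized logistic regression on TF-IDF features;
# C is the inverse regularization strength (smaller C = stronger penalty, lower capacity)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```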


9 of 32

Typical task 3: sentiment analysis

  • When treated as classification: Ctrl-C Ctrl-V previous slide
  • When treated as regression: pretty much the same (most supervised methods work for both classification and regression)


10 of 32

Regularization

  • E.g., linear regression: find the weight vector β that minimizes Σi (β · xi − yi)²
    (xi: feature vector of i-th data point; yi: label of i-th data point, e.g., sentiment [1–5 stars] expressed in document i)
  • If one word j appears only in docs with sentiment 5, we can obtain very small training error on these docs by making βj large enough
  • But doesn’t generalize to unseen test data!
  • Remedy: penalize very large positive and very large negative weights, i.e., minimize Σi (β · xi − yi)² + λ Σj βj²
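A sketch of this remedy with scikit-learn's Ridge (squared loss plus an L2 penalty; its alpha parameter plays the role of λ above), on made-up reviews with star ratings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, Ridge

# Made-up reviews and their star ratings (the regression targets)
docs  = ["terrible service", "terrible food and terrible service",
         "great food", "great food and great atmosphere"]
stars = np.array([1, 1, 5, 5])

X = TfidfVectorizer().fit_transform(docs)

ols   = LinearRegression().fit(X, stars)   # unregularized least squares
ridge = Ridge(alpha=1.0).fit(X, stars)     # least squares + alpha * ||beta||^2

# The penalty shrinks per-word weights toward zero, so no single word
# (e.g., one that only appears in 5-star reviews) can dominate the prediction
print(np.abs(ols.coef_).max().round(2), np.abs(ridge.coef_).max().round(2))
```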


11 of 32

Regularization


Which curve resulted from adding a regularization term to the loss function?

POLLING TIME

  • Scan QR code or go to�https://go.epfl.ch/ada2025-lec11-poll

12 of 32

Typical task 4: topic detection

  • Cluster rows of TF-IDF matrix (each row a data point)
  • Manually inspect clusters and label them with descriptive names (e.g., “news”, “sports”, “romance”, “tech”, “politics”)
  • In principle, may use k-means, k-medoids, etc.
  • But can be difficult if dimensionality is large (#words ≫ #docs)
    • “Curse of dimensionality”
    • Many outliers


What else can we do?

13 of 32

Typical task 4: topic detection

  • Alternative approach: matrix factorization

  • Assume docs and words have representation in (latent) “topic space”
  • (IDF-weighted) word frequency modeled as dot product of the doc’s vector and the word’s vector in topic space
  • #topics ≪ #words (→ “dimensionality reduction”): D×W entries → (D+W)×T entries
  • Topics interpretable in doc space (A’s cols) and word space (B’s rows)

[Figure: TF-IDF (docs × words) ≈ A (docs × “topics”) · B (“topics” × words)]

14 of 32

Typical task 4: topic detection

  • Optimization problem:
    • Find A, B such that AB is as�close to TF-IDF matrix as possible
    • That is, minimize Σd,w (T[d,w] − Ad · Bw)², where T is the TF-IDF matrix, Ad is the d-th row of A, and Bw is the w-th column of B
  • This is called latent semantic analysis (LSA)


15 of 32

Typical task 4: topic detection

You already know how to efficiently compute this from your linear algebra class: singular-value decomposition (SVD)

  • T = U S Vᵀ
  • Freebie: columns of U and V are orthonormal bases (yay!)
  • S is diagonal (with values in decreasing order) and captures “importance” of topic (amount of variation in corpus w.r.t. topic)
  • If you want k topics, keep only the first k columns of U and V, and the first k rows and columns of S → U′, S′, V′
  • E.g., A = U′, B = S′V′ᵀ or A = U′S′, B = V′ᵀ
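A minimal NumPy sketch of this truncation on a random stand-in for the TF-IDF matrix (in practice one would run a sparse solver such as scikit-learn's TruncatedSVD on the real, sparse matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.random((6, 10))                           # stand-in TF-IDF matrix: 6 docs x 10 words

U, s, Vt = np.linalg.svd(T, full_matrices=False)  # T = U @ diag(s) @ Vt

k = 2                                             # number of topics to keep
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# One possible split into doc-by-topic (A) and topic-by-word (B) factors
A = U_k @ S_k                                     # docs in topic space
B = Vt_k                                          # topics in word space
print(np.linalg.norm(T - A @ B).round(2))         # Frobenius reconstruction error
```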


16 of 32

Typical task 4: topic detection

  • Recall the potential problem with clustering, classification, and regression: “curse of dimensionality”
  • Matrix factorization via LSA solves these problems for you:
    • Use A instead of original TF-IDF matrix
    • That is, cluster (or learn to classify or regress) in topic space, rather than word space
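For instance, a small sketch (toy docs, scikit-learn) of clustering in topic space rather than word space: LSA via TruncatedSVD produces the doc-by-topic matrix A, and k-means then runs on A:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks fell sharply", "the match ended in a draw",
        "markets rallied on strong earnings", "the striker scored twice"]

tfidf = TfidfVectorizer().fit_transform(docs)

# LSA: project docs into a 2-dimensional topic space
lsa = TruncatedSVD(n_components=2, random_state=0)
A = lsa.fit_transform(tfidf)              # docs x topics (the "A" matrix)

# Cluster in topic space instead of the high-dimensional word space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)
print(labels)
```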

  • Topic representation from LSA is simply a vector, not a probability distribution over topics
  • Probabilistic: LDA = Latent Dirichlet Allocation (p.t.o.)


17 of 32


Commercial break

18 of 32

LDA: probabilistic topic modeling

  • Latent Dirichlet Allocation (not Linear Discriminant Analysis!)
  • Document := bag of words
  • Topic := probability distribution over words
  • Each document has a (latent) distribution over topics
  • “Generative story” for generating a doc of length n:
    d := sample a topic distribution for the doc (← “Dirichlet”)
    for i = 1, …, n:
        t := sample a topic from topic distribution d
        w := sample a word from topic t
        add w to the bag of words of the doc to be generated
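A tiny NumPy sketch of this generative story, with a made-up 4-word vocabulary and two hand-crafted topics (the real model would of course learn these):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab  = ["cat", "dog", "stock", "market"]
topics = np.array([[0.45, 0.45, 0.05, 0.05],   # a "pets" topic (distribution over words)
                   [0.05, 0.05, 0.45, 0.45]])  # a "finance" topic
alpha  = np.array([0.5, 0.5])                  # Dirichlet prior over topics

n = 8                                          # length of the doc to generate
d = rng.dirichlet(alpha)                       # the doc's (latent) distribution over topics
doc = []
for _ in range(n):
    t = rng.choice(len(topics), p=d)           # sample a topic for this word slot
    w = rng.choice(len(vocab), p=topics[t])    # sample a word from that topic
    doc.append(vocab[w])

print(d.round(2), doc)
```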


19 of 32


20 of 32

Topic inference in LDA

  • LDA is unsupervised (topics come out “magically”)
  • Input:
    • Docs represented as bags of words
    • Number K of topics
  • Output:
    • K topics (distributions over words)
    • For each doc: distribution over K topics
  • How is this done?
    • Find distributions (i.e., topics, docs) that maximize the likelihood of the observed documents (maximum likelihood)
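In practice the maximization is done with approximate (variational or sampling-based) inference; a hedged sketch with scikit-learn's LatentDirichletAllocation on a made-up corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat chased the mouse", "the dog chased the cat",
        "stocks fell and markets tumbled", "markets rallied as stocks rose"]

# LDA works on raw word counts (bags of words), not on TF-IDF
counts = CountVectorizer().fit_transform(docs)

K = 2                                          # number of topics, chosen by us
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(counts)         # per-doc distribution over the K topics

print(doc_topics.round(2))                     # each row sums to ~1
# lda.components_ holds per-topic word weights (K x vocabulary size);
# normalize each row to get a topic's distribution over words
print(lda.components_.shape)
```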


21 of 32

Question:

  • “Which of these word pairs is more closely related?”
    • (car, bus)
    • (car, astronaut)
  • How to quantify this?
  • Detour:
    • How to quantify closeness of two docs?
    • E.g., cosine of rows of TF-IDF matrix
  • Retour:
    • How to quantify closeness of two words?
    • E.g., cosine of cols of TF-IDF matrix
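A minimal sketch of the “retour”, on a made-up corpus: compare two words via the cosine of their columns in the TF-IDF matrix (on such tiny data the scores only illustrate the mechanics):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car and the bus stopped", "the bus and the car raced",
        "the astronaut boarded the rocket"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)
col = vec.vocabulary_                        # word -> column index

def word_sim(w1, w2):
    # Cosine similarity of two *columns* of the TF-IDF matrix
    c1 = tfidf[:, col[w1]].T
    c2 = tfidf[:, col[w2]].T
    return cosine_similarity(c1, c2)[0, 0]

print(word_sim("car", "bus"))                # > 0: they co-occur in the same docs
print(word_sim("car", "astronaut"))          # = 0: they never share a doc
```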


22 of 32

Sparsity in TF-IDF matrix

  • Two docs (i.e., rows of TF-IDF matrix)
    • “Do you love men?”
    • “Adorest thou the likes of Adam?”
    • Cosine of row vectors of TF-IDF matrix == 0
  • Same problem when comparing two words (i.e., cols of matrix)
  • Solution:
    • Move from sparse to dense vectors
    • But how?
    • Latent semantic analysis (LSA)!
      • Use columns of B as dense vectors representing words


23 of 32

“Word vectors”

  • Columns of TF-IDF matrix (sparse) or of topic-by-word matrix B (dense)
  • Problem:
    • Entire doc treated as one bag of words
    • All information about word proximity, syntax, etc., is lost
  • Solution:
    • Instead of full docs, consider local contexts: windows of L (e.g., 3) consecutive words to the left and right of the target word
    • Rows of matrix: not docs, but contexts

[Figure: word/context matrix M — rows = contexts, columns = words]

24 of 32

“Word vectors”

  • word2vec: factor the PMI matrix and use columns of B as word vectors
  • What to use as entries of word/context matrix?
  • Straightforward: same as TF-IDF, but with contexts as “pseudo-docs”: M[c,w] = TF-IDF(c,w)
  • May use any other measures of statistical association
  • E.g., pointwise mutual information (PMI): M[c,w] = PMI(c,w) = log( P(c,w) / (P(c) P(w)) ) — “How much more likely are c and w to occur together than if they were independent?”
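A small NumPy sketch of turning a (made-up) context-by-word count matrix into a PMI matrix; the clipping to positive values (PPMI) is a common variant, not required by the definition:

```python
import numpy as np

# Made-up co-occurrence counts: rows = contexts, columns = words
counts = np.array([[2, 1, 0],
                   [1, 3, 0],
                   [0, 0, 4]], dtype=float)

total = counts.sum()
p_cw = counts / total                        # joint P(c, w)
p_c  = p_cw.sum(axis=1, keepdims=True)       # marginal P(c)
p_w  = p_cw.sum(axis=0, keepdims=True)       # marginal P(w)

with np.errstate(divide="ignore"):
    pmi = np.log(p_cw / (p_c * p_w))         # PMI(c, w); -inf where the count is 0

ppmi = np.maximum(pmi, 0)                    # positive PMI (PPMI)
print(ppmi.round(2))
```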


25 of 32

Beyond bags of words


26 of 32

From words to texts

  • Word vectors represent, well, words
  • How to represent larger units, such as sentences, paragraphs, docs?
  • Typical approach: take sum/average of word vectors
  • Note: this is roughly also what bags of words are (when using “one-hot” encoding for words, i.e., vector with exactly one 1, rest 0)
  • More recently: learn vectors for longer units
    • Cr5, sent2vec
    • Convolutional neural networks
    • Recurrent neural networks, e.g., LSTM, ELMo
    • Transformer-based models, e.g., BERT (next slides), GPT-*
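The sum/average approach in a nutshell, assuming some pretrained word vectors have already been loaded into a dict (the three 3-dimensional vectors below are made up for illustration):

```python
import numpy as np

# Hypothetical pretrained word vectors (e.g., loaded from word2vec or GloVe files)
word_vectors = {
    "the":  np.array([0.1, 0.0, 0.2]),
    "bat":  np.array([0.7, 0.3, 0.1]),
    "flew": np.array([0.2, 0.8, 0.4]),
}

def doc_vector(text, dim=3):
    # Average the vectors of the known words: a crude but common text representation
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(doc_vector("The bat flew"))
```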


27 of 32

Contextualized word vectors

  • Motivating example:
    • “The bat flew into the cave.”
    • vs. “He swung the bat and hit a home run.”

  • Classic word vectors (e.g., word2vec) cannot distinguish these two cases; the same vector is used for both instances of “bat”

  • Solution: contextualized word vectors
    • E.g., BERT


28 of 32

BERT in a nutshell

  • Introduced in 2018 by Google Research

Inside the black box: some nasty neural network

[Figure: for inputs such as “<START> the bat flew” and “<START> he swung the bat”, BERT outputs one contextualized word vector per token, plus a vector for the whole doc]
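A hedged sketch of obtaining such contextualized vectors with the Hugging Face transformers library (this downloads the pretrained bert-base-uncased model on first use and requires PyTorch):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bat flew into the cave.",
             "He swung the bat and hit a home run."]

inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per token, shape (batch, sequence length, hidden size);
# the two occurrences of "bat" get different vectors
print(outputs.last_hidden_state.shape)

# The vector of the first ([CLS]) token is often used as a doc-level representation
doc_vectors = outputs.last_hidden_state[:, 0, :]
print(doc_vectors.shape)
```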

29 of 32

NLP pipeline

  • Tokenization
  • Sentence splitting
  • Part-of-speech (POS) tagging
  • Named-entity recognition (NER)
  • Coreference resolution
  • Parsing
    • Shallow parsing (a.k.a. chunking)
    • Constituency parsing
    • Dependency parsing


30 of 32

NLP pipeline

  • Implemented by Stanford CoreNLP, nltk, spaCy, etc.
  • Sequential model
    • Fixed order of steps
    • Early errors will propagate downstream
    • Fixed order not optimal for all cases (e.g., syntax usually done before semantics, but semantics might be useful for inferring syntax)
  • Hence, current research: learn all tasks jointly (early example)
  • To learn how all this magic is implemented
    • Take CS-431 (Intro to NLP), CS-552 (Modern NLP)
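As a taste of such a pipeline, a minimal spaCy sketch (assumes the small English model is installed via `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Maria Brbic teaches Applied Data Analysis at EPFL in Lausanne.")

# Tokenization, POS tagging, and dependency parsing from a single pipeline call
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named-entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```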


31 of 32

Today’s trend: generative language models

  • E.g., OpenAI GPT-*, ChatGPT
  • Input: text
  • Output: text
  • Many NLP tasks can be formulated in this framework, by “prompting” the language model with the right input
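A toy sketch of this text-in / text-out framing with the Hugging Face pipeline API and the small GPT-2 model (a model this small will not answer reliably; the point is only the prompting format):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Sentiment analysis rephrased as a prompt-completion task
prompt = ("Review: 'Loved every minute of it.'\n"
          "Question: Is the sentiment of this review positive or negative?\n"
          "Answer:")
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```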


32 of 32

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec11-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?