1 of 32


Applied Data Analysis (CS401)

Lecture 11

Handling text data

Part II

26 Nov 2025

Maria Brbic

2 of 32

Announcements

  • To all ADAmericans:

Happy Thanksgiving!

  • Homework H2 is due today EOD
    • Reminder: We won’t answer questions asked during the final 24h
  • Projects:
    • Milestone P2 grades have been released
    • Project Milestone P3 released today!
  • Friday’s lab session:
    • Exercises on handling text (Exercise 10)


3 of 32

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec11-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 32

Recap


Let me open my bag of tricks for bags of words for you! But only if you were good children...

Reminder: bag-of-words matrix

[Figure: matrix with rows = docs, columns = words]

5 of 32

Revisiting the 4 typical tasks

  • Document retrieval
  • Document classification
  • Sentiment analysis
  • Topic detection

  • TF-IDF matrix to the rescue
    • Entry for doc d, word w: tf(d, w) * idf(w)
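As a concrete illustration, here is a minimal sketch using scikit-learn's TfidfVectorizer on a made-up three-document corpus (note: scikit-learn uses a smoothed, L2-normalized variant of TF-IDF, so the entries are not literally tf * idf):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made-up documents)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
]

# Rows = docs, columns = words; each entry is a (normalized) variant of tf(d, w) * idf(w)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # sparse (n_docs x n_words) matrix

print(vectorizer.get_feature_names_out())   # vocabulary = column labels
print(tfidf.toarray().round(2))             # dense view of the TF-IDF matrix
```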


6 of 32

Typical task 1: document retrieval

  • Nearest-neighbor method in spirit of kNN
  • Compare query doc q to all documents in the collection (i.e., rows of the TF-IDF matrix)
  • Rank docs from collection in increasing order of distance
  • Distance metrics
    • Typically cosine distance (= 1 – cosine similarity)
    • Recall: cosine similarity of q and v = <q/|q|, v/|v|>
    • If rows are L2-normalized, may simply take dot products <q, v>
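A minimal sketch of this nearest-neighbor retrieval, assuming the toy corpus below and scikit-learn; the query is embedded with the same vectorizer that was fitted on the collection:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks fell sharply on monday"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)       # rows = docs of the collection

# Embed the query with the same vocabulary, then rank docs by cosine similarity
# (equivalently, by increasing cosine distance)
q = vectorizer.transform(["cat on a mat"])
sims = cosine_similarity(q, tfidf).ravel()
ranking = np.argsort(-sims)                  # most similar docs first
print(list(zip(ranking.tolist(), sims[ranking].round(2))))
```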


7 of 32

Typical task 1: document retrieval

  • This is just the most basic approach
  • For efficiency
    • Start by filtering documents by presence of query terms (use efficient full-text index)
    • Hugely narrows down set of documents to be ranked
  • Google et al. do much more…
    • Query-independent relevance: PageRank
    • Boost recent results
    • Personalization, contextualization


8 of 32

Typical task 2: document classification

  • Use TF-IDF matrix as feature matrix for supervised methods (cf. lecture 7)
  • Often more features (words) than documents
  • What’s the danger with this?
  • High model capacity can lead to overfitting (high variance)
  • Potential solutions:
    • Use more data (i.e., more labeled training docs)
    • Decrease model capacity:
      • Feature selection
      • Regularization (two slides from now)
      • Dimensionality reduction (a few slides from now)
    • Use ensemble methods such as random forests
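For illustration, a hedged sketch of document classification on TF-IDF features with an L2-regularized classifier (made-up labels; a real dataset would have far more docs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up labeled docs (1 = positive, 0 = negative)
docs   = ["great plot and acting", "boring and far too long",
          "loved every minute", "a complete waste of time"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels)

# L2-regularized logistic regression on TF-IDF features;
# C is the inverse regularization strength (smaller C = stronger penalty, lower capacity)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```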


9 of 32

Typical task 3: sentiment analysis

  • When treated as classification: Ctrl-C Ctrl-V previous slide
  • When treated as regression: pretty much the same (most supervised methods work for both classification and regression)


10 of 32

Regularization

  • E.g., linear regression: find the weight vector β that minimizes Σi (β · xi − yi)²
    (xi: feature vector of i-th data point; yi: label of i-th data point, e.g., sentiment [1–5 stars] expressed in document i)
  • If one word j appears only in docs with sentiment 5, we can obtain very small training error on these docs by making βj large enough
  • But doesn’t generalize to unseen test data!
  • Remedy: penalize very large positive and very large negative weights, i.e., minimize Σi (β · xi − yi)² + λ Σj βj²
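A sketch of this remedy with scikit-learn's Ridge (squared loss plus an L2 penalty; its alpha parameter plays the role of λ above), on made-up reviews with star ratings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, Ridge

# Made-up reviews and their star ratings (the regression targets)
docs  = ["terrible service", "terrible food and terrible service",
         "great food", "great food and great atmosphere"]
stars = np.array([1, 1, 5, 5])

X = TfidfVectorizer().fit_transform(docs)

ols   = LinearRegression().fit(X, stars)   # unregularized least squares
ridge = Ridge(alpha=1.0).fit(X, stars)     # least squares + alpha * ||beta||^2

# The penalty shrinks per-word weights toward zero, so no single word
# (e.g., one that only appears in 5-star reviews) can dominate the prediction
print(np.abs(ols.coef_).max().round(2), np.abs(ridge.coef_).max().round(2))
```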


11 of 32

Regularization


Which curve resulted from adding a regularization term to the loss function?

POLLING TIME

  • Scan QR code or go to�https://go.epfl.ch/ada2025-lec11-poll

12 of 32

Typical task 4: topic detection

  • Cluster rows of TF-IDF matrix (each row a data point)
  • Manually inspect clusters and label them with descriptive names (e.g., “news”, “sports”, “romance”, “tech”, “politics”)
  • In principle, may use k-means, k-medoids, etc.
  • But can be difficult if dimensionality is large (#words ≫ #docs)
    • “Curse of dimensionality”
    • Many outliers


What else can we do?

13 of 32

Typical task 4: topic detection

  • Alternative approach: matrix factorization

  • Assume docs and words have representation in (latent) “topic space”
  • (IDF-weighted) word frequency modeled as dot product of the doc’s vector and the word’s vector in topic space
  • #topics ≪ #words (→ “dimensionality reduction”): D×W entries → (D+W)×T entries
  • Topics interpretable in doc space (A’s cols) and word space (B’s rows)

[Figure: TF-IDF (docs × words) ≈ A (docs × “topics”) · B (“topics” × words)]

14 of 32

Typical task 4: topic detection

  • Optimization problem:
    • Find A, B such that AB is as�close to TF-IDF matrix as possible
    • That is, minimize Σd,w (T[d,w] − Ad · Bw)², where T is the TF-IDF matrix, Ad is the d-th row of A, and Bw is the w-th column of B
  • This is called latent semantic analysis (LSA)


15 of 32

Typical task 4: topic detection

You already know how to efficiently compute this from your linear algebra class: singular-value decomposition (SVD)

  • T = U S Vᵀ
  • Freebie: columns of U and V are orthonormal bases (yay!)
  • S is diagonal (with values in decreasing order) and captures “importance” of topic (amount of variation in corpus w.r.t. topic)
  • If you want k topics, keep only the first k columns of U and V, and the first k rows and columns of S → U′, S′, V′
  • E.g., A = U′, B = S′V′ᵀ or A = U′S′, B = V′ᵀ
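A minimal NumPy sketch of this truncation on a random stand-in for the TF-IDF matrix (in practice one would run a sparse solver such as scikit-learn's TruncatedSVD on the real, sparse matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.random((6, 10))                           # stand-in TF-IDF matrix: 6 docs x 10 words

U, s, Vt = np.linalg.svd(T, full_matrices=False)  # T = U @ diag(s) @ Vt

k = 2                                             # number of topics to keep
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# One possible split into doc-by-topic (A) and topic-by-word (B) factors
A = U_k @ S_k                                     # docs in topic space
B = Vt_k                                          # topics in word space
print(np.linalg.norm(T - A @ B).round(2))         # Frobenius reconstruction error
```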


16 of 32

Typical task 4: topic detection

  • Recall the potential problem with clustering, classification, and regression: “curse of dimensionality”
  • Matrix factorization via LSA solves these problems for you:
    • Use A instead of original TF-IDF matrix
    • That is, cluster (or learn to classify or regress) in topic space, rather than word space
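For instance, a small sketch (toy docs, scikit-learn) of clustering in topic space rather than word space: LSA via TruncatedSVD produces the doc-by-topic matrix A, and k-means then runs on A:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks fell sharply", "the match ended in a draw",
        "markets rallied on strong earnings", "the striker scored twice"]

tfidf = TfidfVectorizer().fit_transform(docs)

# LSA: project docs into a 2-dimensional topic space
lsa = TruncatedSVD(n_components=2, random_state=0)
A = lsa.fit_transform(tfidf)              # docs x topics (the "A" matrix)

# Cluster in topic space instead of the high-dimensional word space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A)
print(labels)
```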

  • Topic representation from LSA is simply a vector, not a probability distribution over topics
  • Probabilistic: LDA = Latent Dirichlet Allocation (p.t.o.)


17 of 32


Commercial break

18 of 32

LDA: probabilistic topic modeling

  • Latent Dirichlet Allocation (not Linear Discriminant Analysis!)
  • Document := bag of words
  • Topic := probability distribution over words
  • Each document has a (latent) distribution over topics
  • “Generative story” for generating a doc of length n:
    d := sample a topic distribution for the doc (← “Dirichlet”)
    for i = 1, …, n:
        t := sample a topic from topic distribution d
        w := sample a word from topic t
        add w to the bag of words of the doc to be generated
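A tiny NumPy sketch of this generative story, with a made-up 4-word vocabulary and two hand-crafted topics (the real model would of course learn these):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab  = ["cat", "dog", "stock", "market"]
topics = np.array([[0.45, 0.45, 0.05, 0.05],   # a "pets" topic (distribution over words)
                   [0.05, 0.05, 0.45, 0.45]])  # a "finance" topic
alpha  = np.array([0.5, 0.5])                  # Dirichlet prior over topics

n = 8                                          # length of the doc to generate
d = rng.dirichlet(alpha)                       # the doc's (latent) distribution over topics
doc = []
for _ in range(n):
    t = rng.choice(len(topics), p=d)           # sample a topic for this word slot
    w = rng.choice(len(vocab), p=topics[t])    # sample a word from that topic
    doc.append(vocab[w])

print(d.round(2), doc)
```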


19 of 32


20 of 32

Topic inference in LDA

  • LDA is unsupervised (topics come out “magically”)
  • Input:
    • Docs represented as bags of words
    • Number K of topics
  • Output:
    • K topics (distributions over words)
    • For each doc: distribution over K topics
  • How is this done?
    • Find distributions (i.e., topics, docs) that maximize the likelihood of the observed documents (maximum likelihood)
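In practice the maximization is done with approximate (variational or sampling-based) inference; a hedged sketch with scikit-learn's LatentDirichletAllocation on a made-up corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat chased the mouse", "the dog chased the cat",
        "stocks fell and markets tumbled", "markets rallied as stocks rose"]

# LDA works on raw word counts (bags of words), not on TF-IDF
counts = CountVectorizer().fit_transform(docs)

K = 2                                          # number of topics, chosen by us
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(counts)         # per-doc distribution over the K topics

print(doc_topics.round(2))                     # each row sums to ~1
# lda.components_ holds per-topic word weights (K x vocabulary size);
# normalize each row to get a topic's distribution over words
print(lda.components_.shape)
```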


21 of 32

Question:

  • “Which of these word pairs is more closely related?”
    • (car, bus)
    • (car, astronaut)
  • How to quantify this?
  • Detour:
    • How to quantify closeness of two docs?
    • E.g., cosine of rows of TF-IDF matrix
  • Retour:
    • How to quantify closeness of two words?
    • E.g., cosine of cols of TF-IDF matrix
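A minimal sketch of the “retour”, on a made-up corpus: compare two words via the cosine of their columns in the TF-IDF matrix (on such tiny data the scores only illustrate the mechanics):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car and the bus stopped", "the bus and the car raced",
        "the astronaut boarded the rocket"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)
col = vec.vocabulary_                        # word -> column index

def word_sim(w1, w2):
    # Cosine similarity of two *columns* of the TF-IDF matrix
    c1 = tfidf[:, col[w1]].T
    c2 = tfidf[:, col[w2]].T
    return cosine_similarity(c1, c2)[0, 0]

print(word_sim("car", "bus"))                # > 0: they co-occur in the same docs
print(word_sim("car", "astronaut"))          # = 0: they never share a doc
```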


22 of 32

Sparsity in TF-IDF matrix

  • Two docs (i.e., rows of TF-IDF matrix)
    • “Do you love men?”
    • “Adorest thou the likes of Adam?”
    • Cosine of row vectors of TF-IDF matrix == 0
  • Same problem when comparing two words (i.e., cols of matrix)
  • Solution:
    • Move from sparse to dense vectors
    • But how?
    • Latent semantic analysis (LSA)!
      • Use columns of B as dense vectors representing words


23 of 32

“Word vectors”

  • Columns of TF-IDF matrix (sparse) or of topic-by-word matrix B (dense)
  • Problem:
    • Entire doc treated as one bag of words
    • All information about word proximity, syntax, etc., is lost
  • Solution:
    • Instead of full docs, consider local contexts: windows of L (e.g., 3) consecutive words to the left and right of the target word
    • Rows of matrix: not docs, but contexts

[Figure: word/context matrix M — rows = contexts, columns = words]

24 of 32

“Word vectors”

  • word2vec: factor the PMI matrix and use columns of B as word vectors
  • What to use as entries of word/context matrix?
  • Straightforward: same as TF-IDF, but with contexts as “pseudo-docs”: M[c,w] = TF-IDF(c,w)
  • May use any other measures of statistical association
  • E.g., pointwise mutual information (PMI): M[c,w] = PMI(c,w) = log( P(c,w) / (P(c) P(w)) ) — “How much more likely are c and w to occur together than if they were independent?”
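A small NumPy sketch of turning a (made-up) context-by-word count matrix into a PMI matrix; the clipping to positive values (PPMI) is a common variant, not required by the definition:

```python
import numpy as np

# Made-up co-occurrence counts: rows = contexts, columns = words
counts = np.array([[2, 1, 0],
                   [1, 3, 0],
                   [0, 0, 4]], dtype=float)

total = counts.sum()
p_cw = counts / total                        # joint P(c, w)
p_c  = p_cw.sum(axis=1, keepdims=True)       # marginal P(c)
p_w  = p_cw.sum(axis=0, keepdims=True)       # marginal P(w)

with np.errstate(divide="ignore"):
    pmi = np.log(p_cw / (p_c * p_w))         # PMI(c, w); -inf where the count is 0

ppmi = np.maximum(pmi, 0)                    # positive PMI (PPMI)
print(ppmi.round(2))
```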


25 of 32

Beyond bags of words


26 of 32

From words to texts

  • Word vectors represent, well, words
  • How to represent larger units, such as sentences, paragraphs, docs?
  • Typical approach: take sum/average of word vectors
  • Note: this is roughly also what bags of words are (when using “one-hot” encoding for words, i.e., vector with exactly one 1, rest 0)
  • More recently: learn vectors for longer units
    • Cr5, sent2vec
    • Convolutional neural networks
    • Recurrent neural networks, e.g., LSTM, ELMo
    • Transformer-based models, e.g., BERT (next slides), GPT-*
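The sum/average approach in a nutshell, assuming some pretrained word vectors have already been loaded into a dict (the three 3-dimensional vectors below are made up for illustration):

```python
import numpy as np

# Hypothetical pretrained word vectors (e.g., loaded from word2vec or GloVe files)
word_vectors = {
    "the":  np.array([0.1, 0.0, 0.2]),
    "bat":  np.array([0.7, 0.3, 0.1]),
    "flew": np.array([0.2, 0.8, 0.4]),
}

def doc_vector(text, dim=3):
    # Average the vectors of the known words: a crude but common text representation
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(doc_vector("The bat flew"))
```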


27 of 32

Contextualized word vectors

  • Motivating example:
    • “The bat flew into the cave.”
    • vs. “He swung the bat and hit a home run.”

  • Classic word vectors (e.g., word2vec) cannot distinguish these two cases; the same vector is used for both instances of “bat”

  • Solution: contextualized word vectors
    • E.g., BERT


28 of 32

BERT in a nutshell

  • Introduced in 2018 by Google Research

Inside the black box: some nasty neural network

[Figure: for inputs such as “<START> the bat flew” and “<START> he swung the bat”, BERT outputs one contextualized word vector per token, plus a vector for the whole doc]
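A hedged sketch of obtaining such contextualized vectors with the Hugging Face transformers library (this downloads the pretrained bert-base-uncased model on first use and requires PyTorch):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bat flew into the cave.",
             "He swung the bat and hit a home run."]

inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per token, shape (batch, sequence length, hidden size);
# the two occurrences of "bat" get different vectors
print(outputs.last_hidden_state.shape)

# The vector of the first ([CLS]) token is often used as a doc-level representation
doc_vectors = outputs.last_hidden_state[:, 0, :]
print(doc_vectors.shape)
```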

29 of 32

NLP pipeline

  • Tokenization
  • Sentence splitting
  • Part-of-speech (POS) tagging
  • Named-entity recognition (NER)
  • Coreference resolution
  • Parsing
    • Shallow parsing (a.k.a. chunking)
    • Constituency parsing
    • Dependency parsing


30 of 32

NLP pipeline

  • Implemented by Stanford CoreNLP, nltk, spaCy, etc.
  • Sequential model
    • Fixed order of steps
    • Early errors will propagate downstream
    • Fixed order not optimal for all cases (e.g., syntax usually done before semantics, but semantics might be useful for inferring syntax)
  • Hence, current research: learn all tasks jointly (early example)
  • To learn how all this magic is implemented
    • Take CS-431 (Intro to NLP), CS-552 (Modern NLP)
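As a taste of such a pipeline, a minimal spaCy sketch (assumes the small English model is installed via `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Maria Brbic teaches Applied Data Analysis at EPFL in Lausanne.")

# Tokenization, POS tagging, and dependency parsing from a single pipeline call
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named-entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```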


31 of 32

Today’s trend: generative language models

  • E.g., OpenAI GPT-*, ChatGPT
  • Input: text
  • Output: text
  • Many NLP tasks can be formulated in this framework, by “prompting” the language model with the right input
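A toy sketch of this text-in / text-out framing with the Hugging Face pipeline API and the small GPT-2 model (a model this small will not answer reliably; the point is only the prompting format):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Sentiment analysis rephrased as a prompt-completion task
prompt = ("Review: 'Loved every minute of it.'\n"
          "Question: Is the sentiment of this review positive or negative?\n"
          "Answer:")
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```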


32 of 32

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec11-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?