1 of 41

Assignment 3 and Vector Embeddings

CSE 447 / 517

Feb 6th, 2025 (Week 5)

2 of 41

Logistics

  • Project Checkpoint 2 is due on Monday, 2/10
  • Assignment 3 (A3) is due on Wednesday, 2/12

3 of 41

Agenda

  • Assignment 3 Preview
  • Vector Embeddings
    • Static word embeddings
    • Contextualized word embeddings
  • Interactive Word Embeddings

4 of 41

Assignment 3 Preview

  • Geometry of Word Embeddings
  • Analogies
  • Bias in Word Embeddings
  • From Word Embeddings to Sentence Level Embeddings
  • KNN Classifier Using GloVe-Based Sentence Embeddings

5 of 41

Geometry of Word Embeddings

  • Cosine similarity
    • cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  • Find synonym
    • Cosine metric: take the choice_embedding closest to the word embedding
    • Note when to use argmax (cosine similarity, where larger means closer) versus argmin (distance metrics, where smaller means closer); see the sketch below
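A minimal PyTorch sketch of this step. The names (cosine_similarity, find_synonym, choice_embeddings) are illustrative rather than the assignment's exact API; the point is the argmax-vs-argmin distinction.

    import torch
    import torch.nn.functional as F

    def cosine_similarity(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # cos(u, v) = (u . v) / (||u|| * ||v||)
        return torch.dot(u, v) / (torch.norm(u) * torch.norm(v))

    def find_synonym(word_embedding: torch.Tensor,
                     choice_embeddings: torch.Tensor,
                     metric: str = "cosine") -> int:
        # choice_embeddings: (num_choices, dim), one row per candidate word.
        if metric == "cosine":
            # Higher cosine similarity means closer, so take the argmax.
            sims = F.cosine_similarity(choice_embeddings,
                                       word_embedding.unsqueeze(0), dim=1)
            return int(torch.argmax(sims).item())
        # For a distance metric (e.g. Euclidean), smaller means closer: argmin.
        dists = torch.norm(choice_embeddings - word_embedding, dim=1)
        return int(torch.argmin(dists).item())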

6 of 41

Analogies

  • Find analogy word (a : aa :: b : ?)
    • The linear direction is the offset from the embedding of a to the embedding of aa
    • The analogy_vector offsets the embedding of b by that linear direction; see the sketch below
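A sketch of the offset trick, assuming the embeddings are PyTorch tensors and the candidate vocabulary is a (vocab_size, dim) matrix (names are illustrative):

    import torch
    import torch.nn.functional as F

    def find_analogy_word(a: torch.Tensor, aa: torch.Tensor, b: torch.Tensor,
                          vocab_embeddings: torch.Tensor) -> int:
        # Linear direction of the relation: from a to aa.
        direction = aa - a
        # Offset b's embedding by that direction to get the analogy vector.
        analogy_vector = b + direction
        # Return the candidate closest to the analogy vector (cosine, so argmax).
        sims = F.cosine_similarity(vocab_embeddings,
                                   analogy_vector.unsqueeze(0), dim=1)
        return int(torch.argmax(sims).item())

In practice you would usually also exclude a, aa, and b themselves from the candidate set.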

7 of 41

Bias in Word Embeddings

  • Word association with an attribute
    • Similarity of a word with attribute set A: mean(cos_similarity(word_embedding, a_embedding)) over all a_embeddings
    • Association is the difference between the A similarity and the B similarity; see the sketch below
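A sketch of the association computation (tensor shapes and function names are assumptions of this sketch):

    import torch
    import torch.nn.functional as F

    def attribute_similarity(word_embedding: torch.Tensor,
                             attribute_embeddings: torch.Tensor) -> torch.Tensor:
        # Mean cosine similarity between the word and every attribute word.
        sims = F.cosine_similarity(attribute_embeddings,
                                   word_embedding.unsqueeze(0), dim=1)
        return sims.mean()

    def association(word_embedding: torch.Tensor,
                    a_embeddings: torch.Tensor,
                    b_embeddings: torch.Tensor) -> torch.Tensor:
        # Difference between the word's mean similarity to set A and to set B.
        return (attribute_similarity(word_embedding, a_embeddings)
                - attribute_similarity(word_embedding, b_embeddings))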

8 of 41

Bias in Word Embeddings

  • Word Embedding Association Test (WEAT)
    • Target word sets
      • X: flower names
      • Y: insect names
    • Attribute word sets
      • A: pleasant terms
      • B: unpleasant terms
    • Differential association: (association of X – association of Y).item(); see the sketch below
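A sketch of the differential (X vs. Y) association; the helper mirrors the association function above, and the exact normalization of the real WEAT statistic may differ from this simplified version:

    import torch
    import torch.nn.functional as F

    def _association(w, a_emb, b_emb):
        # Mean cosine similarity with attribute set A minus attribute set B.
        sim_a = F.cosine_similarity(a_emb, w.unsqueeze(0), dim=1).mean()
        sim_b = F.cosine_similarity(b_emb, w.unsqueeze(0), dim=1).mean()
        return sim_a - sim_b

    def differential_association(x_emb, y_emb, a_emb, b_emb) -> float:
        # X, Y: target sets (e.g. flower vs. insect names);
        # A, B: attribute sets (pleasant vs. unpleasant terms).
        assoc_x = torch.stack([_association(x, a_emb, b_emb) for x in x_emb]).mean()
        assoc_y = torch.stack([_association(y, a_emb, b_emb) for y in y_emb]).mean()
        # .item() converts the 0-dim tensor to a plain Python float.
        return (assoc_x - assoc_y).item()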

9 of 41

From Word Embeddings to Sentence Level Embeddings

  • Get sentence embedding
    • Tokens: word_tokenize(sentence), keeping only tokens that are in the embeddings
    • Use_POS
      • pos_tags = nltk.pos_tag(tokens); each entry is a (token, tag) pair, so use the tag at index 1
      • The sentence embedding is the weighted sum of the token embeddings
      • Note the shape of pos_weights: use pos_weights.view(-1, 1) so it broadcasts correctly; see the sketch below
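A sketch of a POS-weighted sentence embedding; the embeddings dict, the pos_weight_map, and the unweighted-mean fallback are assumptions of this sketch, not necessarily A3's exact specification:

    import torch
    import nltk
    from nltk.tokenize import word_tokenize

    # Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

    def get_sentence_embedding(sentence, embeddings, pos_weight_map=None):
        # Keep only tokens that actually have an embedding.
        tokens = [t for t in word_tokenize(sentence) if t in embeddings]
        token_embeddings = torch.stack([embeddings[t] for t in tokens])   # (n, dim)

        if pos_weight_map is None:
            return token_embeddings.mean(dim=0)

        # Each pos_tag entry is a (token, tag) pair, so the tag is at index 1.
        pos_tags = nltk.pos_tag(tokens)
        pos_weights = torch.tensor([pos_weight_map.get(tag, 1.0)
                                    for _, tag in pos_tags])
        # Reshape to (n, 1) so the weights broadcast across the embedding dim.
        weighted = token_embeddings * pos_weights.view(-1, 1)
        # Weighted sum of the token embeddings.
        return weighted.sum(dim=0)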

10 of 41

KNN Classifier Using GloVe-Based Sentence Embeddings

  • Fit
    • Embeddings of train X: stack the sentence embedding of every training example
  • Predict
    • Similarity is computed between the embeddings of test X and train X
    • Indices of the k nearest neighbors: topk(similarity, k); remember to use dim=1
    • Labels of the k nearest neighbors: y[indices]
    • Most common label: use torch.mode; see the sketch below
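A minimal k-NN sketch over precomputed sentence embeddings (class and variable names are illustrative; the assignment may use cosine similarity rather than the raw dot product shown here):

    import torch

    class KNNClassifier:
        def __init__(self, k: int = 5):
            self.k = k

        def fit(self, train_embeddings: torch.Tensor, train_y: torch.Tensor):
            # train_embeddings: (n_train, dim), e.g. stacked sentence embeddings.
            self.train_X = train_embeddings
            self.train_y = train_y

        def predict(self, test_embeddings: torch.Tensor) -> torch.Tensor:
            # Similarity between every test sentence and every training sentence.
            sims = test_embeddings @ self.train_X.T        # (n_test, n_train)
            # Indices of the k most similar training examples, per row (dim=1).
            _, knn_idx = torch.topk(sims, self.k, dim=1)   # (n_test, k)
            # Labels of those neighbors, then the most common label per row.
            knn_labels = self.train_y[knn_idx]             # (n_test, k)
            preds, _ = torch.mode(knn_labels, dim=1)
            return preds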

11 of 41

Word Embeddings: A Quick Review

  • Motivation:
    • Represent words in a computationally efficient and semantically meaningful way
  • Evaluation:
    • Intrinsic: word similarities, TOEFL-like synonyms, analogies, etc.
    • Extrinsic: do the embeddings improve system performance?
  • Using embeddings in your model (see the sketch below):
    • Freeze embeddings and use as-is in your model
    • Fine-tune embeddings, updating them as you train
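A quick illustration of the two options in PyTorch, assuming you already have a (vocab_size, dim) tensor of pretrained vectors:

    import torch
    import torch.nn as nn

    pretrained = torch.randn(10000, 300)   # stand-in for real GloVe/word2vec vectors

    # Option 1: freeze the embeddings and use them as-is (no gradient updates).
    frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

    # Option 2: fine-tune them, letting gradients update the vectors during training.
    tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

    token_ids = torch.tensor([[1, 5, 42]])
    vectors = tuned_emb(token_ids)          # (1, 3, 300)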

12 of 41

Man-woman relations in embeddings

13 of 41

Comparative-superlative relations in embeddings

14 of 41

Distributional Hypothesis, again

  • A word’s meaning is given by words that appear frequently close by
  • When a word w appears in text, its context is the set of words that appear nearby (in some window).
  • Dense Vectors From 10,000 feet:
    • Find a bunch of times that w occurs in text.
    • Use the many contexts of w to build a vector.

These context words define "banking".

15 of 41

Dense Word Vectors

  • Let’s assign each word a dense word vector
  • But each word’s vector should be similar to vectors of words that appear in similar contexts.
  • Example:

    U.S.       = [ 0.281,  0.129,  0.312, -1.29, -0.21]
    Washington = [ 0.271,  0.110,  0.311, -1.33, -0.11]
    grass      = [-0.121,  0.930,  0.121,  1.53, -0.51]

If words appear in similar contexts, they have similar vectors!

16 of 41

“U.S.” and “Washington” occur in similar contexts!

17 of 41

"Static" Word Embeddings

Each word maps to a single vector, based on its co-occurrence with other words in a large corpus.

Connects to LSA/LSI; parallels to LMs

Examples of popular pretrained word embeddings:

  • word2vec: Trained on Google News
  • GloVe: Trained on Wikipedia, Gigaword, Common Crawl, or Twitter
  • FastText: Trained on Wikipedia or Common Crawl

18 of 41

Word2Vec: Overview

  • Word2Vec is a framework for learning word vectors. Basic idea:
    • We have a large corpus of text.
    • Every word in a fixed vocabulary is assigned a vector.
    • Go through each position t in the text, which has a center word c and outside (context) words o.
    • Use the similarity of the word vectors for c and o to calculate the probability of o given c.
    • Training: continuously adjust the word vectors to maximize this probability.

19 of 41

Word2Vec: Overview

  • Example for computing P(w_{t+j} | w_t)

20 of 41

Word2Vec: Overview

  • Example for computing P(w_{t+j} | w_t)

21 of 41

Word2Vec: Loss Function

  • For each position t = 1 … T, predict the context words within a fixed window of size m, given the center word w_t
  • Likelihood (θ = the parameters of the model, i.e. the things we want to optimize):

    L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t; \theta)

    • The outer product runs over each position in the text
    • The inner product runs over each word within the window
    • P(w_{t+j} \mid w_t; \theta) is the probability of a word in the window given the center word

22 of 41

Word2Vec: Loss Function

  • Loss function J: the averaged negative log-likelihood (written out below)
    • Work in log space!
    • The negative sign turns the maximization problem into a minimization problem
  • If we minimize the loss function J, then we maximize the predictive accuracy!
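Written out, the averaged negative log-likelihood for a corpus of length T and window size m (the standard skip-gram objective the bullets describe) is:

    J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t; \theta)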

23 of 41

Word2Vec: Loss Function

  • Question: how do we calculate P(w_{t+j} | w_t)?
  • Answer: use two vectors per word w.
    • Use the vector v_w when w is the center word.
    • Use the vector u_w when w is the context word.
  • Thus, for a center word c and a context word o:

    P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}

  • Look familiar? It is a softmax over dot-product scores (see the sketch below).
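A tiny PyTorch sketch of this prediction function, with U and V as the (vocab_size, dim) matrices of context and center vectors (the names and layout are illustrative):

    import torch

    def p_context_given_center(o: int, c: int,
                               U: torch.Tensor,   # context vectors u_w, (vocab, dim)
                               V: torch.Tensor    # center vectors v_w, (vocab, dim)
                               ) -> torch.Tensor:
        # Scores are dot products between the center vector v_c and every u_w.
        scores = U @ V[c]                          # (vocab,)
        # Softmax over the whole vocabulary, then pick out word o's probability.
        return torch.softmax(scores, dim=0)[o]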

24 of 41

Word2Vec: Now with Vectors!

  • Example for computing P(w_{t+j} | w_t)

25 of 41

Word2Vec: Now with Vectors!

  • Example for computing P(w_{t+j} | w_t)

26 of 41

Word2Vec: Why this prediction function?

  • Softmax shows up again.
  • We can train this with gradient descent.
  • This model puts words that frequently co-occur nearby in vector space (to maximize the dot product).

27 of 41

Clusters of dense word vectors

28 of 41

Why separate center and context vectors?

  • Why use two vectors (one for when the word is a context word, one for when it is the center word)?
    • It makes optimization/training easier in practice.
    • The final vector for a word is traditionally the average of its context and center vectors.

29 of 41

Why separate center and context vectors?

  • Another angle:

30 of 41

Two Variants of Word2Vec

  1. SkipGram (what we’ve seen so far): predict the context (outside) words given the center word.
  2. CBOW: predict the center word from the sum of the surrounding words’ vectors. (See the sketch below.)
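A minimal SkipGram training sketch in PyTorch (this is the full-softmax version; real word2vec uses negative sampling or hierarchical softmax for efficiency, and the sizes and IDs here are placeholders):

    import torch
    import torch.nn as nn

    class SkipGram(nn.Module):
        # Predict each context (outside) word from the center word.
        def __init__(self, vocab_size: int, dim: int):
            super().__init__()
            self.center = nn.Embedding(vocab_size, dim)    # v_w
            self.context = nn.Embedding(vocab_size, dim)   # u_w

        def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
            v_c = self.center(center_ids)                  # (batch, dim)
            # Scores over the whole vocabulary for each center word.
            return v_c @ self.context.weight.T             # (batch, vocab)

    model = SkipGram(vocab_size=10000, dim=100)
    loss_fn = nn.CrossEntropyLoss()
    center_ids = torch.tensor([3, 3, 7])       # toy (center, context) training pairs
    context_ids = torch.tensor([5, 9, 2])
    loss = loss_fn(model(center_ids), context_ids)
    loss.backward()

CBOW reverses this: sum (or average) the context words' vectors and score the center word.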

31 of 41

CBOW in practice

32 of 41

Skipgram is like the reverse of CBOW?

33 of 41

Okay, okay just kidding, here's the real SkipGram diagram:

34 of 41

Contextualized Word Embeddings

Premise: define a vector for each token based on its context in the data

  • How do we get context? RNN-based neural LMs
    • The hidden state h_i at timestep i represents the left context of token x_i
    • Compute an analogous right context by training a right-to-left LM
    • Simplest approach: concatenate the two contexts to get an embedding (see the sketch below)
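A minimal sketch of the concatenation idea, using a single bidirectional LSTM layer over static word vectors (ELMo itself trains separate multi-layer forward and backward LMs; this only illustrates the "concatenate left and right context" step, and the sizes are placeholders):

    import torch
    import torch.nn as nn

    dim, hidden = 100, 64
    bilstm = nn.LSTM(input_size=dim, hidden_size=hidden,
                     batch_first=True, bidirectional=True)

    sentence = torch.randn(1, 7, dim)     # (batch=1, 7 tokens, dim) static embeddings
    outputs, _ = bilstm(sentence)         # (1, 7, 2 * hidden)
    # outputs[:, i, :hidden] is the left-to-right state for token i,
    # outputs[:, i, hidden:] is the right-to-left state; concatenated, they give
    # one contextualized embedding per token.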

35 of 41

Contextualized Word Embeddings

ELMo (Peters et al., 2018)

  • Used a multi-layer, bidirectional LSTM
  • Using ELMo instead of static vectors: instant SOTA on many benchmark tasks

36 of 41

ELMo, visually

37 of 41

BERT

BERT (Devlin et al., 2019) :

  • Instead of RNNs, it uses Transformers.
  • Learning objectives:
    • Masked Language Model (MLM): randomly mask out words for the model to predict (see the sketch below).
    • Next Sentence Prediction (NSP): given a pair of sentences, does the second sentence follow the first? Helpful for understanding the relationship between sentences (for QA, NLI, etc.).
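A rough sketch of MLM-style masking. BERT's full recipe also leaves some selected tokens unchanged or swaps in random tokens; the mask_id argument is an assumption of this sketch, and only the basic masking idea is shown:

    import torch

    def mask_tokens(token_ids: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
        labels = token_ids.clone()
        # Choose which positions to mask.
        mask = torch.rand(token_ids.shape) < mask_prob
        # Only masked positions contribute to the MLM loss; -100 is the default
        # ignore_index of nn.CrossEntropyLoss.
        labels[~mask] = -100
        inputs = token_ids.clone()
        inputs[mask] = mask_id
        return inputs, labels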

38 of 41

BERT

Pretrain + finetune like we discussed!

BERT’s Performance on GLUE tasks (Devlin et al., 2019)

39 of 41

BERTology

  • Many, many ideas build on BERT:

  • Multilingual BERT (Devlin et al., 2019):
    • pretrained on 104 languages.
  • RoBERTa (Liu et al., 2019):
    • removed the NSP objective;
    • trained with larger mini-batches;
    • larger learning rates;
    • more data;
    • longer pretraining time.
  • Overview: Rogers et al. (2020)
  • T5 (Raffel et al., 2019): a text-to-text model that systematically explored many pretraining and transfer choices

40 of 41

Interactive Word Embeddings

41 of 41

Questions?

  • Thank you!