1 of 48

Vector Semantics & Embeddings

  • Words and Vectors

Slides are adapted from Jurafsky Ch6

https://web.stanford.edu/~jurafsky/slp3/

Word -> Vector

Cat -> [5, 13, …, 102]

2 of 48

Vectorization Approaches

Statistical vector

    • Counting based
    • Similarity score
    • One-hot
    • TF-IDF
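As a quick sketch of the one-hot idea from the list above (the toy vocabulary here is made up):

```python
# A minimal one-hot sketch: each word maps to a vector with a single 1
# at its own index. Toy vocabulary for illustration only.
vocab = ["cat", "dog", "the"]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(one_hot("dog"))  # [0, 1, 0]
# One-hot vectors are all orthogonal, so they encode no similarity:
# "cat" is exactly as far from "dog" as it is from "the".
```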

Learned vector (Neural Language Model)

    • Word2vec (skipgram, CBOW), GloVe
    • Contextual Embeddings (ELMo, BERT)

3 of 48

Term-document matrix

Each document is represented by a vector of words
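A minimal sketch of building such a matrix with the Python standard library; the two toy "documents" below are hypothetical stand-ins for real plays:

```python
from collections import Counter

# Toy corpus: each "document" is a short text (hypothetical examples).
docs = {
    "doc1": "battle battle soldier fool",
    "doc2": "fool wit fool clown",
}

# Vocabulary = union of all words, in a fixed order.
vocab = sorted({w for text in docs.values() for w in text.split()})

# Each document becomes a vector of word counts over the vocabulary.
def doc_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

matrix = {name: doc_vector(text) for name, text in docs.items()}
print(vocab)           # the column labels
print(matrix["doc1"])  # doc1's count vector
```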

4 of 48

Visualizing document vectors

5 of 48

Vectors are the basis of information retrieval

Vectors are similar for the two comedies

But comedies are different from the other two

Comedies have more fools and wit and fewer battles.

6 of 48

Idea for word meaning: Words can be vectors too!!!

battle is "the kind of word that occurs in Julius Caesar and Henry V"

fool is "the kind of word that occurs in comedies, especially Twelfth Night"

7 of 48

More common: word-word matrix (or "term-context matrix")

Two words are similar in meaning if their context vectors are similar
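A minimal sketch of building a term-context matrix from a toy corpus (the sentence below is made up); each row is a word's context vector:

```python
from collections import Counter, defaultdict

def term_context_matrix(tokens, window=2):
    """Count, for each word, which words occur within +/- window of it."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "sugar is sweet and apricot jam is sweet".split()
m = term_context_matrix(tokens)
# Words that share context counts get similar rows, hence similar vectors.
print(m["sweet"])
```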


8 of 48

9 of 48

Cosine for computing word similarity

  • Can we replace the counting with a similarity score?

10 of 48

Computing word similarity: Dot product and cosine

The dot product between two vectors is a scalar:

dot(v, w) = v ∙ w = v1w1 + v2w2 + … + vNwN

The dot product tends to be high when the two vectors have large values in the same dimensions

Dot product can thus be a useful similarity metric between vectors

 

 

11 of 48

Alternative: cosine for computing word similarity

Based on the definition of the dot product between two vectors a and b

12 of 48

Cosine as a similarity metric

-1: vectors point in opposite directions

+1: vectors point in same directions

0: vectors are orthogonal

But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from 0 to 1


13 of 48

Cosine examples


Counts of three target words (rows) in three context columns:

              pie   data   computer
cherry        442      8          2
digital         5   1683       1670
information     5   3982       3325
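A sketch of cosine similarity applied to these counts; the results match the intuition that digital and information share contexts while cherry and information do not:

```python
import math

def cosine(v, w):
    """Cosine similarity: dot product normalized by the vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) *
                  math.sqrt(sum(wi * wi for wi in w)))

# Rows of the count table: counts over the contexts (pie, data, computer)
cherry      = [442, 8, 2]
digital     = [5, 1683, 1670]
information = [5, 3982, 3325]

print(cosine(cherry, information))   # ~0.017: rarely share contexts
print(cosine(digital, information))  # ~0.996: very similar contexts
```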

14 of 48

Visualizing cosines (well, angles)

15 of 48

Pop-up quiz

 

 

16 of 48

How will you interpret the results?

 

 

V = "Cat", W = "Dog", Z = "The"

17 of 48

TF-IDF

  • The co-occurrence matrices we have seen represent each cell by raw word frequencies.
  • Frequency is clearly useful; if sugar appears a lot near apricot, that's useful information.
  • But overly frequent words like the, it, or they are not very informative about the context.
  • It's a paradox! How can we balance these two conflicting constraints?

18 of 48

Two common solutions for word weighting

 

    • tf-idf: words like "the" or "it" have very low idf
    • PMI (pointwise mutual information): see if words like "good" appear more often with "great" than we would expect by chance

19 of 48

Document frequency (df)

df_t is the number of documents that term t occurs in.

(Note this is not collection frequency: the total count of the term across all documents.)

"Romeo" is very distinctive for one Shakespeare play:

20 of 48

Inverse document frequency (idf)

idf_t = log10(N / df_t)

where N is the total number of documents in the collection

21 of 48

Final tf-idf weighted value for a word:

w_t,d = tf_t,d × idf_t    (with tf_t,d = log10(count(t, d) + 1))

Raw counts:

tf-idf:
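A minimal sketch of the tf-idf weight, using the log-scaled variants tf = log10(count + 1) and idf = log10(N / df); the count for "Romeo" below is hypothetical, and the 37-document collection is an assumption standing in for the Shakespeare plays:

```python
import math

def tf_idf(count, df, N):
    """tf-idf with log-scaled term frequency.

    count: occurrences of the term in this document
    df:    number of documents containing the term
    N:     total number of documents in the collection
    """
    tf = math.log10(count + 1)
    idf = math.log10(N / df)
    return tf * idf

# "Romeo" is distinctive: it appears in only 1 of 37 plays, so its idf
# is high. "the" occurs in every play (df = N), so idf = log10(1) = 0
# and its tf-idf vanishes no matter how large its count is.
print(tf_idf(count=113, df=1, N=37))     # high weight for "Romeo" (count is hypothetical)
print(tf_idf(count=25000, df=37, N=37))  # 0.0 for "the"
```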

22 of 48

Learned Vectors


tf-idf (or PMI) vectors are

    • long (length |V| = 20,000 to 50,000)
    • sparse (most elements are zero)

Alternative: learn vectors which are

    • short (length 50–1000)
    • dense (most elements are non-zero)

23 of 48

Sparse versus dense vectors

Why dense vectors?

    • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
    • Dense vectors may generalize better than explicit counts
    • Dense vectors may do better at capturing synonymy:
      • car and automobile are synonyms, but count vectors treat them as distinct dimensions
      • a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
    • In practice, they work better


24 of 48

Common methods for getting short dense vectors

“Neural Language Model”-inspired models

    • Word2vec (skipgram, CBOW), GloVe

Singular Value Decomposition (SVD)

    • A special case of this is called LSA – Latent Semantic Analysis

Alternative to these "static embeddings":

    • Contextual Embeddings (ELMo, BERT)
    • Compute distinct embeddings for a word in its context
    • Separate embeddings for each token of a word

25 of 48

Simple static embeddings you can download!

Word2vec (Mikolov et al)

https://code.google.com/archive/p/word2vec/

GloVe (Pennington, Socher, Manning)

http://nlp.stanford.edu/projects/glove/

26 of 48

Word2vec

Popular embedding method

Very fast to train

Code available on the web

Idea: predict rather than count

Word2vec provides various options. We'll do:

skip-gram with negative sampling (SGNS)

27 of 48

Word2vec

Instead of counting how often each word w occurs near "apricot"

    • Train a classifier on a binary prediction task:
      • Is w likely to show up near "apricot"?

We don’t actually care about this task

      • But we'll take the learned classifier weights as the word embeddings

Big idea: self-supervision:

      • A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
      • No need for human labels
      • Bengio et al. (2003); Collobert et al. (2011)

28 of 48

Approach: predict if candidate word c is a "neighbor"

  1. Treat the target word t and a neighboring context word c as positive examples.
  2. Randomly sample other words in the lexicon to get negative examples
  3. Use logistic regression to train a classifier to distinguish those two cases
  4. Use the learned weights as the embeddings

29 of 48

Skip-Gram Training Data

Assume a +/- 2 word window, given training sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4
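A sketch of extracting the positive (target, context) pairs with a ±2 word window from this sentence:

```python
def positive_pairs(tokens, window=2):
    """All (target, context) pairs within +/- window positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["tablespoon", "of", "apricot", "jam", "a"]
# For the target "apricot", the +/- 2 window gives 4 positive pairs:
apricot = [p for p in positive_pairs(tokens) if p[0] == "apricot"]
print(apricot)
# [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
```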

30 of 48

Skip-Gram Classifier

(assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4

Goal: train a classifier that is given a candidate (word, context) pair

(apricot, jam)

(apricot, aardvark)

And assigns each pair a probability:

P(+|w, c)

P(−|w, c) = 1 − P(+|w, c)

31 of 48

Similarity is computed from dot product

Remember: two vectors are similar if they have a high dot product

    • Cosine is just a normalized dot product

So:

    • Similarity(w,c) ∝ w ∙ c

We’ll need to normalize to get a probability

    • (cosine isn't a probability either)


32 of 48

Turning dot products into probabilities

Sim(w, c) ∝ w ∙ c

To turn this into a probability, we'll use the sigmoid from logistic regression:

P(+|w, c) = σ(c ∙ w) = 1 / (1 + exp(−c ∙ w))

33 of 48

How Skip-Gram Classifier computes P(+|w, c)

This is for one context word, but we have lots of context words.

We'll assume independence and just multiply them:

P(+|w, c1:L) = ∏ i=1..L σ(ci ∙ w)
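A sketch of both steps, assuming tiny made-up 3-dimensional embeddings: the sigmoid of the dot product gives P(+|w, c) for one context word, and the window probability multiplies these under the independence assumption:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_positive(w, c):
    """P(+ | w, c) = sigma(c . w): probability c is a real context of w."""
    return sigmoid(sum(wi * ci for wi, ci in zip(w, c)))

def p_positive_window(w, contexts):
    """Assume independence and multiply over all L context words."""
    p = 1.0
    for c in contexts:
        p *= p_positive(w, c)
    return p

# Tiny hypothetical 3-d embeddings
w  = [0.5, -0.2, 0.8]
c1 = [0.4, 0.1, 0.9]
c2 = [-0.3, 0.6, -0.5]
print(p_positive(w, c1))               # > 0.5: the vectors point the same way
print(p_positive_window(w, [c1, c2]))  # product of per-context-word probabilities
```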

34 of 48

Skip-gram classifier: summary

A probabilistic classifier, given

    • a test target word w
    • its context window of L words c1:L

Estimates probability that w occurs in this window based on similarity of w (embeddings) to c1:L (embeddings).

To compute this, we just need embeddings for all the words.

35 of 48

These embeddings we'll need: a set for w, a set for c

36 of 48

Learned Vectors

  • Word2vec: Learning the embeddings

37 of 48

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4


38 of 48

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4


For each positive example we'll grab k negative examples, sampling by frequency
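A sketch of frequency-weighted negative sampling; the α = 0.75 dampening exponent is the choice used in word2vec, and the toy counts are made up:

```python
import random

def negative_samples(counts, k, exclude, alpha=0.75):
    """Draw k negative context words, weighted by count ** alpha.

    alpha = 0.75 dampens the unigram distribution so rare words are
    sampled a bit more often than their raw frequency would suggest.
    Sampling is with replacement, so negatives may repeat.
    """
    words = [w for w in counts if w not in exclude]
    weights = [counts[w] ** alpha for w in words]
    return random.choices(words, weights=weights, k=k)

counts = {"the": 1000, "apricot": 5, "jam": 8, "aardvark": 1, "of": 800}
negs = negative_samples(counts, k=2, exclude={"apricot", "jam"})
print(negs)  # frequent words like "the" and "of" dominate, but dampened
```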

39 of 48

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4


40 of 48

Word2vec: how to learn vectors

Given the set of positive and negative training instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors such that we:

    • Maximize the similarity of the target word, context word pairs (w , cpos) drawn from the positive data
    • Minimize the similarity of the (w , cneg) pairs drawn from the negative data.


9/26/2024

41 of 48

Loss function for one w with cpos, cneg1 … cnegk

L = −[ log σ(cpos ∙ w) + Σ i=1..k log σ(−cnegi ∙ w) ]

Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words.
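A sketch of this loss for one training example, with hypothetical 2-dimensional embeddings; the loss is small when w is similar to the positive context and dissimilar to the negatives:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sgns_loss(w, c_pos, c_negs):
    """L = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]"""
    loss = -math.log(sigmoid(dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= math.log(sigmoid(-dot(c_neg, w)))
    return loss

# Hypothetical embeddings: the loss drops when w aligns with c_pos
# and points away from the negative samples.
w = [1.0, 0.5]
good = sgns_loss(w, c_pos=[0.9, 0.4], c_negs=[[-0.8, -0.3]])
bad  = sgns_loss(w, c_pos=[-0.8, -0.3], c_negs=[[0.9, 0.4]])
print(good < bad)  # True
```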

42 of 48

Learning the classifier

How to learn?

    • Stochastic gradient descent!

We’ll adjust the word weights to

    • make the positive pairs more likely
    • and the negative pairs less likely,
    • over the entire training set.

43 of 48

Intuition of one step of gradient descent

44 of 48

Reminder: gradient descent

θ(t+1) = θ(t) − η ∇L(θ)

45 of 48

The derivatives of the loss function

46 of 48

Update equation in SGD

Start with randomly initialized C and W matrices, then incrementally do updates
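A sketch of one such update for a single (w, cpos, negatives) example, following the gradients of the skip-gram loss; the embeddings are tiny made-up vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sgd_step(w, c_pos, c_negs, eta=0.1):
    """One SGD update for one (w, c_pos, negatives) training example.

    Updates follow the gradient of the SGNS loss:
      c_pos <- c_pos - eta * (sigma(c_pos.w) - 1) * w
      c_neg <- c_neg - eta *  sigma(c_neg.w)      * w
      w     <- w - eta * [ (sigma(c_pos.w) - 1) * c_pos
                           + sum_i sigma(c_neg_i.w) * c_neg_i ]
    """
    g_pos = sigmoid(dot(c_pos, w)) - 1.0
    g_negs = [sigmoid(dot(c, w)) for c in c_negs]

    grad_w = [g_pos * cp + sum(g * c[i] for g, c in zip(g_negs, c_negs))
              for i, cp in enumerate(c_pos)]
    new_c_pos = [cp - eta * g_pos * wi for cp, wi in zip(c_pos, w)]
    new_c_negs = [[cn - eta * g * wi for cn, wi in zip(c, w)]
                  for g, c in zip(g_negs, c_negs)]
    new_w = [wi - eta * gw for wi, gw in zip(w, grad_w)]
    return new_w, new_c_pos, new_c_negs

# After one step, the positive pair should score higher than before.
w, c_pos, c_negs = [0.1, -0.2], [0.3, 0.4], [[-0.1, 0.5]]
w2, c_pos2, _ = sgd_step(w, c_pos, c_negs)
print(dot(c_pos2, w2) > dot(c_pos, w))  # True
```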

47 of 48

Two sets of embeddings

SGNS learns two sets of embeddings

Target embeddings matrix W

Context embedding matrix C

It's common to just add them together, representing word i as the vector wi + ci

48 of 48

Summary: How to learn word2vec (skip-gram) embeddings

Start with |V| random d-dimensional vectors as initial embeddings

Train a classifier based on embedding similarity

    • Take a corpus and take pairs of words that co-occur as positive examples
    • Take pairs of words that don't co-occur as negative examples
    • Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
    • Throw away the classifier code and keep the embeddings.