1 of 48

Vector Semantics & Embeddings

  • Words and Vectors

Slides are adapted from Jurafsky Ch6

https://web.stanford.edu/~jurafsky/slp3/

Word -> Vector

Cat -> [5, 13, …, 102]

2 of 48

Vectorization Approaches

Statistical vector

    • Counting based
    • Similarity score
    • One-hot
    • TF-IDF
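As a quick sketch of the one-hot idea from the list above (the toy vocabulary here is made up):

```python
# A minimal one-hot sketch: each word maps to a vector with a single 1
# at its own index. Toy vocabulary for illustration only.
vocab = ["cat", "dog", "the"]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(one_hot("dog"))  # [0, 1, 0]
# One-hot vectors are all orthogonal, so they encode no similarity:
# "cat" is exactly as far from "dog" as it is from "the".
```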

Learned vector (Neural Language Model)

    • Word2vec (skipgram, CBOW), GloVe
    • Contextual Embeddings (ELMo, BERT)

3 of 48

Term-document matrix

Each document is represented by a vector of words
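A minimal sketch of building such a matrix with the Python standard library; the two toy "documents" below are hypothetical stand-ins for real plays:

```python
from collections import Counter

# Toy corpus: each "document" is a short text (hypothetical examples).
docs = {
    "doc1": "battle battle soldier fool",
    "doc2": "fool wit fool clown",
}

# Vocabulary = union of all words, in a fixed order.
vocab = sorted({w for text in docs.values() for w in text.split()})

# Each document becomes a vector of word counts over the vocabulary.
def doc_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

matrix = {name: doc_vector(text) for name, text in docs.items()}
print(vocab)           # the column labels
print(matrix["doc1"])  # doc1's count vector
```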

4 of 48

Visualizing document vectors

5 of 48

Vectors are the basis of information retrieval

Vectors are similar for the two comedies

But comedies are different from the other two

Comedies have more fools and wit and fewer battles.

6 of 48

Idea for word meaning: Words can be vectors too!!!

battle is "the kind of word that occurs in Julius Caesar and Henry V"

fool is "the kind of word that occurs in comedies, especially Twelfth Night"

7 of 48

More common: word-word matrix (or "term-context matrix")

Two words are similar in meaning if their context vectors are similar
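A minimal sketch of building a term-context matrix from a toy corpus (the sentence below is made up); each row is a word's context vector:

```python
from collections import Counter, defaultdict

def term_context_matrix(tokens, window=2):
    """Count, for each word, which words occur within +/- window of it."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "sugar is sweet and apricot jam is sweet".split()
m = term_context_matrix(tokens)
# Words that share context counts get similar rows, hence similar vectors.
print(m["sweet"])
```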


8 of 48

9 of 48

Cosine for computing word similarity

  • Can we replace the counting with a similarity score?

10 of 48

Computing word similarity: Dot product and cosine

The dot product between two vectors is a scalar:

dot(v, w) = v ∙ w = v1w1 + v2w2 + … + vNwN

The dot product tends to be high when the two vectors have large values in the same dimensions

Dot product can thus be a useful similarity metric between vectors

 

 

11 of 48

Alternative: cosine for computing word similarity

Based on the definition of the dot product between two vectors a and b

12 of 48

Cosine as a similarity metric

-1: vectors point in opposite directions

+1: vectors point in same directions

0: vectors are orthogonal

But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from 0 to 1


13 of 48

Cosine examples


Counts of three target words (rows) in three context columns:

              pie   data   computer
cherry        442      8          2
digital         5   1683       1670
information     5   3982       3325
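A sketch of cosine similarity applied to these counts; the results match the intuition that digital and information share contexts while cherry and information do not:

```python
import math

def cosine(v, w):
    """Cosine similarity: dot product normalized by the vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) *
                  math.sqrt(sum(wi * wi for wi in w)))

# Rows of the count table: counts over the contexts (pie, data, computer)
cherry      = [442, 8, 2]
digital     = [5, 1683, 1670]
information = [5, 3982, 3325]

print(cosine(cherry, information))   # ~0.017: rarely share contexts
print(cosine(digital, information))  # ~0.996: very similar contexts
```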

14 of 48

Visualizing cosines (well, angles)

15 of 48

Pop-up quiz

 

 

16 of 48

How will you interpret the results?

 

 

V = "Cat", W = "Dog", Z = "The"

17 of 48

TF-IDF

  • The co-occurrence matrices we have seen represent each cell by raw word frequencies.
  • Frequency is clearly useful; if sugar appears a lot near apricot, that's useful information.
  • But overly frequent words like the, it, or they are not very informative about the context.
  • It's a paradox! How can we balance these two conflicting constraints?

18 of 48

Two common solutions for word weighting

 

    • tf-idf: words like "the" or "it" have very low idf
    • PMI (pointwise mutual information): see if words like "good" appear more often with "great" than we would expect by chance

19 of 48

Document frequency (df)

df_t is the number of documents that term t occurs in.

(Note this is not collection frequency: the total count of the term across all documents.)

"Romeo" is very distinctive for one Shakespeare play:

20 of 48

Inverse document frequency (idf)

idf_t = log10(N / df_t)

where N is the total number of documents in the collection

21 of 48

Final tf-idf weighted value for a word:

w_t,d = tf_t,d × idf_t    (with tf_t,d = log10(count(t, d) + 1))

Raw counts:

tf-idf:
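A minimal sketch of the tf-idf weight, using the log-scaled variants tf = log10(count + 1) and idf = log10(N / df); the count for "Romeo" below is hypothetical, and the 37-document collection is an assumption standing in for the Shakespeare plays:

```python
import math

def tf_idf(count, df, N):
    """tf-idf with log-scaled term frequency.

    count: occurrences of the term in this document
    df:    number of documents containing the term
    N:     total number of documents in the collection
    """
    tf = math.log10(count + 1)
    idf = math.log10(N / df)
    return tf * idf

# "Romeo" is distinctive: it appears in only 1 of 37 plays, so its idf
# is high. "the" occurs in every play (df = N), so idf = log10(1) = 0
# and its tf-idf vanishes no matter how large its count is.
print(tf_idf(count=113, df=1, N=37))     # high weight for "Romeo" (count is hypothetical)
print(tf_idf(count=25000, df=37, N=37))  # 0.0 for "the"
```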

22 of 48

Learned Vectors


tf-idf (or PMI) vectors are

    • long (length |V| = 20,000 to 50,000)
    • sparse (most elements are zero)

Alternative: learn vectors which are

    • short (length 50–1000)
    • dense (most elements are non-zero)

23 of 48

Sparse versus dense vectors

Why dense vectors?

    • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
    • Dense vectors may generalize better than explicit counts
    • Dense vectors may do better at capturing synonymy:
      • car and automobile are synonyms, but count vectors treat them as distinct dimensions
      • a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
    • In practice, they work better


24 of 48

Common methods for getting short dense vectors

“Neural Language Model”-inspired models

    • Word2vec (skipgram, CBOW), GloVe

Singular Value Decomposition (SVD)

    • A special case of this is called LSA – Latent Semantic Analysis

Alternative to these "static embeddings":

    • Contextual Embeddings (ELMo, BERT)
    • Compute distinct embeddings for a word in its context
    • Separate embeddings for each token of a word

25 of 48

Simple static embeddings you can download!

Word2vec (Mikolov et al)

https://code.google.com/archive/p/word2vec/

GloVe (Pennington, Socher, Manning)

http://nlp.stanford.edu/projects/glove/

26 of 48

Word2vec

Popular embedding method

Very fast to train

Code available on the web

Idea: predict rather than count

Word2vec provides various options. We'll do:

skip-gram with negative sampling (SGNS)

27 of 48

Word2vec

Instead of counting how often each word w occurs near "apricot"

    • Train a classifier on a binary prediction task:
      • Is w likely to show up near "apricot"?

We don’t actually care about this task

      • But we'll take the learned classifier weights as the word embeddings

Big idea: self-supervision:

      • A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
      • No need for human labels
      • Bengio et al. (2003); Collobert et al. (2011)

28 of 48

Approach: predict if candidate word c is a "neighbor"

  1. Treat the target word t and a neighboring context word c as positive examples.
  2. Randomly sample other words in the lexicon to get negative examples
  3. Use logistic regression to train a classifier to distinguish those two cases
  4. Use the learned weights as the embeddings

29 of 48

Skip-Gram Training Data

Assume a +/- 2 word window, given training sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4
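A sketch of extracting the positive (target, context) pairs with a ±2 word window from this sentence:

```python
def positive_pairs(tokens, window=2):
    """All (target, context) pairs within +/- window positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["tablespoon", "of", "apricot", "jam", "a"]
# For the target "apricot", the +/- 2 window gives 4 positive pairs:
apricot = [p for p in positive_pairs(tokens) if p[0] == "apricot"]
print(apricot)
# [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
```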

30 of 48

Skip-Gram Classifier

(assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4

Goal: train a classifier that is given a candidate (word, context) pair

(apricot, jam)

(apricot, aardvark)

And assigns each pair a probability:

P(+|w, c)

P(−|w, c) = 1 − P(+|w, c)

31 of 48

Similarity is computed from dot product

Remember: two vectors are similar if they have a high dot product

    • Cosine is just a normalized dot product

So:

    • Similarity(w,c) ∝ w ∙ c

We’ll need to normalize to get a probability

    • (cosine isn't a probability either)


32 of 48

Turning dot products into probabilities

Sim(w, c) ∝ w ∙ c

To turn this into a probability, we'll use the sigmoid from logistic regression:

P(+|w, c) = σ(c ∙ w) = 1 / (1 + exp(−c ∙ w))

33 of 48

How Skip-Gram Classifier computes P(+|w, c)

This is for one context word, but we have lots of context words.

We'll assume independence and just multiply them:

P(+|w, c1:L) = ∏ i=1..L σ(ci ∙ w)
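A sketch of both steps, assuming tiny made-up 3-dimensional embeddings: the sigmoid of the dot product gives P(+|w, c) for one context word, and the window probability multiplies these under the independence assumption:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_positive(w, c):
    """P(+ | w, c) = sigma(c . w): probability c is a real context of w."""
    return sigmoid(sum(wi * ci for wi, ci in zip(w, c)))

def p_positive_window(w, contexts):
    """Assume independence and multiply over all L context words."""
    p = 1.0
    for c in contexts:
        p *= p_positive(w, c)
    return p

# Tiny hypothetical 3-d embeddings
w  = [0.5, -0.2, 0.8]
c1 = [0.4, 0.1, 0.9]
c2 = [-0.3, 0.6, -0.5]
print(p_positive(w, c1))               # > 0.5: the vectors point the same way
print(p_positive_window(w, [c1, c2]))  # product of per-context-word probabilities
```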

34 of 48

Skip-gram classifier: summary

A probabilistic classifier, given

    • a test target word w
    • its context window of L words c1:L

Estimates probability that w occurs in this window based on similarity of w (embeddings) to c1:L (embeddings).

To compute this, we just need embeddings for all the words.

35 of 48

These embeddings we'll need: a set for w, a set for c

36 of 48

Learned Vectors

  • Word2vec: Learning the embeddings

37 of 48

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4


38 of 48

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4


For each positive example we'll grab k negative examples, sampling by frequency
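A sketch of frequency-weighted negative sampling; the α = 0.75 dampening exponent is the choice used in word2vec, and the toy counts are made up:

```python
import random

def negative_samples(counts, k, exclude, alpha=0.75):
    """Draw k negative context words, weighted by count ** alpha.

    alpha = 0.75 dampens the unigram distribution so rare words are
    sampled a bit more often than their raw frequency would suggest.
    Sampling is with replacement, so negatives may repeat.
    """
    words = [w for w in counts if w not in exclude]
    weights = [counts[w] ** alpha for w in words]
    return random.choices(words, weights=weights, k=k)

counts = {"the": 1000, "apricot": 5, "jam": 8, "aardvark": 1, "of": 800}
negs = negative_samples(counts, k=2, exclude={"apricot", "jam"})
print(negs)  # frequent words like "the" and "of" dominate, but dampened
```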

39 of 48

Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…

c1 c2 [target] c3 c4


40 of 48

Word2vec: how to learn vectors

Given the set of positive and negative training instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors such that we:

    • Maximize the similarity of the target word, context word pairs (w , cpos) drawn from the positive data
    • Minimize the similarity of the (w , cneg) pairs drawn from the negative data.


9/26/2024

41 of 48

Loss function for one w with cpos, cneg1 … cnegk

L = −[ log σ(cpos ∙ w) + Σ i=1..k log σ(−cnegi ∙ w) ]

Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words.
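A sketch of this loss for one training example, with hypothetical 2-dimensional embeddings; the loss is small when w is similar to the positive context and dissimilar to the negatives:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sgns_loss(w, c_pos, c_negs):
    """L = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]"""
    loss = -math.log(sigmoid(dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= math.log(sigmoid(-dot(c_neg, w)))
    return loss

# Hypothetical embeddings: the loss drops when w aligns with c_pos
# and points away from the negative samples.
w = [1.0, 0.5]
good = sgns_loss(w, c_pos=[0.9, 0.4], c_negs=[[-0.8, -0.3]])
bad  = sgns_loss(w, c_pos=[-0.8, -0.3], c_negs=[[0.9, 0.4]])
print(good < bad)  # True
```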

42 of 48

Learning the classifier

How to learn?

    • Stochastic gradient descent!

We’ll adjust the word weights to

    • make the positive pairs more likely
    • and the negative pairs less likely,
    • over the entire training set.

43 of 48

Intuition of one step of gradient descent

44 of 48

Reminder: gradient descent

θ(t+1) = θ(t) − η ∇L(θ)

45 of 48

The derivatives of the loss function

46 of 48

Update equation in SGD

Start with randomly initialized C and W matrices, then incrementally do updates
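A sketch of one such update for a single (w, cpos, negatives) example, following the gradients of the skip-gram loss; the embeddings are tiny made-up vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sgd_step(w, c_pos, c_negs, eta=0.1):
    """One SGD update for one (w, c_pos, negatives) training example.

    Updates follow the gradient of the SGNS loss:
      c_pos <- c_pos - eta * (sigma(c_pos.w) - 1) * w
      c_neg <- c_neg - eta *  sigma(c_neg.w)      * w
      w     <- w - eta * [ (sigma(c_pos.w) - 1) * c_pos
                           + sum_i sigma(c_neg_i.w) * c_neg_i ]
    """
    g_pos = sigmoid(dot(c_pos, w)) - 1.0
    g_negs = [sigmoid(dot(c, w)) for c in c_negs]

    grad_w = [g_pos * cp + sum(g * c[i] for g, c in zip(g_negs, c_negs))
              for i, cp in enumerate(c_pos)]
    new_c_pos = [cp - eta * g_pos * wi for cp, wi in zip(c_pos, w)]
    new_c_negs = [[cn - eta * g * wi for cn, wi in zip(c, w)]
                  for g, c in zip(g_negs, c_negs)]
    new_w = [wi - eta * gw for wi, gw in zip(w, grad_w)]
    return new_w, new_c_pos, new_c_negs

# After one step, the positive pair should score higher than before.
w, c_pos, c_negs = [0.1, -0.2], [0.3, 0.4], [[-0.1, 0.5]]
w2, c_pos2, _ = sgd_step(w, c_pos, c_negs)
print(dot(c_pos2, w2) > dot(c_pos, w))  # True
```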

47 of 48

Two sets of embeddings

SGNS learns two sets of embeddings

Target embeddings matrix W

Context embedding matrix C

It's common to just add them together, representing word i as the vector wi + ci

48 of 48

Summary: How to learn word2vec (skip-gram) embeddings

Start with |V| random d-dimensional vectors as initial embeddings

Train a classifier based on embedding similarity

    • Take a corpus and take pairs of words that co-occur as positive examples
    • Take pairs of words that don't co-occur as negative examples
    • Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
    • Throw away the classifier code and keep the embeddings.