Vector Semantics & Embeddings
Slides are adapted from Jurafsky Ch6
https://web.stanford.edu/~jurafsky/slp3/
Word -> Vector
Cat -> [5, 13, …, 102]
Vectorization Approaches
Statistical vector
Learned vector (Neural Language Model)
Term-document matrix
Each document is represented by a vector of words
Visualizing document vectors
Vectors are the basis of information retrieval
Vectors are similar for the two comedies
But comedies are different than the other two
Comedies have more fools and wit and fewer battles.
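The Shakespeare example above can be sketched numerically. The counts below are illustrative stand-ins (not the exact counts from the plays), chosen to show comedies clustering together:

```python
import numpy as np

# Illustrative term-document counts (rows: words, columns: plays).
# Two comedies (As You Like It, Twelfth Night) vs. Julius Caesar
# and Henry V. Exact numbers are assumptions for illustration.
words = ["battle", "fool", "wit"]
plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
counts = np.array([
    [1,  0,  7, 13],   # battle
    [36, 58,  1,  4],  # fool
    [20, 15,  2,  3],  # wit
])

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Each column is a document vector; the comedies come out most similar.
doc = {play: counts[:, j] for j, play in enumerate(plays)}
print(cosine(doc["As You Like It"], doc["Twelfth Night"]))   # high
print(cosine(doc["As You Like It"], doc["Julius Caesar"]))   # low
```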
Idea for word meaning: Words can be vectors too!!!
battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"
More common: word-word matrix (or "term-context matrix")
Two words are similar in meaning if their context vectors are similar
Cosine for computing word similarity
Computing word similarity: Dot product and cosine
The dot product between two vectors is a scalar:
The dot product tends to be high when the two vectors have large values in the same dimensions
Dot product can thus be a useful similarity metric between vectors
Alternative: cosine for computing word similarity
Based on the definition of the dot product between two vectors a and b
Cosine as a similarity metric
-1: vectors point in opposite directions
+1: vectors point in same directions
0: vectors are orthogonal
But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from 0–1
Cosine examples
|             | pie | data | computer |
| cherry      | 442 |    8 |        2 |
| digital     |   5 | 1683 |     1670 |
| information |   5 | 3982 |     3325 |
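Using the counts in the table above, cosine similarity separates the food word from the computing words; a minimal NumPy sketch:

```python
import numpy as np

# Context-count vectors from the table (dimensions: pie, data, computer).
cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

def cosine(a, b):
    """Cosine of the angle between two count vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(cherry, information), 3))   # low: different contexts
print(round(cosine(digital, information), 3))  # high: shared contexts
```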
Visualizing cosines (well, angles)
Pop-up quiz
How will you interpret the results?
V = "Cat", W = "Dog", Z = "The"
TF-IDF
Two common solutions for word weighting:
tf-idf: downweight ubiquitous words like "the" or "it", which have very low idf
PMI: see if words like "good" appear more often with "great" than we would expect by chance
Document frequency (df)
df_t is the number of documents term t occurs in.
(Note this is not collection frequency: the total count across all documents.)
"Romeo" is very distinctive for one Shakespeare play:
Inverse document frequency (idf)
idf_t = log10(N / df_t)
where N is the total number of documents in the collection
Final tf-idf weighted value for a word t in document d: w_t,d = tf_t,d × idf_t
[Table omitted: raw counts vs. tf-idf weights for the same words]
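A minimal tf-idf sketch, assuming the log-weighted tf variant tf = log10(count + 1) and idf = log10(N / df); the toy corpus is invented for illustration:

```python
import math

# Toy corpus (invented for illustration).
docs = [
    "the fool and the wit",
    "the battle and the king",
    "the fool the fool the fool",
]

N = len(docs)
tokenized = [d.split() for d in docs]

# Document frequency: number of documents each term occurs in.
df = {}
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def tf_idf(term, doc_tokens):
    """Log-weighted tf times idf for one term in one document."""
    tf = math.log10(doc_tokens.count(term) + 1)
    idf = math.log10(N / df[term])
    return tf * idf

# "the" occurs in every document, so idf = log10(N/N) = 0,
# and its tf-idf weight is 0 no matter how often it appears.
print(tf_idf("the", tokenized[0]))   # 0.0
print(tf_idf("fool", tokenized[2]))  # > 0
```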
Learned Vectors
tf-idf (or PMI) vectors are
-long (length |V|= 20,000 to 50,000)
-sparse (most elements are zero)
Alternative: learn vectors which are
-short (length 50-1000)
-dense (most elements are non-zero)
Sparse versus dense vectors
Why dense vectors?
Common methods for getting short dense vectors
“Neural Language Model”-inspired models
Singular Value Decomposition (SVD)
Alternative to these "static embeddings":
Simple static embeddings you can download!
Word2vec (Mikolov et al)
https://code.google.com/archive/p/word2vec/
GloVe (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
Word2vec provides various options. We'll do:
skip-gram with negative sampling (SGNS)
Word2vec
Instead of counting how often each word w occurs near "apricot",
train a classifier on a binary prediction task: is a word likely to show up near "apricot"?
We don't actually care about this task itself; we keep the learned classifier weights as the embeddings
Big idea: self-supervision — a word that actually occurs near "apricot" in running text acts as the gold "correct answer", so no hand labeling is needed
Approach: predict if candidate word c is a "neighbor"
Skip-Gram Training Data
Assume a +/- 2 word window, given training sentence:
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 [target] c3 c4
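Extracting the positive (target, context) pairs from the sentence above with a ±2 word window can be sketched as:

```python
def skipgram_pairs(tokens, window=2):
    """Return all (target, context) pairs within +/- window words."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"]
pairs = skipgram_pairs(tokens)

# Positive examples for the target "apricot":
print([c for (t, c) in pairs if t == "apricot"])
```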
Skip-Gram Classifier
(assuming a +/- 2 word window)
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 [target] c3 c4
Goal: train a classifier that is given a candidate (word, context) pair
(apricot, jam)
(apricot, aardvark)
…
And assigns each pair a probability:
P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)
Similarity is computed from dot product
Remember: two vectors are similar if they have a high dot product
So:
We’ll need to normalize to get a probability
Turning dot products into probabilities
Sim(w,c) ≈ w ∙ c
To turn this into a probability,
we'll use the sigmoid from logistic regression: σ(x) = 1 / (1 + e^(−x)), giving P(+|w, c) = σ(c ∙ w)
How Skip-Gram Classifier computes P(+|w, c)
This is for one context word, but we have lots of context words.
We'll assume independence and just multiply them: P(+|w, c_1:L) = ∏ from i=1 to L of σ(c_i ∙ w)
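Putting the sigmoid and the independence assumption together in a small sketch (the embedding values below are made up for illustration):

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p_positive(w, contexts):
    """P(+|w, c_1:L): probability that all L context words are real
    neighbors of w, assuming independence across context words."""
    p = 1.0
    for c in contexts:
        p *= sigmoid(dot(w, c))
    return p

# Toy 3-dimensional embeddings (values are assumptions).
w  = [0.5, -0.2, 0.1]
c1 = [0.4,  0.1, 0.3]
c2 = [0.6, -0.3, 0.2]
print(p_positive(w, [c1, c2]))
```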
Skip-gram classifier: summary
A probabilistic classifier: given a test target word w and its context window of L words c_1:L,
estimates the probability that w occurs in this window based on similarity of w (embeddings) to c_1:L (embeddings).
To compute this, we just need embeddings for all the words.
These embeddings we'll need: a set for w, a set for c
Learned Vector
Skip-Gram Training data
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 [target] c3 c4
For each positive example we'll grab k negative examples, sampling by frequency
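One common way to "sample by frequency" is word2vec's weighted unigram distribution, P_α(w) ∝ count(w)^0.75; the α = 0.75 exponent dampens very frequent words so rarer words get sampled a bit more often. A sketch with invented counts:

```python
import random

# Invented unigram counts for a tiny vocabulary.
counts = {"the": 500, "a": 300, "apricot": 5, "jam": 8, "aardvark": 1}
alpha = 0.75

vocab = list(counts)
# Weighted unigram frequencies: count^0.75 (word2vec's alpha weighting).
weights = [counts[w] ** alpha for w in vocab]

def sample_negatives(k, rng=random):
    """Draw k negative (noise) words by weighted unigram frequency."""
    return rng.choices(vocab, weights=weights, k=k)

random.seed(0)
print(sample_negatives(2))  # e.g. two noise words for one positive pair
```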
Word2vec: how to learn vectors
Given the set of positive and negative training instances, and an initial set of embedding vectors
The goal of learning is to adjust those word vectors so that positive (w, c_pos) pairs become more similar and negative (w, c_neg) pairs become less similar
9/26/2024
Loss function for one w with c_pos, c_neg_1 … c_neg_k
Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words.
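Written out (following the standard SGNS formulation), the loss for one positive pair (w, c_pos) with k sampled negatives is:

```latex
L_{CE} = -\log\Big[ P(+\mid w, c_{pos}) \prod_{i=1}^{k} P(-\mid w, c_{neg_i}) \Big]
       = -\Big[ \log \sigma(c_{pos}\cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i}\cdot w) \Big]
```

The first term rewards similarity to the actual context word; the sum rewards dissimilarity to the k sampled non-neighbors.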
Learning the classifier
How to learn?
We'll adjust the word weights to make the positive pairs more likely and the negative pairs less likely, over the entire training set
Intuition of one step of gradient descent
Reminder: gradient descent
The derivatives of the loss function
Update equation in SGD
Start with randomly initialized C and W matrices, then incrementally do updates
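One such SGD step can be sketched as follows, using random toy embeddings; the gradients follow the standard SGNS derivation ((σ(c·w) − 1) for the positive pair, σ(c·w) for each negative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, c_negs):
    """Negative log likelihood for one target w, one positive
    context word, and k negative samples."""
    loss = -np.log(sigmoid(c_pos @ w))
    for c in c_negs:
        loss -= np.log(sigmoid(-(c @ w)))
    return loss

def sgd_step(w, c_pos, c_negs, eta=0.05):
    """One SGD update of w, c_pos, and the negatives."""
    # Gradient w.r.t. w accumulates a term from each context word.
    g_w = (sigmoid(c_pos @ w) - 1) * c_pos
    c_pos_new = c_pos - eta * (sigmoid(c_pos @ w) - 1) * w
    c_negs_new = []
    for c in c_negs:
        g_w += sigmoid(c @ w) * c
        c_negs_new.append(c - eta * sigmoid(c @ w) * w)
    return w - eta * g_w, c_pos_new, c_negs_new

# Random toy embeddings (dimension 3, two negative samples).
rng = np.random.default_rng(0)
w, c_pos = rng.normal(size=3), rng.normal(size=3)
c_negs = [rng.normal(size=3) for _ in range(2)]

before = sgns_loss(w, c_pos, c_negs)
w2, c_pos2, c_negs2 = sgd_step(w, c_pos, c_negs)
after = sgns_loss(w2, c_pos2, c_negs2)
print(before, after)  # loss should decrease after one step
```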
Two sets of embeddings
SGNS learns two sets of embeddings
Target embeddings matrix W
Context embedding matrix C
It's common to just add them together, representing word i as the vector w_i + c_i
Summary: How to learn word2vec (skip-gram) embeddings
Start with |V| random d-dimensional vectors as initial embeddings
Train a classifier based on embedding similarity