AI Experience Lab
[Word Embeddings]
Ue-Hwan Kim
Contents
How do we represent the meaning of a word?
Definition of meaning (Webster's dictionary): the idea that is represented by a word, phrase, etc.; the idea that a person wants to express by using words, signs, etc.
Commonest linguistic way of thinking of meaning: signifier (symbol) ⟺ signified (idea or thing) = denotational semantics
10
credit: Stanford CS224n
How do we have usable meaning in a computer?
Previously commonest NLP solution: Use, e.g., WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships)
11
credit: Stanford CS224n
Problems with resources like WordNet
A useful resource but missing nuance (e.g., "proficient" is listed as a synonym for "good", which is only correct in some contexts)
Missing new meanings of words (e.g., "wicked", "badass", "ninja"); impossible to keep up-to-date
Subjective
Requires human labor to create and adapt
Can’t be used to accurately compute word similarity (see following slides)
13
credit: Stanford CS224n
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols:
Such symbols for words can be represented by one-hot vectors (one-hot means one 1, the rest 0s):
Vector dimension = number of words in vocabulary (e.g., 500,000+)
14
credit: Stanford CS224n
Problem with words as discrete symbols
Example: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”
But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
These two vectors are orthogonal
There is no natural notion of similarity for one-hot vectors!
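A minimal numpy sketch of this orthogonality (the 15-dimensional vocabulary and the word indices are illustrative assumptions):

```python
import numpy as np

vocab_size = 15                                  # toy vocabulary; real ones have 500,000+ words
motel = np.zeros(vocab_size); motel[10] = 1.0    # one-hot vector for "motel" (assumed index)
hotel = np.zeros(vocab_size); hotel[7] = 1.0     # one-hot vector for "hotel" (assumed index)

print(motel @ hotel)   # 0.0 -> orthogonal: the representation encodes no similarity
print(motel @ motel)   # 1.0 -> each word is only "similar" to itself
```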
Solution: learn to encode similarity in the vectors themselves (see the following slides on word vectors)
15
credit: Stanford CS224n
Representing words by their context
Distributional semantics: A word’s meaning is given by the words that frequently appear close-by
When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
We use the many contexts of w to build up a representation of w
16
credit: Stanford CS224n
Example: the context words appearing near occurrences of "banking" will be used to represent "banking"
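A small sketch of extracting fixed-size context windows (the example sentence, tokenization, and window size are illustrative assumptions):

```python
def context_windows(tokens, window=2):
    """Yield (center word, context words) pairs for a fixed-size window."""
    for t, center in enumerate(tokens):
        context = tokens[max(0, t - window):t] + tokens[t + 1:t + 1 + window]
        yield center, context

sentence = "government debt problems turning into banking crises".split()
for center, context in context_windows(sentence):
    if center == "banking":
        print(center, "->", context)   # ['turning', 'into', 'crises']
```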
Poll
Which is not correct?
17
Contents
19
Word vectors
We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts, measuring similarity as the vector dot (scalar) product
20
credit: Stanford CS224n
Note: word vectors are also called (word) embeddings or (neural) word representations. They are a distributed representation.
Word meaning as a neural word vector – visualization
21
credit: Stanford CS224n
Word2vec overview
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors
Idea:
We have a large corpus ("body") of text, and every word in a fixed vocabulary is represented by a vector
Go through each position t in the text, which has a center word c and context ("outside") words o
Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
Keep adjusting the word vectors to maximize this probability
22
credit: Stanford CS224n
Word2vec overview
Example windows and process for computing P(w_{t+j} | w_t)
23
credit: Stanford CS224n
[Figure: example text "... problems turning into banking crisis as ..." with the center word at position t; the outside context words are the words in a window of size 2 on each side, and we compute P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t).]
Word2vec: objective function
For each position t = 1, ..., T, predict context words within a window of fixed size m, given center word w_t. Data likelihood:
25
credit: Stanford CS224n
Minimizing objective function ⟺ Maximizing predictive accuracy
The objective function J(θ) is the (average) negative log likelihood:
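Written out (the standard skip-gram likelihood and its average negative log, consistent with the notation above):

L(\theta) = \prod_{t=1}^{T} \;\prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)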
Word2vec: objective function
We want to minimize the objective function. Question: how do we calculate P(w_{t+j} | w_t; θ)? Answer: we will use two vectors per word w: v_w when w is a center word, and u_w when w is a context ("outside") word.
26
credit: Stanford CS224n
Word2vec with vectors
Example windows and process for computing P(w_{t+j} | w_t)
27
[Figure: for the center word "into" at position t in "... problems turning into banking crisis as ...", the outside context words in a window of size 2 are "problems", "turning", "banking", "crisis", and we compute P(u_problems | v_into), P(u_turning | v_into), P(u_banking | v_into), P(u_crisis | v_into).]
credit: Stanford CS224n
Word2vec: prediction function
This is an example of the softmax function, softmax: R^n → (0,1)^n
The softmax function maps arbitrary values x_i to a probability distribution p_i
28
credit: Stanford CS224n
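For reference, the prediction function that the numbered notes below annotate is the standard word2vec softmax, for a center word c with vector v_c and an outside word o with vector u_o:

P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)},
\qquad
\operatorname{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i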
1) The dot product u_o^T v_c = Σ_i u_i v_i compares the similarity of o and c; a larger dot product means a larger probability
2) Exponentiation makes anything positive
3) Normalizing over the entire vocabulary gives a probability distribution
Contents
29
To train the model: optimize the values of the parameters to minimize loss
To train a model, we gradually adjust its parameters to minimize a loss
Recall that θ represents all of the model parameters, in one long vector
With d-dimensional vectors and V-many words (and two vectors per word), θ has 2dV entries
We optimize these parameters by walking down the gradient
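As a formula, one step of walking down the gradient (gradient descent with learning rate α; stochastic gradient descent applies the same update using the gradient from a single window or minibatch):

\theta^{\text{new}} = \theta^{\text{old}} - \alpha \,\nabla_{\theta} J(\theta)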
30
credit: Stanford CS224n
Word2vec derivations of gradient
31
credit: Stanford CS224n
Chain Rule
Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:
32
Simple example:
credit: Stanford CS224n
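Written out, with one simple worked instance (a standard illustration):

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

For example, if y = u^5 and u = x^2 + 1, then \frac{dy}{dx} = 5u^4 \cdot 2x = 10x\,(x^2 + 1)^4.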
Word2vec derivations of gradient
Our objective function:
33
Gradient for the center word:
credit: Stanford CS224n
Word2vec derivations of gradient
34
credit: Stanford CS224n
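Spelled out in the standard skip-gram-with-softmax form (using the softmax P(o | c) from the prediction-function slide):

\log P(o \mid c) = u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c)

\frac{\partial}{\partial v_c} \log P(o \mid c)
= u_o - \sum_{x \in V} \frac{\exp(u_x^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}\, u_x
= u_o - \sum_{x \in V} P(x \mid c)\, u_x

i.e., the observed outside vector minus the expected outside vector under the model.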
Calculating all gradients!
We went through the gradient for each center vector v in a window
We also need gradients for outside vectors u: Do it on your own!
Generally, in each window we will compute updates for all parameters that are being used in that window.
35
credit: Stanford CS224n
Word2vec maximizes objective function by putting similar words nearby in space
36
credit: Stanford CS224n
Word2vec: More details
Why two vectors? → Easier optimization. Average both at the end.
Two model variants:
1. Skip-grams (SG)
Predict context (“outside”) words (position independent) given center word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
This lecture so far: Skip-gram model
Additional efficiency in training:
1. Negative sampling
So far: Focus on naïve softmax (simpler but more expensive training method)
37
credit: Stanford CS224n
The skip-gram model with negative sampling
The normalization term (the sum over the entire vocabulary in the softmax denominator) is computationally expensive
Hence, in standard word2vec, you implement the skip-gram model with negative sampling
Main idea: train binary logistic regressions for a true pair (center word and a word in its context window) versus several “noise” pairs (the center word paired with a random word)
38
credit: Stanford CS224n
The skip-gram model with negative sampling
From paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)
Overall objective function with negative sampling
39
credit: Stanford CS224n
We take k negative samples (using word probabilities)
Maximize probability that real outside word appears; minimize probability that random words appear around center word
Sample with P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power (Z is the normalization constant)
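Per (center word c, outside word o) training pair, in the minimization form consistent with J(θ) above, where σ is the logistic (sigmoid) function and the k noise words w_1, ..., w_k are drawn from P(w):

J_{\text{neg-sample}}(u_o, v_c, U) = -\log \sigma(u_o^\top v_c) \;-\; \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}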
Stochastic gradients with negative sampling
We iteratively take gradients at each window for SGD
In each window, we only have at most 2m + 1 words plus 2km negative words with negative sampling, so ∇J is very sparse!
40
credit: Stanford CS224n
Stochastic gradients with negative sampling
So we only need to update the word vectors that actually appear!
Solution: either you need sparse matrix update operations to only update certain rows of full embedding matrices U and V, or you need to keep around a hash for word vectors
If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
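A minimal numpy sketch of the "only touch the rows that appear" idea (the matrix names U and V follow the slide; the shapes, learning rate, and helper sgd_window_update are illustrative assumptions):

```python
import numpy as np

vocab_size, d = 100_000, 100
U = 0.01 * np.random.randn(vocab_size, d)   # outside ("context") word vectors
V = 0.01 * np.random.randn(vocab_size, d)   # center word vectors
lr = 0.025                                  # learning rate

def sgd_window_update(center_idx, outside_idxs, grad_center, grad_outside):
    """Apply an SGD step that touches only the rows used in this window.

    outside_idxs: indices of outside words and negative samples (<= 2m + 2km rows)
    grad_center:  gradient w.r.t. the center vector, shape (d,)
    grad_outside: gradients w.r.t. each row in outside_idxs, shape (len(outside_idxs), d)
    """
    V[center_idx] -= lr * grad_center               # one row of V
    np.add.at(U, outside_idxs, -lr * grad_outside)  # a few rows of U (handles repeated indices)
```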
41
credit: Stanford CS224n
Poll
Which is not correct?
42
Contents
44
Why not capture co-occurrence counts directly?
There’s something weird about iterating through the whole corpus (perhaps many times); why don’t we just accumulate all the statistics of what words appear near each other?!?
Building a co-occurrence matrix X: either window-based (like word2vec, count words within a window around each word, capturing syntactic and semantic information) or word-document (counts per document, giving general topics, as in Latent Semantic Analysis)
45
credit: Stanford CS224n
Example: Window based co-occurrence matrix
Window length 1 (more common: 5–10)
Symmetric (irrelevant whether left or right context)
Example corpus: "I like deep learning." "I like NLP." "I enjoy flying."
46
credit: Stanford CS224n
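A sketch of building the window-based co-occurrence matrix X with window length 1 on the toy corpus above (whitespace tokenization is a simplifying assumption):

```python
from collections import defaultdict
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1

vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

counts = defaultdict(int)
for sent in corpus:
    tokens = sent.split()
    for t, w in enumerate(tokens):
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                counts[(idx[w], idx[tokens[j]])] += 1   # symmetric by construction

X = np.zeros((len(vocab), len(vocab)))
for (i, j), c in counts.items():
    X[i, j] = c

print(vocab)             # row/column order of X
print(X.astype(int))
```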
Co-occurrence vectors
Simple count co-occurrence vectors: vectors grow with the vocabulary; very high-dimensional and sparse, so downstream models are less robust
Low-dimensional vectors: store "most" of the important information in a fixed, small number of dimensions (a dense vector, usually 25–1000 dimensions)
47
credit: Stanford CS224n
Classic method: Dimensionality reduction on X
Singular Value Decomposition of co-occurrence matrix X
Factorizes X into UΣV^T, where U and V are orthonormal
48
credit: Stanford CS224n
Retain only the top k singular values, in order to generalize.
The resulting matrix M (reconstructed from the top k singular values) is the best rank-k approximation to X, in terms of least squares.
Classic linear algebra result. Expensive to compute for large matrices.
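A sketch of this reduction with numpy's SVD, assuming a co-occurrence matrix X as above and a target dimensionality k:

```python
import numpy as np

def svd_word_vectors(X, k=2):
    """Reduce a co-occurrence matrix to k-dimensional word vectors via truncated SVD."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(S) @ Vt
    return U[:, :k] * S[:k]                           # keep the top-k singular directions

# vectors = svd_word_vectors(X, k=2)   # rows are low-dimensional word vectors
```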
Hacks to X (several used in Rohde et al. 2005 in COALS)
Running an SVD on raw counts doesn’t work well!!!
Scaling the counts in the cells can help a lot
Ramped windows that count closer words more than further away words
Use Pearson correlations instead of counts, then set negative values to 0
49
credit: Stanford CS224n
Scaling the counts in the cells can help a lot
Problem: function words (the, he, has) are too frequent → syntax has too much impact
Pointwise mutual information (PMI)
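For a co-occurrence count matrix X with total count N, row sums X_{w\cdot} and column sums X_{\cdot c}, the PMI reweighting (and its common positive variant, PPMI) is:

\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{X_{wc}\, N}{X_{w\cdot}\, X_{\cdot c}},
\qquad
\mathrm{PPMI}(w, c) = \max\bigl(\mathrm{PMI}(w, c),\, 0\bigr)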
50
credit: Deep learning from scratch
Interesting semantic patterns emerge in the scaled vectors
51
credit: Stanford CS224n
Towards GloVe: Count based vs. direct prediction
52
Count-based: LSA, HAL (Lund & Burgess), COALS, Hellinger-PCA (Rohde et al.; Lebret & Collobert)
Direct prediction: Skip-gram/CBOW (Mikolov et al.), NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)
credit: Stanford CS224n
Encoding meaning components in vector differences
Crucial insight: Ratios of co-occurrence probabilities can encode meaning components
53
credit: Stanford CS224n
Encoding meaning components in vector differences
Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A: A log-bilinear model with vector differences (see below)
54
credit: Stanford CS224n
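In formulas: the log-bilinear model with vector differences, and the GloVe objective it leads to (Pennington et al. 2014):

w_i \cdot w_j = \log P(i \mid j),
\qquad
w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}

J = \sum_{i, j = 1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2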
Combining the best of both worlds: GloVe
Fast training
Scalable to huge corpora
Good performance even with small corpus and small vectors
55
credit: Stanford CS224n
GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
56
credit: Stanford CS224n
Poll
Which is correct?
57
Contents
59
How to evaluate word vectors?
Related to general evaluation in NLP: Intrinsic vs. extrinsic
Intrinsic: evaluation on a specific, intermediate subtask; fast to compute; helps to understand the system; not clearly helpful unless correlation with a real task is established
Extrinsic: evaluation on a real task; can take a long time to compute; unclear whether the word vectors or their interaction with the rest of the system is responsible
60
credit: Stanford CS224n
Intrinsic word vector evaluation
Word Vector Analogies ⇒ man:woman :: king:?
Problem: What if the information is there but not linear?
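A minimal sketch of the analogy test by vector arithmetic and cosine similarity (the tiny 3-dimensional embedding table is made up purely for illustration, not real trained vectors):

```python
import numpy as np

emb = {                      # hypothetical embeddings for illustration only
    "man":   np.array([0.6, 0.2, 0.1]),
    "woman": np.array([0.6, 0.2, 0.9]),
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def analogy(a, b, c, emb):
    """Return the word d maximizing cosine similarity to (b - a + c), i.e. a:b :: c:d."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target) + 1e-8)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "woman", "king", emb))   # expected: "queen"
```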
61
credit: Stanford CS224n
Intrinsic word vector evaluation
Korean word2vec: http://w.elnn.kr/search/
62
credit: Stanford CS224n
Another intrinsic word vector evaluation
Word vector distances and their correlation with human judgments
Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
63
credit: Stanford CS224n
Extrinsic word vector evaluation
One example where good word vectors should help directly: named entity recognition
64
credit: Stanford CS224n
Contents
65
Word senses and word sense ambiguity
Most words have lots of meanings!
Example: pike (a sharp point or staff, a type of elongated fish, a turnpike road, a diving position, ...)
Does one vector capture all these meanings or do we have a mess?
66
credit: Stanford CS224n
pike
67
credit: Stanford CS224n
Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters bank1, bank2, etc.
68
credit: Stanford CS224n
Contents
69
Summary
Today we:
Discussed how to represent the meaning of a word (from WordNet and one-hot vectors to distributional semantics)
Introduced word2vec (skip-gram), its objective function, and training with (stochastic) gradients and negative sampling
Looked at count-based co-occurrence methods, SVD, and GloVe
Covered intrinsic and extrinsic evaluation of word vectors, and word sense ambiguity
70