1 of 70

AI Experience Lab

[Word Embeddings]

Ue-Hwan Kim

2 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

2

3 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

3

4 of 70

4

credit: Stanford CS224n

5 of 70

5

credit: Stanford CS224n

6 of 70

6

credit: Stanford CS224n

7 of 70

7

credit: Stanford CS224n

8 of 70

8

credit: Stanford CS224n

9 of 70

9

credit: Stanford CS224n

10 of 70

How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)

  • the idea that is represented by a word, phrase, etc.
  • the idea that a person wants to express by using words, signs, etc.
  • the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning: signifier (symbol) ⟺ signified (idea or thing), i.e., denotational semantics

10

credit: Stanford CS224n

11 of 70

How do we have usable meaning in a computer?

Previously commonest NLP solution: Use, e.g., WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships)

11

credit: Stanford CS224n

12 of 70

How do we have usable meaning in a computer?

Previously commonest NLP solution: Use, e.g., WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships)

12

credit: Stanford CS224n

13 of 70

Problems with resources like WordNet

A useful resource but missing nuance:

  • e.g., “proficient” is listed as a synonym for “good”: This is only correct in some contexts
  • Also, WordNet lists offensive synonyms in some synonym sets without any coverage of the connotations or appropriateness of words

Missing new meanings of words:

  • e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
  • Impossible to keep up-to-date!

Subjective

Requires human labor to create and adapt

Can’t be used to accurately compute word similarity (see following slides)

13

credit: Stanford CS224n

14 of 70

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols:

  • hotel, conference, motel – a localist representation

Such symbols for words can be represented by one-hot vectors:

  • motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
  • hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000+)

14

Means one 1, the rest 0s

credit: Stanford CS224n

15 of 70

Problem with words as discrete symbols

Example: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”

But:

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal

There is no natural notion of similarity for one-hot vectors!
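A minimal numpy sketch of this point, with hypothetical vocabulary positions for the two words: the dot product of any two distinct one-hot vectors is zero, so they carry no similarity signal.

```python
import numpy as np

V = 15                        # toy vocabulary size (hypothetical)
motel_idx, hotel_idx = 10, 7  # hypothetical positions of "motel" and "hotel"

motel = np.zeros(V); motel[motel_idx] = 1.0   # one-hot vector for "motel"
hotel = np.zeros(V); hotel[hotel_idx] = 1.0   # one-hot vector for "hotel"

print(motel @ hotel)   # 0.0 -> orthogonal: no notion of similarity
print(motel @ motel)   # 1.0 -> each word is only "similar" to itself
```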

Solution:

  • Could try to rely on WordNet’s list of synonyms to get similarity?
    • But it is well-known to fail badly: incompleteness, etc.
  • Instead: learn to encode similarity in the vectors themselves

15

credit: Stanford CS224n

16 of 70

Representing words by their context

Distributional semantics: A word’s meaning is given by the words that frequently appear close-by

  • “You shall know a word by the company it keeps” (J. R. Firth 1957)
  • One of the most successful ideas of modern statistical NLP!

When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

We use the many contexts of w to build up a representation of w

16

credit: Stanford CS224n

These context words will represent banking

17 of 70

Poll

Which is not correct?

  • Conventionally, researchers tried to represent meaning in a computer through synonyms and hypernyms
  • Representing words by their context words is one of the conventional approaches

17

18 of 70

Poll

Which is not correct?

  • Conventionally, researchers tried to represent meaning in a computer through synonyms and hypernyms
  • Representing words by their context words is one of the conventional approaches

18

19 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

19

20 of 70

Word vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts, measuring similarity as the vector dot (scalar) product

20

credit: Stanford CS224n

Note: word vectors are also called (word) embeddings or (neural) word representations. They are a distributed representation.

21 of 70

Word meaning as a neural word vector – visualization

21

credit: Stanford CS224n

22 of 70

Word2vec overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors

Idea:

  • We have a large corpus (“body”) of text: a long list of words
  • Every word in a fixed vocabulary is represented by a vector
  • Go through each position t in the text, which has a center word c and context (“outside”) words o (see the sketch after this list)
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  • Keep adjusting the word vectors to maximize this probability
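A minimal sketch of the sliding-window idea in the list above, assuming a whitespace-tokenized toy corpus and window size m = 2; the function and variable names are illustrative, not from any particular library.

```python
def skipgram_pairs(tokens, m=2):
    """Yield (center, outside) word pairs for a window of size m around each position t."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

corpus = "problems turning into banking crisis as".split()
for center, outside in skipgram_pairs(corpus):
    print(center, "->", outside)
```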

22

credit: Stanford CS224n

23 of 70

Word2vec overview

Example windows and process for computing P(w_{t+j} | w_t)

23

credit: Stanford CS224n

[Figure: a window of size 2 slides over the text “… problems turning into banking crisis as …”; the center word at position t is used to score the outside context words in the window with P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), and P(w_{t+2} | w_t).]

24 of 70

Word2vec overview

Example windows and process for computing P(w_{t+j} | w_t)

24

credit: Stanford CS224n

[Figure: the same sliding-window example at another position t; the outside context words in the size-2 window are again scored with P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), and P(w_{t+2} | w_t).]

25 of 70

Word2vec: objective function

For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t. Data likelihood:

\[ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta) \]

25

credit: Stanford CS224n

The objective function J(θ) is the (average) negative log likelihood:

\[ J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta) \]

Minimizing the objective function ⟺ maximizing predictive accuracy

26 of 70

Word2vec: objective function

We want to minimize the objective function

  • Question: How do we calculate P(w_{t+j} | w_t; θ)?
  • Answer: We will use two vectors per word w:
    • v_w: when w is a center word
    • u_w: when w is a context word
  • Then for a center word c and a context word o (see the sketch below):

\[ P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} \]
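A minimal numpy sketch of this probability, assuming U and V are (|V| × d) matrices whose rows are the outside and center vectors; this is the naive softmax that normalizes over the whole vocabulary.

```python
import numpy as np

def p_outside_given_center(o, c, U, V):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]            # dot product of v_c with every outside vector u_w
    scores -= scores.max()       # subtract the max for numerical stability (result unchanged)
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 50))  # toy outside-word vectors
V = rng.normal(size=(1000, 50))  # toy center-word vectors
print(p_outside_given_center(o=3, c=7, U=U, V=V))
```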

26

credit: Stanford CS224n

27 of 70

Word2vec with vectors

Example windows and process for computing P(w_{t+j} | w_t)

27

[Figure: the window example “… problems turning into banking crisis as …” with center word “into”; the model computes P(u_problems | v_into), P(u_turning | v_into), P(u_banking | v_into), and P(u_crisis | v_into) for the outside context words in the size-2 window.]

credit: Stanford CS224n

28 of 70

Word2vec: prediction function

This is an example of the softmax function R^n → (0,1)^n:

\[ \mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i \]

The softmax function maps arbitrary values x_i to a probability distribution p_i

  • “max” because it amplifies the probability of the largest x_i
  • “soft” because it still assigns some probability to smaller x_i
  • Frequently used in Deep Learning

28

credit: Stanford CS224n

1) The dot product compares the similarity of o and c: u_o^\top v_c = \sum_i u_{o,i} v_{c,i}. A larger dot product means a larger probability.

2) Exponentiation makes anything positive.

3) Normalize over the entire vocabulary to give a probability distribution.

29 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

29

30 of 70

To train the model: Optimize value of parameters to minimize loss

To train a model, we gradually adjust parameters to minimize a loss

  • Recall: θ represents all the model parameters, in one long vector
  • In our case, d-dimensional vectors and V-many words
  • Remember: every word has two vectors
  • We optimize these parameters by walking down the gradient (see the sketch below)
  • We compute all vector gradients!
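A minimal sketch of “walking down the gradient”, with a toy objective standing in for J(θ); the learning rate and function names are illustrative.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.025):
    """One gradient-descent step: move the parameters against the gradient."""
    return theta - lr * grad

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    theta = sgd_step(theta, grad=2 * theta, lr=0.1)
print(theta)   # close to the minimizer [0, 0, 0]
```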

30

credit: Stanford CS224n

31 of 70

Word2vec derivations of gradient

  • The basic Lego piece: The chain rule
  • Useful basic fact:
  • If in doubt: write it out with indices

31

credit: Stanford CS224n

32 of 70

Chain Rule

Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

\[ \frac{dy}{dx} = \frac{dy}{du} \, \frac{du}{dx} \]

32

Simple example:
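One worked one-variable instance of the rule (an illustrative example, not necessarily the one pictured on the original slide):

\[ y = (x^3 + 7)^4, \quad u = x^3 + 7 \;\Rightarrow\; \frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx} = 4u^3 \cdot 3x^2 = 12x^2 (x^3 + 7)^3 \]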

credit: Stanford CS224n

33 of 70

Word2vec derivations of gradient

Our objective function (from before):

\[ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta), \qquad P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} \]

33

Gradient for the center word (derivation written out below):

\[ \frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c) \, u_w \]
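Written out step by step (standard chain-rule algebra, using u_o, v_c, and the softmax P(o | c) defined above):

\[
\begin{aligned}
\frac{\partial}{\partial v_c} \log P(o \mid c)
&= \frac{\partial}{\partial v_c} \Big( u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c) \Big) \\
&= u_o - \frac{\sum_{w \in V} \exp(u_w^\top v_c)\, u_w}{\sum_{x \in V} \exp(u_x^\top v_c)} \\
&= u_o - \sum_{w \in V} P(w \mid c)\, u_w
\end{aligned}
\]

That is, the gradient is the observed outside vector minus the expected outside vector under the current model.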

credit: Stanford CS224n

34 of 70

Word2vec derivations of gradient

34

credit: Stanford CS224n

35 of 70

Calculating all gradients!

We went through the gradient for each center vector v in a window

We also need gradients for outside vectors u: Do it on your own!

Generally, in each window we will compute updates for all parameters that are being used in that window.

35

credit: Stanford CS224n

36 of 70

Word2vec maximizes objective function by putting similar words nearby in space

36

credit: Stanford CS224n

37 of 70

Word2vec: More details

Why two vectors? → Easier optimization. Average both at the end.

Two model variants:

1. Skip-grams (SG)

Predict context (“outside”) words (position independent) given center word

2. Continuous Bag of Words (CBOW)

Predict center word from (bag of) context words

This lecture so far: Skip-gram model

Additional efficiency in training:

1. Negative sampling

So far: Focus on naïve softmax (simpler but more expensive training method)

37

credit: Stanford CS224n

38 of 70

The skip-gram model with negative sampling

The normalization term is computationally expensive

Hence, in standard word2vec, you implement the skip-gram model with negative sampling

Main idea: train binary logistic regressions for a true pair (center word and a word in its context window) versus several “noise” pairs (the center word paired with a random word)
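A minimal numpy sketch of these binary logistic scores, assuming v_c is the center vector, u_o the true outside vector, and U_neg a (k × d) matrix of sampled noise vectors; the returned value is the negated form of the usual skip-gram negative-sampling objective, so that smaller is better.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, U_neg):
    """Negative-sampling loss for one true (center, outside) pair plus k noise words."""
    pos = np.log(sigmoid(u_o @ v_c))            # push the true pair's score up
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()   # push the k noise pairs' scores down
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 50, 5
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```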

38

credit: Stanford CS224n

39 of 70

The skip-gram model with negative sampling

From paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

Overall objective function with negative sampling (to maximize, for one center/outside pair):

\[ J(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} \big[ \log \sigma(-u_j^\top v_c) \big] \]

39

credit: Stanford CS224n

We take k negative samples (using word probabilities)

Maximize probability that real outside word appears; minimize probability that random words appear around center word

Sample with P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power (Z is the normalizing constant).

  • The 3/4 power makes less frequent words be sampled more often than their raw frequency would suggest (see the sketch below)
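A minimal sketch of this sampling distribution, assuming `counts` holds raw unigram counts for a toy vocabulary:

```python
import numpy as np

counts = np.array([90, 7, 2, 1], dtype=float)   # toy unigram counts U(w)
probs = counts ** 0.75
probs /= probs.sum()                            # divide by Z to get a distribution

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=probs)   # draw k = 5 negative samples
print(probs)        # rare words get relatively more mass than counts / counts.sum()
print(negatives)
```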

40 of 70

Stochastic gradients with negative sampling

We iteratively take gradients at each window for SGD

In each window, we only have at most 2m + 1 words plus 2km negative words with negative sampling, so ∇J is very sparse!

40

credit: Stanford CS224n

41 of 70

Stochastic gradients with negative sampling

We might only update the word vectors that actually appear!

Solution: either you need sparse matrix update operations to only update certain rows of full embedding matrices U and V, or you need to keep around a hash for word vectors

If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
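A minimal sketch of the row-wise update idea, assuming U is a full embedding matrix and `window_ids` indexes only the words that actually occur in the current window; the names and sizes are illustrative.

```python
import numpy as np

def sparse_update(M, row_ids, row_grads, lr=0.025):
    """Update only the embedding rows that were used in this window."""
    for i, g in zip(row_ids, row_grads):
        M[i] -= lr * g

rng = np.random.default_rng(0)
U = rng.normal(size=(50_000, 100))            # toy full outside-vector matrix
window_ids = [7, 42, 1001]                    # the few rows touched by this window
row_grads = [rng.normal(size=100) for _ in window_ids]
sparse_update(U, window_ids, row_grads)
```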

41

credit: Stanford CS224n

42 of 70

Poll

Which is not correct?

  • Word vectors are a discrete representation of words
  • Word vectors are also called (word) embeddings or word representations
  • Training word vectors typically requires a large corpus
  • Skip-grams predict context words given the center word

42

43 of 70

Poll

Which is not correct?

  • Word vectors are a discrete representation of words
  • Word vectors are also called (word) embeddings or word representations
  • Training word vectors typically requires a large corpus
  • Skip-grams predict context words given the center word

43

44 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

44

45 of 70

Why not capture co-occurrence counts directly?

There’s something weird about iterating through the whole corpus (perhaps many times); why don’t we just accumulate all the statistics of what words appear near each other?!?

Building a co-occurrence matrix X

  • 2 options: windows vs. full document
  • Window: Similar to word2vec, use window around each word → captures some syntactic and semantic information (“word space”)
  • Word-document co-occurrence matrix will give general topics (all sports terms will have similar entries) leading to “Latent Semantic Analysis” (“document space”)

45

credit: Stanford CS224n

46 of 70

Example: Window based co-occurrence matrix

Window length 1 (more common: 5–10)

Symmetric (irrelevant whether left or right context)

Example corpus (a sketch that builds the matrix follows the list):

  • I like deep learning
  • I like NLP
  • I enjoy flying
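A minimal sketch that builds this window-1, symmetric co-occurrence matrix for the example corpus; the variable names are illustrative.

```python
import numpy as np

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
docs = [s.split() for s in corpus]

vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for d in docs:
    for t, w in enumerate(d):
        for j in range(max(0, t - window), min(len(d), t + window + 1)):
            if j != t:
                X[idx[w], idx[d[j]]] += 1    # symmetric: left and right contexts both count

print(vocab)
print(X)
```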

46

credit: Stanford CS224n

47 of 70

Co-occurrence vectors

Simple count co-occurrence vectors

  • Vectors increase in size with vocabulary
  • Very high dimensional: require a lot of storage (though sparse)
  • Subsequent classification models have sparsity issues → Models are less robust

Low-dimensional vectors

  • Idea: store “most” of the important information in a fixed, small number of dimensions → a dense vector
  • Usually 25–1000 dimensions, similar to word2vec
  • How to reduce the dimensionality?

47

credit: Stanford CS224n

48 of 70

Classic method: Dimensionality reduction on X

Singular Value Decomposition of co-occurrence matrix X

Factorizes X into UΣVᵀ, where U and V are orthonormal

48

credit: Stanford CS224n

Retain only k singular values, in order to generalize.

The truncated matrix is the best rank-k approximation to X, in terms of least squares.

Classic linear algebra result. Expensive to compute for large matrices.
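A minimal numpy sketch of this truncation, reusing a small co-occurrence matrix X (e.g., the one built above) and keeping the top k singular values; the rows of U_k Σ_k are then used as k-dimensional word vectors.

```python
import numpy as np

def svd_word_vectors(X, k=2):
    """Reduce a (V x V) co-occurrence matrix to (V x k) dense word vectors."""
    U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
    # U_k Sigma_k V_k^T is the best rank-k least-squares approximation of X;
    # the rows of U_k Sigma_k serve as the word vectors.
    return U[:, :k] * S[:k]

X = np.array([[0, 2, 1],
              [2, 0, 0],
              [1, 0, 0]], dtype=float)   # tiny toy co-occurrence matrix
print(svd_word_vectors(X, k=2))
```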

49 of 70

Hacks to X (several used in Rohde et al. 2005 in COALS)

Running an SVD on raw counts doesn’t work well!!!

Scaling the counts in the cells can help a lot

  • Problem: function words (the, he, has) are too frequent → syntax has too much impact. Some fixes:
    • log the frequencies
    • min(X, t), with t ≈ 100
    • Ignore the function words

Ramped windows that count closer words more than further away words

Use Pearson correlations instead of counts, then set negative values to 0

49

credit: Stanford CS224n

50 of 70

Scaling the counts in the cells can help a lot

Problem: function words (the, he, has) are too frequent → syntax has too much impact

Pointwise mutual information (PMI):

\[ \mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \]
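A minimal sketch of PMI applied to a co-occurrence count matrix; the clipping at zero is the common positive-PMI (PPMI) variant, since raw PMI is negative for pairs that co-occur less than chance would predict.

```python
import numpy as np

def ppmi(X, eps=1e-8):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    total = X.sum()
    p_xy = X / total                            # joint probabilities
    p_x = X.sum(axis=1, keepdims=True) / total  # row-word marginals
    p_y = X.sum(axis=0, keepdims=True) / total  # column-word marginals
    pmi = np.log2(p_xy / (p_x * p_y) + eps)     # eps avoids log(0) for zero counts
    return np.maximum(pmi, 0.0)                 # clip negative values: PPMI

X = np.array([[0, 2, 1], [2, 0, 0], [1, 0, 0]], dtype=float)
print(ppmi(X))
```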

50

credit: Deep learning from scratch

51 of 70

Interesting semantic patterns emerge in the scaled vectors

51

credit: Stanford CS224n

52 of 70

Towards GloVe: Count based vs. direct prediction

52

Count-based: LSA, HAL (Lund & Burgess), COALS, Hellinger-PCA (Rohde et al.; Lebret & Collobert)

  • Fast training
  • Efficient usage of statistics
  • Primarily used to capture word similarity
  • Disproportionate importance given to large counts

Direct prediction: Skip-gram/CBOW (Mikolov et al.), NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)

  • Scales with corpus size
  • Inefficient usage of statistics
  • Generates improved performance on other tasks
  • Can capture complex patterns beyond word similarity

credit: Stanford CS224n

53 of 70

Encoding meaning components in vector differences

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

53

credit: Stanford CS224n

54 of 70

Encoding meaning components in vector differences

Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space (i.e., embedding ratios of co-occurrence probabilities)?

A: Log-bilinear model: w_i · w_j = log P(i | j)

54

with vector differences: w_x · (w_a − w_b) = log [ P(x | a) / P(x | b) ]

credit: Stanford CS224n

55 of 70

Combining the best of both worlds: GloVe

Fast training

Scalable to huge corpora

Good performance even with small corpus and small vectors
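For reference, the objective GloVe optimizes (Pennington et al. 2014) is a weighted least-squares fit of vector dot products to log co-occurrence counts:

\[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]

where f(X_{ij}) is a weighting function that caps the influence of very large counts.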

55

credit: Stanford CS224n

56 of 70

GloVe results

Nearest words to frog:

1. frogs

2. toad

3. litoria

4. leptodactylidae

5. rana

6. lizard

7. eleutherodactylus

56

credit: Stanford CS224n

57 of 70

Poll

Which is correct?

  • Count-based co-occurrence matrices contain low dimensional word vectors
  • Singular value decomposition is a computationally-efficient way of reducing the dimensions of word vectors
  • Pointwise mutual information outputs positive values
  • Learning the statistics of co-occurrence is similar to constructing a co-occurrence matrix

57

58 of 70

Poll

Which is correct?

  • Count-based co-occurrence matrices contain low dimensional word vectors
  • Singular value decomposition is a computationally-efficient way of reducing the dimensions of word vectors
  • Pointwise mutual information outputs positive values
  • Learning the statistics of co-occurrence is similar to constructing a co-occurrence matrix

58

59 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

59

60 of 70

How to evaluate word vectors?

Related to general evaluation in NLP: Intrinsic vs. extrinsic

Intrinsic:

  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation to real task is established

Extrinsic:

  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear if the subsystem is the problem, or its interaction with other subsystems
  • If replacing exactly one subsystem with another improves accuracy ⇒ Winning!

60

credit: Stanford CS224n

61 of 70

Intrinsic word vector evaluation

Word Vector Analogies ⇒ man:woman :: king:?

  • Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions

Problem: What if the information is there but not linear?
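A minimal sketch of the analogy test, assuming `vecs` is a dict mapping words to numpy vectors (e.g., loaded from pretrained GloVe or word2vec files); it returns the vocabulary word whose vector is closest, by cosine similarity, to vec(b) − vec(a) + vec(c), excluding the three query words.

```python
import numpy as np

def analogy(a, b, c, vecs):
    """Solve a : b :: c : ?  via the nearest cosine neighbor of  vec(b) - vec(a) + vec(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vecs.items():
        if w in (a, b, c):
            continue                              # exclude the query words themselves
        sim = (v @ target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Usage with hypothetical embeddings: analogy("man", "woman", "king", vecs) -> "queen"
```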

61

credit: Stanford CS224n

62 of 70

Intrinsic word vector evaluation

Korean word2vec: http://w.elnn.kr/search/

62

credit: Stanford CS224n

63 of 70

Another intrinsic word vector evaluation

Word vector distances and their correlation with human judgments

Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
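A minimal sketch of this evaluation, assuming `pairs` holds (word1, word2, human_score) triples such as those in WordSim353 and `vecs` maps words to numpy vectors; the reported score is the Spearman correlation between model cosine similarities and human judgments.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_eval(pairs, vecs):
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in vecs and w2 in vecs:            # skip out-of-vocabulary pairs
            v1, v2 = vecs[w1], vecs[w2]
            model.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            human.append(score)
    return spearmanr(model, human).correlation   # higher = closer to human judgments

# Usage with hypothetical data: similarity_eval(wordsim353_pairs, vecs)
```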

63

credit: Stanford CS224n

64 of 70

Extrinsic word vector evaluation

One example where good word vectors should help directly: named entity recognition

64

credit: Stanford CS224n

65 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

65

66 of 70

Word senses and word sense ambiguity

Most words have lots of meanings!

  • Especially common words
  • Especially words that have existed for a long time

Example: pike

Does one vector capture all these meanings or do we have a mess?

66

credit: Stanford CS224n

67 of 70

pike

  • A sharp point or staff
  • A type of elongated fish
  • A railroad line or system
  • A type of road
  • The future (coming down the pike)
  • A type of body position (as in diving)
  • To kill or pierce with a pike
  • To make one’s way (pike along)
  • In Australian English, pike means to pull out from doing something: I reckon he could have climbed that cliff, but he piked!

67

credit: Stanford CS224n

68 of 70

Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

• Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters: bank1, bank2, etc. (see the sketch below)
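A minimal sketch of the clustering step (an illustration of the idea, not Huang et al.'s exact pipeline), assuming `context_vecs` stacks averaged context-window vectors for many occurrences of one word such as "bank"; each occurrence is then relabeled bank_0, bank_1, … by cluster before retraining.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
context_vecs = rng.normal(size=(200, 50))   # toy stand-in: 200 occurrences of "bank", d = 50

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(context_vecs)
relabeled = [f"bank_{k}" for k in labels]   # bank_0, bank_1, bank_2, ...
print(relabeled[:10])
```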

68

credit: Stanford CS224n

69 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

69

70 of 70

Summary

Today we

  • talked about the motivation of word vectors
  • discussed how we train word vectors
  • had a look at evaluating word vectors
  • introduced some future research directions

70