1 of 70

AI Experience Lab

[Word Embeddings]

Ue-Hwan Kim

2 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

2

3 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

3

4 of 70

4

credit: Stanford CS224n

5 of 70

5

credit: Stanford CS224n

6 of 70

6

credit: Stanford CS224n

7 of 70

7

credit: Stanford CS224n

8 of 70

8

credit: Stanford CS224n

9 of 70

9

credit: Stanford CS224n

10 of 70

How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)

  • the idea that is represented by a word, phrase, etc.
  • the idea that a person wants to express by using words, signs, etc.
  • the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning: signifier (symbol) ⟺ signified (idea or thing), i.e., denotational semantics

10

credit: Stanford CS224n

11 of 70

How do we have usable meaning in a computer?

Previously commonest NLP solution: Use, e.g., WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships)

11

credit: Stanford CS224n

12 of 70

How do we have usable meaning in a computer?

Previously commonest NLP solution: Use, e.g., WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships)

12

credit: Stanford CS224n

13 of 70

Problems with resources like WordNet

A useful resource but missing nuance:

  • e.g., “proficient” is listed as a synonym for “good”: This is only correct in some contexts
  • Also, WordNet lists offensive synonyms in some synonym sets without any coverage of the connotations or appropriateness of words

Missing new meanings of words:

  • e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
  • Impossible to keep up-to-date!

Subjective

Requires human labor to create and adapt

Can’t be used to accurately compute word similarity (see following slides)

13

credit: Stanford CS224n

14 of 70

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols:

  • hotel, conference, motel – a localist representation

Such symbols for words can be represented by one-hot vectors:

  • motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
  • hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000+)

14

Means one 1, the rest 0s

credit: Stanford CS224n

15 of 70

Problem with words as discrete symbols

Example: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”

But:

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal

There is no natural notion of similarity for one-hot vectors!
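A minimal numpy sketch of this point, with hypothetical vocabulary positions for the two words: the dot product of any two distinct one-hot vectors is zero, so they carry no similarity signal.

```python
import numpy as np

V = 15                        # toy vocabulary size (hypothetical)
motel_idx, hotel_idx = 10, 7  # hypothetical positions of "motel" and "hotel"

motel = np.zeros(V); motel[motel_idx] = 1.0   # one-hot vector for "motel"
hotel = np.zeros(V); hotel[hotel_idx] = 1.0   # one-hot vector for "hotel"

print(motel @ hotel)   # 0.0 -> orthogonal: no notion of similarity
print(motel @ motel)   # 1.0 -> each word is only "similar" to itself
```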

Solution:

  • Could try to rely on WordNet’s list of synonyms to get similarity?
    • But it is well-known to fail badly: incompleteness, etc.
  • Instead: learn to encode similarity in the vectors themselves

15

credit: Stanford CS224n

16 of 70

Representing words by their context

Distributional semantics: A word’s meaning is given by the words that frequently appear close-by

  • “You shall know a word by the company it keeps” (J. R. Firth 1957)
  • One of the most successful ideas of modern statistical NLP!

When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

We use the many contexts of w to build up a representation of w

16

credit: Stanford CS224n

These context words will represent banking

17 of 70

Poll

Which is not correct?

  • Conventionally, researchers tried to represent meaning in a computer through synonyms and hypernyms
  • Representing words by their context words is one of the conventional approaches

17

18 of 70

Poll

Which is not correct?

  • Conventionally, researchers tried to represent meaning in a computer through synonyms and hypernyms
  • Representing words by their context words is one of the conventional approaches

18

19 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

19

20 of 70

Word vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts, measuring similarity as the vector dot (scalar) product

20

credit: Stanford CS224n

Note: word vectors are also called (word) embeddings or (neural) word representations. They are a distributed representation.

21 of 70

Word meaning as a neural word vector – visualization

21

credit: Stanford CS224n

22 of 70

Word2vec overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors

Idea:

  • We have a large corpus (“body”) of text: a long list of words
  • Every word in a fixed vocabulary is represented by a vector
  • Go through each position t in the text, which has a center word c and context (“outside”) words o (see the sketch after this list)
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  • Keep adjusting the word vectors to maximize this probability
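A minimal sketch of the sliding-window idea in the list above, assuming a whitespace-tokenized toy corpus and window size m = 2; the function and variable names are illustrative, not from any particular library.

```python
def skipgram_pairs(tokens, m=2):
    """Yield (center, outside) word pairs for a window of size m around each position t."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

corpus = "problems turning into banking crisis as".split()
for center, outside in skipgram_pairs(corpus):
    print(center, "->", outside)
```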

22

credit: Stanford CS224n

23 of 70

Word2vec overview

Example windows and process for computing P(w_{t+j} | w_t)

23

credit: Stanford CS224n

[Figure: a window of size 2 slides over the text “… problems turning into banking crisis as …”; the center word at position t is used to score the outside context words in the window with P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), and P(w_{t+2} | w_t).]

24 of 70

Word2vec overview

Example windows and process for computing P(w_{t+j} | w_t)

24

credit: Stanford CS224n

[Figure: the same sliding-window example at another position t; the outside context words in the size-2 window are again scored with P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), and P(w_{t+2} | w_t).]

25 of 70

Word2vec: objective function

For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t. Data likelihood:

\[ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta) \]

25

credit: Stanford CS224n

The objective function J(θ) is the (average) negative log likelihood:

\[ J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta) \]

Minimizing the objective function ⟺ maximizing predictive accuracy

26 of 70

Word2vec: objective function

We want to minimize the objective function

  • Question: How do we calculate P(w_{t+j} | w_t; θ)?
  • Answer: We will use two vectors per word w:
    • v_w: when w is a center word
    • u_w: when w is a context word
  • Then for a center word c and a context word o (see the sketch below):

\[ P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} \]
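A minimal numpy sketch of this probability, assuming U and V are (|V| × d) matrices whose rows are the outside and center vectors; this is the naive softmax that normalizes over the whole vocabulary.

```python
import numpy as np

def p_outside_given_center(o, c, U, V):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]            # dot product of v_c with every outside vector u_w
    scores -= scores.max()       # subtract the max for numerical stability (result unchanged)
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 50))  # toy outside-word vectors
V = rng.normal(size=(1000, 50))  # toy center-word vectors
print(p_outside_given_center(o=3, c=7, U=U, V=V))
```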

26

credit: Stanford CS224n

27 of 70

Word2vec with vectors

Example windows and process for computing P(w_{t+j} | w_t)

27

[Figure: the window example “… problems turning into banking crisis as …” with center word “into”; the model computes P(u_problems | v_into), P(u_turning | v_into), P(u_banking | v_into), and P(u_crisis | v_into) for the outside context words in the size-2 window.]

credit: Stanford CS224n

28 of 70

Word2vec: prediction function

This is an example of the softmax function R^n → (0,1)^n:

\[ \mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i \]

The softmax function maps arbitrary values x_i to a probability distribution p_i

  • “max” because it amplifies the probability of the largest x_i
  • “soft” because it still assigns some probability to smaller x_i
  • Frequently used in Deep Learning

28

credit: Stanford CS224n

1) The dot product compares the similarity of o and c: u_o^\top v_c = \sum_i u_{o,i} v_{c,i}. A larger dot product means a larger probability.

2) Exponentiation makes anything positive.

3) Normalize over the entire vocabulary to give a probability distribution.

29 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

29

30 of 70

To train the model: Optimize value of parameters to minimize loss

To train a model, we gradually adjust parameters to minimize a loss

  • Recall: θ represents all the model parameters, in one long vector
  • In our case, d-dimensional vectors and V-many words
  • Remember: every word has two vectors
  • We optimize these parameters by walking down the gradient (see the sketch below)
  • We compute all vector gradients!
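A minimal sketch of “walking down the gradient”, with a toy objective standing in for J(θ); the learning rate and function names are illustrative.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.025):
    """One gradient-descent step: move the parameters against the gradient."""
    return theta - lr * grad

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    theta = sgd_step(theta, grad=2 * theta, lr=0.1)
print(theta)   # close to the minimizer [0, 0, 0]
```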

30

credit: Stanford CS224n

31 of 70

Word2vec derivations of gradient

  • The basic Lego piece: The chain rule
  • Useful basic fact:
  • If in doubt: write it out with indices

31

credit: Stanford CS224n

32 of 70

Chain Rule

Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:

\[ \frac{dy}{dx} = \frac{dy}{du} \, \frac{du}{dx} \]

32

Simple example:
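One worked one-variable instance of the rule (an illustrative example, not necessarily the one pictured on the original slide):

\[ y = (x^3 + 7)^4, \quad u = x^3 + 7 \;\Rightarrow\; \frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx} = 4u^3 \cdot 3x^2 = 12x^2 (x^3 + 7)^3 \]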

credit: Stanford CS224n

33 of 70

Word2vec derivations of gradient

Our objective function (from before):

\[ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta), \qquad P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} \]

33

Gradient for the center word (derivation written out below):

\[ \frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c) \, u_w \]
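Written out step by step (standard chain-rule algebra, using u_o, v_c, and the softmax P(o | c) defined above):

\[
\begin{aligned}
\frac{\partial}{\partial v_c} \log P(o \mid c)
&= \frac{\partial}{\partial v_c} \Big( u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c) \Big) \\
&= u_o - \frac{\sum_{w \in V} \exp(u_w^\top v_c)\, u_w}{\sum_{x \in V} \exp(u_x^\top v_c)} \\
&= u_o - \sum_{w \in V} P(w \mid c)\, u_w
\end{aligned}
\]

That is, the gradient is the observed outside vector minus the expected outside vector under the current model.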

credit: Stanford CS224n

34 of 70

Word2vec derivations of gradient

34

credit: Stanford CS224n

35 of 70

Calculating all gradients!

We went through the gradient for each center vector v in a window

We also need gradients for outside vectors u: Do it on your own!

Generally, in each window we will compute updates for all parameters that are being used in that window.

35

credit: Stanford CS224n

36 of 70

Word2vec maximizes objective function by putting similar words nearby in space

36

credit: Stanford CS224n

37 of 70

Word2vec: More details

Why two vectors? → Easier optimization. Average both at the end.

Two model variants:

1. Skip-grams (SG)

Predict context (“outside”) words (position independent) given center word

2. Continuous Bag of Words (CBOW)

Predict center word from (bag of) context words

This lecture so far: Skip-gram model

Additional efficiency in training:

1. Negative sampling

So far: Focus on naïve softmax (simpler but more expensive training method)

37

credit: Stanford CS224n

38 of 70

The skip-gram model with negative sampling

The normalization term is computationally expensive

Hence, in standard word2vec, you implement the skip-gram model with negative sampling

Main idea: train binary logistic regressions for a true pair (center word and a word in its context window) versus several “noise” pairs (the center word paired with a random word)
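A minimal numpy sketch of these binary logistic scores, assuming v_c is the center vector, u_o the true outside vector, and U_neg a (k × d) matrix of sampled noise vectors; the returned value is the negated form of the usual skip-gram negative-sampling objective, so that smaller is better.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, U_neg):
    """Negative-sampling loss for one true (center, outside) pair plus k noise words."""
    pos = np.log(sigmoid(u_o @ v_c))            # push the true pair's score up
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()   # push the k noise pairs' scores down
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 50, 5
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```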

38

credit: Stanford CS224n

39 of 70

The skip-gram model with negative sampling

From paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

Overall objective function with negative sampling (to maximize, for one center/outside pair):

\[ J(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} \big[ \log \sigma(-u_j^\top v_c) \big] \]

39

credit: Stanford CS224n

We take k negative samples (using word probabilities)

Maximize probability that real outside word appears; minimize probability that random words appear around center word

Sample with P(w) = U(w)^{3/4} / Z, the unigram distribution U(w) raised to the 3/4 power (Z is the normalizing constant).

  • The 3/4 power makes less frequent words be sampled more often than their raw frequency would suggest (see the sketch below)
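A minimal sketch of this sampling distribution, assuming `counts` holds raw unigram counts for a toy vocabulary:

```python
import numpy as np

counts = np.array([90, 7, 2, 1], dtype=float)   # toy unigram counts U(w)
probs = counts ** 0.75
probs /= probs.sum()                            # divide by Z to get a distribution

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=probs)   # draw k = 5 negative samples
print(probs)        # rare words get relatively more mass than counts / counts.sum()
print(negatives)
```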

40 of 70

Stochastic gradients with negative sampling

We iteratively take gradients at each window for SGD

In each window, we only have at most 2m + 1 words plus 2km negative words with negative sampling, so ∇J is very sparse!

40

credit: Stanford CS224n

41 of 70

Stochastic gradients with negative sampling

We might only update the word vectors that actually appear!

Solution: either you need sparse matrix update operations to only update certain rows of full embedding matrices U and V, or you need to keep around a hash for word vectors

If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!
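A minimal sketch of the row-wise update idea, assuming U is a full embedding matrix and `window_ids` indexes only the words that actually occur in the current window; the names and sizes are illustrative.

```python
import numpy as np

def sparse_update(M, row_ids, row_grads, lr=0.025):
    """Update only the embedding rows that were used in this window."""
    for i, g in zip(row_ids, row_grads):
        M[i] -= lr * g

rng = np.random.default_rng(0)
U = rng.normal(size=(50_000, 100))            # toy full outside-vector matrix
window_ids = [7, 42, 1001]                    # the few rows touched by this window
row_grads = [rng.normal(size=100) for _ in window_ids]
sparse_update(U, window_ids, row_grads)
```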

41

credit: Stanford CS224n

42 of 70

Poll

Which is not correct?

  • Word vectors are a discrete representation of words
  • Word vectors are also called (word) embeddings or word representations
  • Training word vectors typically requires a large corpus
  • Skip-grams predict context words given the center word

42

43 of 70

Poll

Which is not correct?

  • Word vectors are a discrete representation of words
  • Word vectors are also called (word) embeddings or word representations
  • Training word vectors typically requires a large corpus
  • Skip-grams predict context words given the center word

43

44 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

44

45 of 70

Why not capture co-occurrence counts directly?

There’s something weird about iterating through the whole corpus (perhaps many times); why don’t we just accumulate all the statistics of what words appear near each other?!?

Building a co-occurrence matrix X

  • 2 options: windows vs. full document
  • Window: Similar to word2vec, use window around each word → captures some syntactic and semantic information (“word space”)
  • Word-document co-occurrence matrix will give general topics (all sports terms will have similar entries) leading to “Latent Semantic Analysis” (“document space”)

45

credit: Stanford CS224n

46 of 70

Example: Window based co-occurrence matrix

Window length 1 (more common: 5–10)

Symmetric (irrelevant whether left or right context)

Example corpus (a sketch that builds the matrix follows the list):

  • I like deep learning
  • I like NLP
  • I enjoy flying
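A minimal sketch that builds this window-1, symmetric co-occurrence matrix for the example corpus; the variable names are illustrative.

```python
import numpy as np

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
docs = [s.split() for s in corpus]

vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for d in docs:
    for t, w in enumerate(d):
        for j in range(max(0, t - window), min(len(d), t + window + 1)):
            if j != t:
                X[idx[w], idx[d[j]]] += 1    # symmetric: left and right contexts both count

print(vocab)
print(X)
```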

46

credit: Stanford CS224n

47 of 70

Co-occurrence vectors

Simple count co-occurrence vectors

  • Vectors increase in size with vocabulary
  • Very high dimensional: require a lot of storage (though sparse)
  • Subsequent classification models have sparsity issues → Models are less robust

Low-dimensional vectors

  • Idea: store “most” of the important information in a fixed, small number of dimensions → a dense vector
  • Usually 25–1000 dimensions, similar to word2vec
  • How to reduce the dimensionality?

47

credit: Stanford CS224n

48 of 70

Classic method: Dimensionality reduction on X

Singular Value Decomposition of co-occurrence matrix X

Factorizes X into UΣVᵀ, where U and V are orthonormal

48

credit: Stanford CS224n

Retain only k singular values, in order to generalize.

The truncated matrix is the best rank-k approximation to X, in terms of least squares.

Classic linear algebra result. Expensive to compute for large matrices.
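A minimal numpy sketch of this truncation, reusing a small co-occurrence matrix X (e.g., the one built above) and keeping the top k singular values; the rows of U_k Σ_k are then used as k-dimensional word vectors.

```python
import numpy as np

def svd_word_vectors(X, k=2):
    """Reduce a (V x V) co-occurrence matrix to (V x k) dense word vectors."""
    U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
    # U_k Sigma_k V_k^T is the best rank-k least-squares approximation of X;
    # the rows of U_k Sigma_k serve as the word vectors.
    return U[:, :k] * S[:k]

X = np.array([[0, 2, 1],
              [2, 0, 0],
              [1, 0, 0]], dtype=float)   # tiny toy co-occurrence matrix
print(svd_word_vectors(X, k=2))
```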

49 of 70

Hacks to X (several used in Rohde et al. 2005 in COALS)

Running an SVD on raw counts doesn’t work well!!!

Scaling the counts in the cells can help a lot

  • Problem: function words (the, he, has) are too frequent → syntax has too much impact. Some fixes:
    • log the frequencies
    • min(X, t), with t ≈ 100
    • Ignore the function words

Ramped windows that count closer words more than further away words

Use Pearson correlations instead of counts, then set negative values to 0

49

credit: Stanford CS224n

50 of 70

Scaling the counts in the cells can help a lot

Problem: function words (the, he, has) are too frequent → syntax has too much impact

Pointwise mutual information (PMI):

\[ \mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \]
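A minimal sketch of PMI applied to a co-occurrence count matrix; the clipping at zero is the common positive-PMI (PPMI) variant, since raw PMI is negative for pairs that co-occur less than chance would predict.

```python
import numpy as np

def ppmi(X, eps=1e-8):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    total = X.sum()
    p_xy = X / total                            # joint probabilities
    p_x = X.sum(axis=1, keepdims=True) / total  # row-word marginals
    p_y = X.sum(axis=0, keepdims=True) / total  # column-word marginals
    pmi = np.log2(p_xy / (p_x * p_y) + eps)     # eps avoids log(0) for zero counts
    return np.maximum(pmi, 0.0)                 # clip negative values: PPMI

X = np.array([[0, 2, 1], [2, 0, 0], [1, 0, 0]], dtype=float)
print(ppmi(X))
```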

50

credit: Deep learning from scratch

51 of 70

Interesting semantic patterns emerge in the scaled vectors

51

credit: Stanford CS224n

52 of 70

Towards GloVe: Count based vs. direct prediction

52

Count-based: LSA, HAL (Lund & Burgess), COALS, Hellinger-PCA (Rohde et al.; Lebret & Collobert)

  • Fast training
  • Efficient usage of statistics
  • Primarily used to capture word similarity
  • Disproportionate importance given to large counts

Direct prediction: Skip-gram/CBOW (Mikolov et al.), NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)

  • Scales with corpus size
  • Inefficient usage of statistics
  • Generates improved performance on other tasks
  • Can capture complex patterns beyond word similarity

credit: Stanford CS224n

53 of 70

Encoding meaning components in vector differences

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

53

credit: Stanford CS224n

54 of 70

Encoding meaning components in vector differences

Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space (i.e., embedding ratios of co-occurrence probabilities)?

A: Log-bilinear model: w_i · w_j = log P(i | j)

54

with vector differences: w_x · (w_a − w_b) = log [ P(x | a) / P(x | b) ]

credit: Stanford CS224n

55 of 70

Combining the best of both worlds: GloVe

Fast training

Scalable to huge corpora

Good performance even with small corpus and small vectors
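For reference, the objective GloVe optimizes (Pennington et al. 2014) is a weighted least-squares fit of vector dot products to log co-occurrence counts:

\[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]

where f(X_{ij}) is a weighting function that caps the influence of very large counts.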

55

credit: Stanford CS224n

56 of 70

GloVe results

Nearest words to frog:

1. frogs

2. toad

3. litoria

4. leptodactylidae

5. rana

6. lizard

7. eleutherodactylus

56

credit: Stanford CS224n

57 of 70

Poll

Which is correct?

  • Count-based co-occurrence matrices contain low dimensional word vectors
  • Singular value decomposition is a computationally-efficient way of reducing the dimensions of word vectors
  • Pointwise mutual information outputs positive values
  • Learning the statistics of co-occurrence is similar to constructing a co-occurrence matrix

57

58 of 70

Poll

Which is correct?

  • Count-based co-occurrence matrices contain low dimensional word vectors
  • Singular value decomposition is a computationally-efficient way of reducing the dimensions of word vectors
  • Pointwise mutual information outputs positive values
  • Learning the statistics of co-occurrence is similar to constructing a co-occurrence matrix

58

59 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

59

60 of 70

How to evaluate word vectors?

Related to general evaluation in NLP: Intrinsic vs. extrinsic

Intrinsic:

  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation to real task is established

Extrinsic:

  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear if the subsystem is the problem, or its interaction with other subsystems
  • If replacing exactly one subsystem with another improves accuracy ⇒ Winning!

60

credit: Stanford CS224n

61 of 70

Intrinsic word vector evaluation

Word Vector Analogies ⇒ man:woman :: king:?

  • Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions

Problem: What if the information is there but not linear?
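A minimal sketch of the analogy test, assuming `vecs` is a dict mapping words to numpy vectors (e.g., loaded from pretrained GloVe or word2vec files); it returns the vocabulary word whose vector is closest, by cosine similarity, to vec(b) − vec(a) + vec(c), excluding the three query words.

```python
import numpy as np

def analogy(a, b, c, vecs):
    """Solve a : b :: c : ?  via the nearest cosine neighbor of  vec(b) - vec(a) + vec(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vecs.items():
        if w in (a, b, c):
            continue                              # exclude the query words themselves
        sim = (v @ target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Usage with hypothetical embeddings: analogy("man", "woman", "king", vecs) -> "queen"
```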

61

credit: Stanford CS224n

62 of 70

Intrinsic word vector evaluation

Korean word2vec: http://w.elnn.kr/search/

62

credit: Stanford CS224n

63 of 70

Another intrinsic word vector evaluation

Word vector distances and their correlation with human judgments

Example dataset: WordSim353 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
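A minimal sketch of this evaluation, assuming `pairs` holds (word1, word2, human_score) triples such as those in WordSim353 and `vecs` maps words to numpy vectors; the reported score is the Spearman correlation between model cosine similarities and human judgments.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_eval(pairs, vecs):
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in vecs and w2 in vecs:            # skip out-of-vocabulary pairs
            v1, v2 = vecs[w1], vecs[w2]
            model.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            human.append(score)
    return spearmanr(model, human).correlation   # higher = closer to human judgments

# Usage with hypothetical data: similarity_eval(wordsim353_pairs, vecs)
```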

63

credit: Stanford CS224n

64 of 70

Extrinsic word vector evaluation

One example where good word vectors should help directly: named entity recognition

64

credit: Stanford CS224n

65 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

65

66 of 70

Word senses and word sense ambiguity

Most words have lots of meanings!

  • Especially common words
  • Especially words that have existed for a long time

Example: pike

Does one vector capture all these meanings or do we have a mess?

66

credit: Stanford CS224n

67 of 70

pike

  • A sharp point or staff
  • A type of elongated fish
  • A railroad line or system
  • A type of road
  • The future (coming down the pike)
  • A type of body position (as in diving)
  • To kill or pierce with a pike
  • To make one’s way (pike along)
  • In Australian English, pike means to pull out from doing something: I reckon he could have climbed that cliff, but he piked!

67

credit: Stanford CS224n

68 of 70

Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

• Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters: bank1, bank2, etc. (see the sketch below)
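A minimal sketch of the clustering step (an illustration of the idea, not Huang et al.'s exact pipeline), assuming `context_vecs` stacks averaged context-window vectors for many occurrences of one word such as "bank"; each occurrence is then relabeled bank_0, bank_1, … by cluster before retraining.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
context_vecs = rng.normal(size=(200, 50))   # toy stand-in: 200 occurrences of "bank", d = 50

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(context_vecs)
relabeled = [f"bank_{k}" for k in labels]   # bank_0, bank_1, bank_2, ...
print(relabeled[:10])
```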

68

credit: Stanford CS224n

69 of 70

Contents

  • Motivation
  • Word vectors
  • Training word vectors
  • Count-based methods
  • Evaluating word vectors
  • Word senses
  • Summary

69

70 of 70

Summary

Today we

  • talked about the motivation of word vectors
  • discussed how we train word vectors
  • had a look at evaluating word vectors
  • introduced some future research directions

70