1 of 46

Word Embedding Explained and Visualized

Xin Rong

School of Information

University of Michigan

a2-dlearn - Nov 7th, 2015

2 of 46

About word2vec...

Two original papers published in association with word2vec by Mikolov et al. (2013)



5 of 46

[Figure: word2vec model architectures (CBOW and Skip-gram)]


6 of 46

Christopher Manning, Deep Learning Summer School (2015)


7 of 46

Atomic Word Representation

Each word is represented by its identity (a one-hot vector):

orange: [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0]
apple:  [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0]
car:    [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0]

Having seen "apple juice", can the model say anything about "_____ juice"?
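To make the problem concrete, here is a minimal Python sketch (the toy vocabulary and indices are made up) showing that one-hot vectors carry no notion of similarity: every pair of distinct words has dot product zero, so seeing "apple juice" tells the model nothing about "orange juice".

```python
import numpy as np

# A toy vocabulary; the indices are made up.
vocab = {"apple": 0, "orange": 1, "car": 2}

def one_hot(word, size=len(vocab)):
    v = np.zeros(size)
    v[vocab[word]] = 1.0
    return v

# Every pair of distinct one-hot vectors has dot product 0:
# the representation encodes identity only, not similarity.
print(one_hot("apple") @ one_hot("orange"))  # 0.0
print(one_hot("apple") @ one_hot("car"))     # 0.0
```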


8 of 46

Distributed Representation

Each word is represented by a continuous pattern of activation levels across many units.

[Figure: activation patterns for apple, orange, and car, with units ranging from inhibited to excited]
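A minimal sketch of the same three words as distributed representations; the activation values below are made up for illustration, but in a trained model related words end up with similar patterns, which cosine similarity can measure.

```python
import numpy as np

# Made-up activation patterns over four units for three words.
vectors = {
    "apple":  np.array([ 0.9,  0.8, -0.1,  0.7]),
    "orange": np.array([ 0.8,  0.9, -0.2,  0.6]),
    "car":    np.array([-0.7,  0.1,  0.9, -0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors["apple"], vectors["orange"]))  # close to 1
print(cosine(vectors["apple"], vectors["car"]))     # much lower
```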


9 of 46

Contextual Representation

A word is represented by the contexts in which it is used.

I eat an apple every day.

I eat an orange every day.

I like driving my car to work.
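A small sketch (window size and tokenization chosen arbitrarily) that collects the context words around each target word in the three example sentences; apple and orange end up with identical contexts, while car does not.

```python
from collections import defaultdict

sentences = [
    "i eat an apple every day".split(),
    "i eat an orange every day".split(),
    "i like driving my car to work".split(),
]

window = 2
contexts = defaultdict(set)
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                contexts[word].add(sent[j])

print(contexts["apple"])   # {'eat', 'an', 'every', 'day'}
print(contexts["orange"])  # {'eat', 'an', 'every', 'day'}  -> identical contexts
print(contexts["car"])     # {'driving', 'my', 'to', 'work'}
```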


10 of 46

Word Vectors

[Figure: word vectors visualized in two dimensions: apple, orange, banana, rice, milk, juice, bus, car, train]


11 of 46

Word Analogy

[Figure: word-analogy pairs: king : queen, man : woman, uncle : aunt]

Mikolov, Chen, et al. (2013); Mikolov, Sutskever, et al. (2013)
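A minimal sketch of the analogy arithmetic with made-up 2-d vectors: one coordinate loosely encodes gender and the other royalty, so king - man + woman lands closest to queen.

```python
import numpy as np

# Made-up 2-d vectors: first coordinate ~ "male vs. female", second ~ "royal vs. common".
vecs = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.1, 0.9]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "uncle": np.array([0.9, 0.5]),
    "aunt":  np.array([0.1, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "man is to king as woman is to ?"
query = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], query))
print(best)  # queen
```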


12 of 46

Applications that benefit from word embeddings

  • Dependency parsing
  • Named entity recognition
  • Document classification
  • Sentiment analysis
  • Paraphrase detection
  • Word clustering
  • Machine translation

[Table screenshot: "Comparison and combination of models on the Microsoft Sentence Completion Challenge", from "Efficient Estimation of Word Representations in Vector Space", Mikolov et al. (2013)]


13 of 46

word2vec as a (powerful) black box

[Diagram: a text corpus (e.g., Wikipedia) goes into word2vec; word vectors come out]
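Treating word2vec as a black box in practice usually means calling a library. Below is a minimal sketch using the gensim library (assuming gensim 4.x parameter names; the toy corpus stands in for something like a tokenized Wikipedia dump).

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be a large tokenized corpus,
# one list of words per sentence.
sentences = [
    ["i", "eat", "an", "apple", "every", "day"],
    ["i", "eat", "an", "orange", "every", "day"],
    ["i", "like", "driving", "my", "car", "to", "work"],
]

# Black-box usage: sentences in, vectors out.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(model.wv["apple"][:5])                   # a dense 50-dimensional vector (first 5 values)
print(model.wv.similarity("apple", "orange"))  # typically higher than...
print(model.wv.similarity("apple", "car"))     # ...the apple/car similarity
```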


14 of 46

word2vec decomposed

  • Input Corpus -> Vocabulary Builder (Lossy Counting, Pruning) -> Vocabulary
  • Vocabulary + Corpus -> Context Builder (Sentence Windows, Dynamic Window Scaling, Subsampling) -> Input Words and Output Words
  • Input/Output Words -> Parameter Learner (Backpropagation; CBOW / Skip-gram; Hierarchical Softmax / Negative Sampling) -> Vectors (Final Product)



16 of 46

Basic Neuron Structure

a neuron


17 of 46

Training a Single Neuron

[Figure: a single neuron with input x and training target t]
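A minimal sketch of training one sigmoid neuron with gradient descent: input x, prediction y, target t. The toy inputs and labels anticipate the "edible?" task on the next slide; the sizes, learning rate, and iteration count are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy task (see the lookup table on the next slide): one-hot item in, edible? out.
items = ["apple", "orange", "car", "paper"]
X = np.eye(len(items))                       # one one-hot input vector per item
t = np.array([1.0, 1.0, 0.0, 0.0])           # targets: Y, Y, N, N

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=len(items))   # one weight per input unit
b = 0.0
lr = 0.5

for _ in range(2000):
    y = sigmoid(X @ w + b)                   # forward pass: prediction y
    e = y - t                                # prediction error against target t
    w -= lr * (X.T @ e) / len(items)         # gradient step (cross-entropy loss)
    b -= lr * e.mean()

print(dict(zip(items, sigmoid(X @ w + b).round(2))))  # apple/orange near 1, car/paper near 0
```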


18 of 46

Task Afforded by a Single Neuron

Item     Edible?
apple    Y
orange   Y
car      N
...      ...
paper    N

A lookup table


19 of 46

Multilayer Neural Network


20 of 46

Backpropagation

[Figure: a multilayer network with input x and training target t]
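A minimal numpy sketch of backpropagation in a two-layer network, trained here on XOR (a task a single neuron cannot solve). The architecture, data, and hyperparameters are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR data: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
W1 = rng.normal(scale=1.0, size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(scale=1.0, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 1.0

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                 # forward pass, hidden layer
    y = sigmoid(h @ W2 + b2)                 # forward pass, output layer
    e2 = y - t                               # error at the output (cross-entropy gradient)
    e1 = (e2 @ W2.T) * h * (1 - h)           # error propagated back through the hidden layer
    W2 -= lr * h.T @ e2 / len(X); b2 -= lr * e2.mean(axis=0)
    W1 -= lr * X.T @ e1 / len(X); b1 -= lr * e1.mean(axis=0)

print(y.round(2).ravel())                    # should approach [0, 1, 1, 0] as training converges
```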


21 of 46

word2vec network

Structure Highlights:

  • input layer
    • one-hot vector

  • hidden layer
    • linear (identity)

  • output layer
    • softmax
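A minimal sketch of the forward pass implied by these highlights: the one-hot input selects one row of the input weight matrix (the linear, identity hidden layer), and a softmax over the output scores gives a distribution over the vocabulary. Sizes, weights, and the chosen word index are made up.

```python
import numpy as np

# Made-up sizes: V words in the vocabulary, N-dimensional embeddings.
rng = np.random.default_rng(0)
V, N = 10, 4
W_in = rng.normal(scale=0.1, size=(V, N))    # input -> hidden weights (rows = input vectors)
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights (columns = output vectors)

x = np.zeros(V)
x[3] = 1.0                                   # one-hot vector for the input word (index 3, arbitrary)

h = x @ W_in                                 # linear (identity) hidden layer: just row 3 of W_in
u = h @ W_out                                # one raw score per vocabulary word
y = np.exp(u - u.max()); y /= y.sum()        # softmax output: P(output word | input word)

print(y.round(3), y.sum())                   # a probability distribution over the V words
```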


22 of 46

word2vec network


23 of 46

neural network and weight matrices


24 of 46

neural network and word vectors


25 of 46

Training: updating word vectors
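A self-contained sketch of one training step with the full softmax; the word indices, sizes, and learning rate are made up. The comments spell out the intuition discussed on the next two slides: output vectors are pulled toward or pushed away from the hidden vector in proportion to the prediction error, and the input vector moves by the error-weighted sum of the output vectors.

```python
import numpy as np

# Made-up sizes and indices: input word 3, observed output word 5.
rng = np.random.default_rng(0)
V, N, lr = 10, 4, 0.1
W_in = rng.normal(scale=0.1, size=(V, N))
W_out = rng.normal(scale=0.1, size=(N, V))
in_idx, out_idx = 3, 5

# Forward pass (as on the previous slides)
h = W_in[in_idx]                             # hidden layer = the input word's vector
u = h @ W_out
y = np.exp(u - u.max()); y /= y.sum()        # predicted distribution over the vocabulary

# Backward pass (cross-entropy loss) and SGD update
e = y.copy(); e[out_idx] -= 1.0              # prediction error y - t
grad_in = W_out @ e                          # gradient w.r.t. h, taken before W_out changes
W_out -= lr * np.outer(h, e)                 # output vectors: the target word's vector is pulled
                                             # toward h, over-predicted words are pushed away
W_in[in_idx] -= lr * grad_in                 # input vector: moved by the error-weighted sum
                                             # of all output vectors
```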


26 of 46

Intuition of output vector update rule


27 of 46

Intuitive Understanding of Input Vectors


28 of 46

Resemblance to a force-directed graph

equilibrium length decreases with the strength of co-occurrence (strongly co-occurring words settle closer together)

more on this later...


29 of 46

word2vec decomposed (recap: Input Corpus -> Vocabulary Builder -> Context Builder -> Parameter Learner -> Vectors)


30 of 46

How do we select input and output words?

Method 1: continuous bag-of-words (CBOW)

eat an apple every day

Method 2: skip-gram (SG)

eat an apple every day
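A small sketch of how the two methods pair up inputs and outputs for the window "eat an apple every day", taking "apple" as the center word (window size chosen to cover the whole phrase).

```python
# Window "eat an apple every day", with "apple" as the center word.
sentence = "eat an apple every day".split()
center = 2
context = [w for i, w in enumerate(sentence) if i != center]

# CBOW: the context words together predict the center word.
cbow_pair = (context, sentence[center])
print(cbow_pair)   # (['eat', 'an', 'every', 'day'], 'apple')

# Skip-gram: the center word predicts each context word separately.
sg_pairs = [(sentence[center], w) for w in context]
print(sg_pairs)    # [('apple', 'eat'), ('apple', 'an'), ('apple', 'every'), ('apple', 'day')]
```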


31 of 46

wevi demo (CBOW and SG)


32 of 46

word2vec decomposed (recap: Input Corpus -> Vocabulary Builder -> Context Builder -> Parameter Learner -> Vectors)


33 of 46

Training a Generic Softmax is Intractable

need to update every single output vector!


34 of 46

Hierarchical Softmax
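A minimal sketch of the idea behind hierarchical softmax, with a hand-built toy binary tree over four words: the probability of a word is a product of sigmoid decisions along its path, so scoring (and updating) one word touches only O(log V) inner-node vectors rather than all V output vectors. The tree, paths, and vectors here are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical toy binary tree over a 4-word vocabulary.
# Each word is reached by a path of (inner_node_id, go_left?) decisions;
# only the inner nodes on that path carry a vector.
paths = {
    "apple":  [(0, True),  (1, True)],
    "orange": [(0, True),  (1, False)],
    "car":    [(0, False), (2, True)],
    "train":  [(0, False), (2, False)],
}

rng = np.random.default_rng(0)
dim = 8
inner_vecs = rng.normal(scale=0.1, size=(3, dim))   # one vector per inner node
h = rng.normal(size=dim)                            # hidden-layer vector for some input word

def hs_prob(word):
    """P(word | input) as a product of sigmoids along the tree path."""
    p = 1.0
    for node, go_left in paths[word]:
        s = sigmoid(inner_vecs[node] @ h)
        p *= s if go_left else (1.0 - s)
    return p

print({w: round(hs_prob(w), 4) for w in paths})
print("probabilities sum to", sum(hs_prob(w) for w in paths))  # ~1.0 by construction
```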


35 of 46

Negative Sampling
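A minimal sketch of one skip-gram negative-sampling update, with made-up matrices and indices; the real implementation draws negatives from a smoothed (unigram^0.75) distribution and skips the true context word, but the key point is that only k+1 output vectors are touched instead of all V.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
V, dim, k, lr = 1000, 50, 5, 0.025           # vocab size, dimension, negatives, learning rate
W_in = rng.normal(scale=0.01, size=(V, dim))
W_out = np.zeros((V, dim))

center, context = 3, 7                       # indices of the observed (input, output) pair
negatives = rng.integers(0, V, size=k)       # toy stand-in for the unigram^0.75 sampling table

h = W_in[center]
grad_h = np.zeros(dim)
for j, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
    # Only k+1 output vectors are updated, instead of all V under the full softmax.
    e = sigmoid(W_out[j] @ h) - label        # prediction error for this (word, label) pair
    grad_h += e * W_out[j]
    W_out[j] -= lr * e * h
W_in[center] -= lr * grad_h
```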


36 of 46

word2vec decomposed (recap: Input Corpus -> Vocabulary Builder -> Context Builder -> Parameter Learner -> Vectors)


37 of 46

Interpreting word embedding model

Co-occurrence Matrix
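A minimal count-based sketch on a toy corpus: build the word-by-word co-occurrence matrix, reweight it with positive PMI, and take a truncated SVD to get word vectors. Levy & Goldberg (next slide) show that skip-gram with negative sampling implicitly factorizes a closely related, shifted PMI matrix. The corpus, window size, and rank are made up.

```python
import numpy as np

corpus = [
    "i eat an apple every day".split(),
    "i eat an orange every day".split(),
    "i like driving my car to work".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Word-by-word co-occurrence counts within a +/-2 window.
window = 2
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information (PPMI) reweighting.
total = C.sum()
p_w = C.sum(axis=1, keepdims=True) / total
p_c = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (p_w * p_c))
ppmi = np.maximum(pmi, 0)

# Rank-k factorization: rows of U * sqrt(S) serve as word vectors.
U, S, _ = np.linalg.svd(ppmi)
k = 5
word_vectors = U[:, :k] * np.sqrt(S[:k])
print(word_vectors[idx["apple"]].round(2))
```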


38 of 46

Levy & Goldberg. "Neural word embedding as implicit matrix factorization." NIPS 2014.


39 of 46

Levy, Goldberg & Dagan. "Improving Distributional Similarity with Lessons Learned from Word Embeddings." TACL (2015)


40 of 46

Other neural embedding models

  • Bengio et al. (2003)
  • Mnih & Hinton (2008)
  • Collobert & Weston (2008)
  • Mnih et al. (2013)
  • GloVe by Pennington et al. (2014)
  • DeepWalk by Perozzi et al. (2014)
  • LINE by Tang et al. (2015)


41 of 46

More on Word Analogy

Rohde, Gonnerman & Plaut (2005), "An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence"


42 of 46

wevi demo (word analogy)


43 of 46

"Miscellaneous" Embedding

Alternative input/output settings (in place of the standard Word -> Word):

  • User -> Tweet
  • Customer -> Product
  • Word -> Parse-tree Neighbors
  • Entity -> Neighbor in a list
  • Node -> Random walk path in a graph


44 of 46

Limitations

  • Word ambiguity
  • Debuggability
  • Sequence


45 of 46

Take-aways

  • Word embedding techniques perform either explicit or implicit matrix factorization of the word co-occurrence matrix.
  • The neural network underlying word2vec is a feedforward network with one hidden layer that uses a linear (identity) activation function.
  • word2vec is trained with backpropagation and stochastic gradient descent.
  • The loss function is cross entropy.
  • Training is made feasible by using either hierarchical softmax or negative sampling.
  • Word embeddings of quality comparable to word2vec's can be obtained via count-based methods with carefully tuned co-occurrence metrics.


46 of 46

Thanks!
