1 of 96

Natural Language Processing

2 of 96

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.

3 of 96

What is natural language processing used for?

Some of the main functions that natural language processing algorithms perform are:

Text classification. This involves assigning tags to texts to put them in categories. This can be useful for sentiment analysis, which helps the natural language processing algorithm determine the sentiment, or emotion, behind a text. For example, when brand A is mentioned in X number of texts, the algorithm can determine how many of those mentions were positive and how many were negative. It can also be useful for intent detection, which helps predict what the speaker or writer may do based on the text they are producing.

Text extraction. This involves automatically summarizing text and finding important pieces of data.

4 of 96

One example of this is keyword extraction, which pulls the most important words from the text and can be useful for search engine optimization. Doing this with natural language processing requires some programming -- it is not completely automated.

However, there are plenty of simple keyword extraction tools that automate most of the process -- the user just has to set parameters within the program.

For example, a tool might pull out the most frequently used words in the text. Another example is named entity recognition, which extracts the names of people, places and other entities from text.
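The frequency-based keyword extraction mentioned above can be sketched in a few lines. This is only an illustration: the function name `top_keywords`, the tiny stop-word list, and the sample sentence are assumptions, not part of any particular tool.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def top_keywords(text, k=3):
    """Return the k most frequent non-stop-word tokens, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]

sample = "Search engines rank pages. Good pages help search engines, and search is key."
print(top_keywords(sample, k=1))  # ['search']
```

Real tools usually add better tokenization and weighting, but the user-set parameters (here `k` and the stop-word list) play the same role.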

5 of 96

Machine translation. This is the process by which a computer translates text from one language, such as English, to another language, such as French, without human intervention.

Natural language generation. This involves using natural language processing algorithms to analyze unstructured data and automatically produce content based on that data. One example of this is in language models such as GPT-3, which are able to analyze an unstructured text and then generate believable articles based on the text.

6 of 96

The functions listed above are used in a variety of real-world applications, including:

customer feedback analysis -- where AI analyzes social media reviews;

customer service automation -- where voice assistants on the other end of a customer service phone line are able to use speech recognition to understand what the customer is saying, so that it can direct the call correctly;

automatic translation -- using tools such as Google Translate, Bing Translator and Translate Me;

academic research and analysis -- where AI is able to analyze huge amounts of academic material and research papers not just based on the metadata of the text, but the text itself;

7 of 96

analysis and categorization of medical records -- yielding insights to predict, and ideally prevent, disease;

word processors used for plagiarism detection and proofreading -- using tools such as Grammarly and Microsoft Word;

stock forecasting and insights into financial trading -- using AI to analyze market history and 10-K documents, which contain comprehensive summaries about a company's financial performance;

talent recruitment in human resources;

automation of routine litigation tasks -- one example is the artificially intelligent attorney.

Research being done on natural language processing revolves around search, especially enterprise search. This involves having users query data sets in the form of a question that they might pose to another person. The machine interprets the important elements of the human language sentence, which correspond to specific features in a data set, and returns an answer.

8 of 96

NLP can be used to interpret free, unstructured text and make it analyzable. There is a tremendous amount of information stored in free text files, such as patients' medical records. Before deep learning-based NLP models, this information was inaccessible to computer-assisted analysis and could not be analyzed in any systematic way. With NLP, analysts can sift through massive amounts of free text to find relevant information.

Sentiment analysis is another primary use case for NLP. Using sentiment analysis, data scientists can assess comments on social media to see how their business's brand is performing, or review notes from customer service teams to identify areas where people want the business to perform better.

10 of 96

Word Embeddings

11 of 96

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis.

Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.[1]

Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.

12 of 96

What is going to happen:

  • Why do we need word representations?
  • One-hot Vectors
  • Distributional Semantics
  • Count-Based Methods
  • Word2Vec (Prediction-based Method)
  • GloVe
  • Evaluation
  • Analysis and Interpretability

17 of 96

Why do we need word representations?

Text (your input): I saw a cat.

Sequence of tokens: I | saw | a | cat | .

Word representation - vector (input for your model/algorithm)

Your algorithm (e.g., neural network): any algorithm for solving a task

23 of 96

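How it works: Look-up Table

Tokens: I | saw | a | cat | .

Token index in the vocabulary: 39 1592 10 2548 5

“Look up” a token embedding in the table: the table stores one vector of size “embedding dimension” for each of the “vocabulary size” tokens, and each token index selects its vector.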
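The look-up can be sketched with NumPy indexing. The table sizes and the random values are assumptions for illustration, as is the choice to store one embedding per row; only the token indices come from the slide.

```python
import numpy as np

vocab_size, emb_dim = 3000, 4          # assumed sizes, for illustration only
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, emb_dim))  # one row per vocabulary entry

token_ids = [39, 1592, 10, 2548, 5]    # indices from the slide: I, saw, a, cat, .
vectors = embedding_table[token_ids]   # "look up" = select the rows with these indices

print(vectors.shape)  # (5, 4)
```

Each token in the sentence ends up with its own `emb_dim`-sized vector, which is what the downstream model consumes.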

24 of 96

Note UNKs: Out-of-Vocabulary Tokens

Vocabulary is chosen in advance. Therefore, some tokens may be “unknown” – you can use a special token for them.

Input tokens: I | saw | a | &%! | .   (“&%!” is not in the vocabulary)

After replacement: I | saw | a | UNK | .
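A minimal sketch of the UNK replacement, assuming a toy vocabulary whose index values are made up for this example (the name `encode` is also hypothetical):

```python
# Hypothetical toy vocabulary; the index values are made up for this sketch.
vocab = {"UNK": 0, "I": 1, "saw": 2, "a": 3, "cat": 4, ".": 5}

def encode(tokens, vocab):
    """Map tokens to indices, falling back to the UNK index for unknown tokens."""
    unk_id = vocab["UNK"]
    return [vocab.get(tok, unk_id) for tok in tokens]

print(encode(["I", "saw", "a", "&%!", "."], vocab))  # [1, 2, 3, 0, 5]
```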

25 of 96

How can we get word representations?

I saw a cat.

Tokens: I | saw | a | cat | .

Your algorithm (e.g., neural network)

In the following: How can we get these representations?

27 of 96

One-hot Vectors

31 of 96

One-Hot Vectors: Represent Words as Discrete Symbols

dog:   0…010…0….0…0
cat:   0…0…010….0…0
table: 0…0…0….0010…

One is 1, the rest are 0

Embedding dimension = vocabulary size

Problems:

  • Vector size is too large
  • Vectors know nothing about meaning, e.g., cat is as close to dog as it is to table!
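The second problem can be checked directly. The sketch below assumes a toy three-word vocabulary, so the vectors have only three dimensions; with a real vocabulary the same holds for any pair of distinct words.

```python
import numpy as np

vocab = ["dog", "cat", "table"]   # toy vocabulary, so the embedding dimension is 3

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

d_cat_dog = np.linalg.norm(one_hot("cat") - one_hot("dog"))
d_cat_table = np.linalg.norm(one_hot("cat") - one_hot("table"))
print(d_cat_dog == d_cat_table)          # True: both distances are sqrt(2)
print(one_hot("cat") @ one_hot("dog"))   # 0.0: the vectors share no information
```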

What is meaning?

32 of 96

What is meaning?

Do you know what the word tezgüino means?

(We hope you do not)

33 of 96

What is meaning?

Now look how this word is used in different contexts:

  • A bottle of tezgüino is on the table.
  • Everyone likes tezgüino.
  • Tezgüino makes you drunk.
  • We make tezgüino out of corn.

Can you understand what tezgüino means?

34 of 96

What is meaning?

With context, you can understand the meaning!

Tezgüino is a kind of alcoholic beverage made from corn.

35 of 96

What is meaning?

How did you do this?

36 of 96

What is meaning?

What other words fit into these contexts?

  (1) A bottle of ___ is on the table.
  (2) Everyone likes ___.
  (3) ___ makes you drunk.
  (4) We make ___ out of corn.

40 of 96

What is meaning?

             (1)  (2)  (3)  (4)  …
  tezgüino    1    1    1    1
  loud        0    0    0    0
  motor oil   1    0    0    1
  tortillas   0    1    0    1
  wine        1    1    1    0

Rows show contextual properties: 1 if a word can appear in the context, 0 if not.

Rows are similar (e.g., tezgüino and wine), so the meanings of the words are similar. Is this true?

This is the distributional hypothesis.
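As a rough check of "rows are similar", one can count how many contexts two words share. The rows below are copied from the slide; the similarity measure (a plain dot product) and the function name are assumptions chosen for simplicity.

```python
import numpy as np

# Rows copied from the slide: 1 if the word fits context (1)-(4), 0 if not.
rows = {
    "tezgüino":  [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor oil": [1, 0, 0, 1],
    "tortillas": [0, 1, 0, 1],
    "wine":      [1, 1, 1, 0],
}

def shared_contexts(a, b):
    """Number of contexts in which both words can appear."""
    return int(np.dot(rows[a], rows[b]))

target = "tezgüino"
scores = {w: shared_contexts(target, w) for w in rows if w != target}
print(scores)
print(max(scores, key=scores.get))  # wine
```

With this measure, wine is the word whose row is closest to tezgüino, which matches the intuition that both are alcoholic beverages.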

43 of 96

Distributional Hypothesis

Words which frequently appear in similar contexts have similar meaning. (Harris 1954, Firth 1957)

This can be used in practice to build word vectors!

Main idea: We have to put information about contexts into word vectors.

What comes next: different ways to do this

46 of 96

Count-Based Methods

48 of 96

Count-Based Methods: Idea

Let’s remember our main idea: We have to put information about contexts into word vectors.

Count-based methods take this idea quite literally :)

How: Put this information manually based on global corpus statistics.

51 of 96

Count-Based Methods: The General Pipeline

Need to define:

  • what is context
  • how to compute matrix elements

53 of 96

Simple: Co-Occurrence Counts

Example: I saw a cute grey cat playing in the garden … (contexts for cat in a 2-sized window: cute, grey, playing, in)

Context:

  • surrounding words in an L-sized window

Matrix element:

  • N(w, c) – number of times word w appears in context c
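A minimal sketch of collecting N(w, c) with an L-sized window, using the example sentence from the slide. The function name and the use of whitespace tokenization are assumptions.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count N(w, c): how often c occurs within `window` positions of w."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "I saw a cute grey cat playing in the garden".split()
counts = cooccurrence_counts(tokens, window=2)
print(sorted(c for (w, c) in counts if w == "cat"))
# ['cute', 'grey', 'in', 'playing']
```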

55 of 96

Positive Pointwise Mutual Information (PPMI)

Example: I saw a cute grey cat playing in the garden … (contexts for cat in a 2-sized window: cute, grey, playing, in)

Context:

  • surrounding words in an L-sized window

Matrix element:

  • PPMI(w, c) = max(0, PMI(w, c)), where

$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{N(w, c)\,|(w, c)|}{N(w)\,N(c)}$$

PMI measures how much one variable tells about another.
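The PPMI definition can be applied to a whole count matrix at once. In this sketch the count matrix is made up, |(w, c)| is taken to be the total number of counted pairs, and entries with zero counts are set to 0 (an assumption, since the logarithm is undefined there).

```python
import numpy as np

def ppmi(counts):
    """PPMI matrix from a word-by-context count matrix N(w, c)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                        # |(w, c)|: total number of pairs
    n_w = counts.sum(axis=1, keepdims=True)     # N(w)
    n_c = counts.sum(axis=0, keepdims=True)     # N(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (n_w * n_c))
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts give log(0); treat as 0
    return np.maximum(pmi, 0.0)

# Made-up counts: rows are words, columns are contexts.
N = [[2, 0],
     [1, 1]]
P = ppmi(N)
print(P)
```

For these counts, the top-left entry is log(4/3) and the bottom-right entry is log(2); the other two entries are clipped to 0.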

57 of 96

Latent Semantic Analysis (LSA): Understanding Documents

Context:

  • document d (from a collection D)

Matrix element:

  • tf-idf(w, d, D) = tf(w, d) · idf(w, D), where

$$\text{tf-idf}(w, d, D) = \underbrace{N(w, d)}_{\text{term frequency}} \cdot \underbrace{\log \frac{|D|}{|\{d \in D : w \in d\}|}}_{\text{inverse document frequency}}$$

The result is a words × documents matrix. Each element is the association between a word and a document
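A small sketch of the tf-idf matrix element, following the definition above with tf(w, d) = N(w, d). The two-document collection and the function name are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {(word, doc_index): tf-idf} for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))   # |{d in D : w in d}|
    scores = {}
    for i, doc in enumerate(docs):
        for w, n_wd in Counter(doc).items():                  # tf(w, d) = N(w, d)
            scores[(w, i)] = n_wd * math.log(n_docs / doc_freq[w])
    return scores

# Made-up mini collection D of two documents.
docs = [["cat", "cat", "garden"], ["garden", "corn"]]
scores = tf_idf(docs)
print(scores[("cat", 0)])     # 2 * log(2/1)
print(scores[("garden", 0)])  # 1 * log(2/2) = 0.0
```

A word that occurs in every document ("garden" here) gets weight 0, which is the point of the inverse document frequency term.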

61 of 96

Word2Vec

(Prediction-Based Method)

62 of 96

What’s inside:

Word2Vec

  • Idea
  • Objective Function
  • Training Procedure
  • Faster Training: Negative Sampling
  • Word2Vec versions: Skip-Gram vs CBOW

Final Notes

66 of 96

Word2Vec: Idea

Let’s remember our main idea: We have to put information about contexts into word vectors.

Word2Vec uses this idea differently from count-based methods:

How: Learn word vectors by teaching them to predict contexts.

  • Learned parameters: word vectors
  • Goal: make each vector “know” about the contexts of its word
  • How: train vectors to predict possible contexts from words (or, alternatively, words from contexts)

76 of 96

Word2Vec: High-Level pipeline

  • take a huge text corpus
  • go over the text with a sliding window, moving one word at a time.
  • for the central word, compute probabilities of context words;
  • adjust the vectors to increase these probabilities.

Example text: I saw a cute grey cat playing in the garden …

At each position of the window there is a central word $w_t$ and context words $w_{t-2}$, $w_{t-1}$, $w_{t+1}$, $w_{t+2}$. For example, when the central word is cat, the context words are cute, grey, playing, in.

Probabilities computed for the central word: $P(w_{t-2}|w_t)$, $P(w_{t-1}|w_t)$, $P(w_{t+1}|w_t)$, $P(w_{t+2}|w_t)$

Lena: Visualization idea is from the Stanford CS224n course.
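The last two pipeline steps can be sketched as follows. These slides do not state how the probabilities are computed; the softmax over dot products used below is the usual Word2Vec formulation and is an assumption here, as are the toy vocabulary, the vector size, the learning rate, and the function names.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cute", "grey", "cat", "playing", "in"]
dim = 3                                            # assumed embedding size
V = rng.normal(scale=0.1, size=(len(vocab), dim))  # vectors used for central words
U = rng.normal(scale=0.1, size=(len(vocab), dim))  # vectors used for context words

def context_probs(V, U, center_id):
    """P(w | center) for every w, as a softmax over dot products."""
    scores = U @ V[center_id]
    exp = np.exp(scores - scores.max())            # subtract max for numerical stability
    return exp / exp.sum()

def sgd_step(V, U, center_id, context_id, lr=0.5):
    """One gradient step on -log P(context | center); updates V and U in place."""
    err = context_probs(V, U, center_id)
    err[context_id] -= 1.0                         # gradient of the loss w.r.t. the scores
    grad_v = U.T @ err                             # gradient w.r.t. the central vector
    grad_U = np.outer(err, V[center_id])           # gradient w.r.t. all context vectors
    V[center_id] -= lr * grad_v
    U -= lr * grad_U

cat, grey = vocab.index("cat"), vocab.index("grey")
before = context_probs(V, U, cat)[grey]
sgd_step(V, U, cat, grey)
after = context_probs(V, U, cat)[grey]
print(before, after)   # the probability of "grey" given "cat" goes up
```

Repeating such steps for every (central word, context word) pair in the corpus is what "adjust the vectors to increase these probabilities" amounts to in this sketch.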

80 of 96

Word2Vec: Local contexts

Instead of entire documents, Word2Vec uses the words within k positions of each center word.

These words are called context words.

Example for k=3:

“It was a bright cold day in April, and the clocks were striking.”

Center word: shown in red on the slide (also called focus word).

Context words: shown in blue on the slide (also called target words).

Word2Vec treats every word in turn as a center word and uses all of its context words.

81 of 96

Word2Vec: Data generation (window size = 2)

Example: d1 = “king brave man”, d2 = “queen beautiful women”

  word       word one-hot encoding   neighbor    neighbor one-hot encoding
  king       [1,0,0,0,0,0]           brave       [0,1,0,0,0,0]
  king       [1,0,0,0,0,0]           man         [0,0,1,0,0,0]
  brave      [0,1,0,0,0,0]           king        [1,0,0,0,0,0]
  brave      [0,1,0,0,0,0]           man         [0,0,1,0,0,0]
  man        [0,0,1,0,0,0]           king        [1,0,0,0,0,0]
  man        [0,0,1,0,0,0]           brave       [0,1,0,0,0,0]
  queen      [0,0,0,1,0,0]           beautiful   [0,0,0,0,1,0]
  queen      [0,0,0,1,0,0]           women       [0,0,0,0,0,1]
  beautiful  [0,0,0,0,1,0]           queen       [0,0,0,1,0,0]
  beautiful  [0,0,0,0,1,0]           women       [0,0,0,0,0,1]
  women      [0,0,0,0,0,1]           queen       [0,0,0,1,0,0]
  women      [0,0,0,0,0,1]           beautiful   [0,0,0,0,1,0]

82 of 96

Word2Vec: Data generation (window size = 2)

Example: d1 = “king brave man”, d2 = “queen beautiful women”

  word       word one-hot encoding   neighbors          neighbors encoding
  king       [1,0,0,0,0,0]           brave, man         [0,1,1,0,0,0]
  brave      [0,1,0,0,0,0]           king, man          [1,0,1,0,0,0]
  man        [0,0,1,0,0,0]           king, brave        [1,1,0,0,0,0]
  queen      [0,0,0,1,0,0]           beautiful, women   [0,0,0,0,1,1]
  beautiful  [0,0,0,0,1,0]           queen, women       [0,0,0,1,0,1]
  women      [0,0,0,0,0,1]           queen, beautiful   [0,0,0,1,1,0]
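The data generation for d1 and d2 can be reproduced with a short script. The function names are hypothetical, and the vocabulary order is taken from the one-hot encodings shown on the slides.

```python
def training_pairs(docs, window=2):
    """List (word, neighbor) pairs: every word with each word within `window` positions."""
    pairs = []
    for doc in docs:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((word, tokens[j]))
    return pairs

docs = ["king brave man", "queen beautiful women"]
vocab = ["king", "brave", "man", "queen", "beautiful", "women"]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

def neighbors_encoding(word, pairs):
    """Merge all neighbors of `word` into a single vector with several ones."""
    vec = [0] * len(vocab)
    for w, n in pairs:
        if w == word:
            vec[vocab.index(n)] = 1
    return vec

pairs = training_pairs(docs)
print(pairs[:2])                          # [('king', 'brave'), ('king', 'man')]
print(neighbors_encoding("king", pairs))  # [0, 1, 1, 0, 0, 0]
```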

83 of 96

Word2Vec: main context representation models

Continuous Bag of Words (CBOW): input context words w-2, w-1, w1, w2 → sum and projection → output w0

Skip-gram: input w0 → projection → output context words w-2, w-1, w1, w2

  • Word2Vec is a predictive model.
  • Will focus on the Skip-gram model

84 of 96

How does Word2Vec work?

Represent each word as a d-dimensional vector.

Represent each context as a d-dimensional vector.

Initialize all vectors to random weights.

Arrange vectors in two matrices, W and C.

91 of 96

Word2Vec: Neural Network representation

Network layout: input layer of size |Vw| → hidden layer → output layer of size |Vc| with sigmoid activation; w1 and w2 label the weights.

Training examples (one-hot input → target output over the context vocabulary):

  king       [1,0,0,0,0,0] → [0,1,1,0,0,0]  (brave, man)
  brave      [0,1,0,0,0,0] → [1,0,1,0,0,0]  (king, man)
  man        [0,0,1,0,0,0] → [1,1,0,0,0,0]  (king, brave)
  queen      [0,0,0,1,0,0] → [0,0,0,0,1,1]  (beautiful, women)
  beautiful  [0,0,0,0,1,0] → [0,0,0,1,0,1]  (queen, women)
  women      [0,0,0,0,0,1] → [0,0,0,1,1,0]  (queen, beautiful)
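A forward pass through such a network might look like the sketch below. Several details are assumptions, because the slides show only the diagram: w1 is taken as the input-to-hidden weight matrix, w2 as the hidden-to-output matrix, the hidden layer is linear, the hidden size is 2, and the squared error at the end is just one possible way to compare output and target.

```python
import numpy as np

vocab = ["king", "brave", "man", "queen", "beautiful", "women"]
hidden_size = 2                                  # assumed; the slides do not state it
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(len(vocab), hidden_size))  # input -> hidden weights
w2 = rng.normal(scale=0.1, size=(hidden_size, len(vocab)))  # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(word):
    x = np.zeros(len(vocab))
    x[vocab.index(word)] = 1.0                   # one-hot input, size |Vw|
    hidden = x @ w1                              # selects the row of w1 for `word`
    return sigmoid(hidden @ w2)                  # one value in (0, 1) per context word

target = np.array([0, 1, 1, 0, 0, 0])            # neighbors of "king": brave, man
output = forward("king")
print(output.round(3))
print(np.mean((output - target) ** 2))           # assumed error measure for illustration
```

Because the input is one-hot, the hidden layer is simply the row of w1 that belongs to the input word, so that row serves as the word's vector.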

92 of 96

Skip-gram: Training method

93 of 96

Skip-gram: Negative sampling

94 of 96

Skip-gram: Example

95 of 96

Skip-gram: How to select negative samples?

96 of 96

Relations Learned by Word2Vec

A relation is defined by the vector displacement in the first column. For each start word in the other column, the closest displaced word is shown.

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space”. arXiv, 2013.
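The "closest displaced word" idea can be illustrated with hand-made vectors. The vectors below are NOT learned embeddings; they are made up so that the arithmetic is easy to follow, and the function name is hypothetical.

```python
import numpy as np

# Hand-made toy vectors (NOT learned embeddings); axes: royalty, male, female.
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.0, 0.0, 0.0]),
}

def closest_displaced(a, b, start):
    """Apply the displacement b - a to `start` and return the nearest other word."""
    query = vectors[start] + (vectors[b] - vectors[a])
    candidates = [w for w in vectors if w not in (a, b, start)]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - query))

print(closest_displaced("man", "woman", "king"))  # queen
```

With real Word2Vec vectors the displaced point rarely coincides with a word exactly, so the nearest neighbor (often by cosine similarity rather than Euclidean distance) is reported.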