Natural Language Processing
Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
What is natural language processing used for?
Some of the main functions that natural language processing algorithms perform are:
Text classification. This involves assigning tags to texts to put them in categories. This can be useful for sentiment analysis, which helps the natural language processing algorithm determine the sentiment, or emotion, behind a text. For example, when brand A is mentioned in X number of texts, the algorithm can determine how many of those mentions were positive and how many were negative. It can also be useful for intent detection, which helps predict what the speaker or writer may do based on the text they are producing.
Text extraction. This involves automatically summarizing text and finding important pieces of data. One example of this is keyword extraction, which pulls the most important words from the text and can be useful for search engine optimization. Doing this with natural language processing requires some programming; it is not completely automated. However, there are plenty of simple keyword extraction tools that automate most of the process, so the user only has to set parameters within the program. For example, a tool might pull out the most frequently used words in the text. Another example is named entity recognition, which extracts the names of people, places and other entities from text.
Machine translation. This is the process by which a computer translates text from one language, such as English, to another language, such as French, without human intervention.
Natural language generation. This involves using natural language processing algorithms to analyze unstructured data and automatically produce content based on that data. One example of this is in language models such as GPT-3, which are able to analyze unstructured text and then generate believable articles based on that text.
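As a rough illustration of the frequency-based keyword extraction mentioned under text extraction, the sketch below counts words and returns the most frequent ones. The function name `top_keywords`, the tiny stop-word list and the sample sentence are all made up for this example; real tools use much larger stop-word lists and more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical, very small stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}

def top_keywords(text, n=3):
    """Return the n most frequent non-stop-words in text, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

sample = "The cat sat on the mat. The cat chased a mouse, and the mouse hid."
print(top_keywords(sample, n=2))
```

Here the only parameter the user sets is `n`, which matches the idea that such tools automate most of the process.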
The functions listed above are used in a variety of real-world applications, including:
customer feedback analysis -- where AI analyzes social media reviews;
customer service automation -- where voice assistants on the other end of a customer service phone line use speech recognition to understand what the customer is saying, so that they can direct the call correctly;
automatic translation -- using tools such as Google Translate, Bing Translator and Translate Me;
academic research and analysis -- where AI is able to analyze huge amounts of academic material and research papers not just based on the metadata of the text, but the text itself;
analysis and categorization of medical records -- where AI uses insights to predict, and ideally prevent, disease;
word processors used for plagiarism detection and proofreading -- using tools such as Grammarly and Microsoft Word;
stock forecasting and insights into financial trading -- using AI to analyze market history and 10-K documents, which contain comprehensive summaries about a company's financial performance;
talent recruitment in human resources;
automation of routine litigation tasks -- one example is the artificially intelligent attorney.
Research being done on natural language processing revolves around search, especially enterprise search. This involves having users query data sets in the form of a question that they might pose to another person. The machine interprets the important elements of the human language sentence, which correspond to specific features in a data set, and returns an answer.
NLP can be used to interpret free, unstructured text and make it analyzable. There is a tremendous amount of information stored in free text files, such as patients' medical records. Before deep learning-based NLP models, this information was inaccessible to computer-assisted analysis and could not be analyzed in any systematic way. With NLP, analysts can sift through massive amounts of free text to find relevant information.
Sentiment analysis is another primary use case for NLP. Using sentiment analysis, data scientists can assess comments on social media to see how their business's brand is performing, or review notes from customer service teams to identify areas where people want the business to perform better.
Word Embeddings
In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis.
Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.
Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
Why do we need word representations?
Text (your input): I saw a cat.
Sequence of tokens: I, saw, a, cat, .
Word representation - vector (input for your model/algorithm)
Your algorithm (e.g., neural network): any algorithm for solving a task
How it works: Look-up Table
Tokens: I, saw, a, cat, .
Token index in the vocabulary: 39 1592 10 2548 5
The table holds one vector per vocabulary token (vocabulary size), and each vector has length equal to the embedding dimension.
“Look up” a token embedding in the table: each token index (39, then 1592, 10, 2548, 5) selects one vector.
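The look-up can be sketched with a NumPy array, assuming a randomly initialized table. The sizes (3000 tokens, 4 dimensions) are illustrative assumptions; only the token indices come from the slide.

```python
import numpy as np

vocab_size, embedding_dim = 3000, 4          # illustrative sizes (assumptions)
rng = np.random.default_rng(0)

# One vector (row) per vocabulary token.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [39, 1592, 10, 2548, 5]          # "I saw a cat ." as on the slide
embeddings = embedding_table[token_ids]      # integer indexing selects the rows

print(embeddings.shape)                      # one embedding per token
```

Indexing with a list of integers returns the selected rows in order, so the result has one embedding per input token.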
Note UNKs: Out-of-Vocabulary Tokens
I saw a &%! . → I saw a UNK .
&%! is not in the vocabulary
Vocabulary is chosen in advance
Therefore, some tokens may be “unknown”: you can use a special token for them
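A minimal sketch of mapping tokens to indices with a special unknown token follows. The toy vocabulary reuses the indices from the look-up table slide; placing UNK at index 0 and the helper name `to_ids` are assumptions made for this example.

```python
UNK = "UNK"

# Toy vocabulary chosen in advance; index 0 for UNK is an assumption.
vocab = {UNK: 0, "I": 39, "saw": 1592, "a": 10, "cat": 2548, ".": 5}

def to_ids(tokens, vocab):
    """Map each token to its index, falling back to the UNK index."""
    unk_id = vocab[UNK]
    return [vocab.get(tok, unk_id) for tok in tokens]

ids = to_ids(["I", "saw", "a", "&%!", "."], vocab)
print(ids)
```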
How can we get word representations?
I saw a cat. → I, saw, a, cat, . → Your algorithm (e.g., neural network)
In the following: how can we get these representations?
One-Hot Vectors: Represent Words as Discrete Symbols
dog: 0…010…0….0…0
cat: 0…0…010….0…0
table: 0…0…0….0010…
One is 1, the rest are 0
Embedding dimension = vocabulary size
Any problems?
Problems: e.g., cat is as close to dog as it is to table!
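The problem can be checked numerically. The sketch below assumes a toy three-word vocabulary, so the vectors have length 3 rather than the full vocabulary size; the helper name `dist` is made up.

```python
import numpy as np

vocab = ["dog", "cat", "table"]              # toy vocabulary (assumption)
one_hot = np.eye(len(vocab))                 # row i is the one-hot vector of word i
vec = dict(zip(vocab, one_hot))

def dist(a, b):
    """Euclidean distance between the one-hot vectors of two words."""
    return float(np.linalg.norm(vec[a] - vec[b]))

print(dist("cat", "dog"), dist("cat", "table"))
```

Any two distinct one-hot vectors are orthogonal and equally far apart, so the representation carries no information about meaning.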
What is meaning?
Do you know what the word tezgüino means? (We hope you do not)
Now look how this word is used in different contexts:
A bottle of tezgüino is on the table.
Everyone likes tezgüino.
Tezgüino makes you drunk.
We make tezgüino out of corn.
Can you understand what tezgüino means?
With context, you can understand the meaning!
Tezgüino is a kind of alcoholic beverage made from corn.
How did you do this?
What other words fit into these contexts?
(1) A bottle of ___ is on the table.
(2) Everyone likes ___.
(3) ___ makes you drunk.
(4) We make ___ out of corn.
word | (1) | (2) | (3) | (4) |
tezgüino | 1 | 1 | 1 | 1 |
loud | 0 | 0 | 0 | 0 |
motor oil | 1 | 0 | 0 | 1 |
tortillas | 0 | 1 | 0 | 1 |
wine | 1 | 1 | 1 | 0 |
Rows show contextual properties: 1 if a word can appear in the context, 0 if not. The columns continue for further contexts (…).
Rows are similar (e.g., tezgüino and wine) → meanings of the words are similar
Is this true? This is the distributional hypothesis
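Row similarity can be made concrete with cosine similarity over the 0/1 context rows from the table. Cosine is one possible choice of similarity, not something the slides prescribe, and returning 0 for an all-zero row is a convention chosen here.

```python
import numpy as np

# Context rows (1)-(4) for each word, as in the table.
contexts = {
    "tezgüino":  [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor oil": [1, 0, 0, 1],
    "tortillas": [0, 1, 0, 1],
    "wine":      [1, 1, 1, 0],
}

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

target = contexts["tezgüino"]
scores = {w: cosine(target, row) for w, row in contexts.items() if w != "tezgüino"}
closest = max(scores, key=scores.get)
print(closest)
```

With these four contexts, wine has the row most similar to tezgüino.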
Distributional Hypothesis
Words which frequently appear in similar contexts have similar meaning. (Harris 1954, Firth 1957)
This can be used in practice to build word vectors!
Main idea: We have to put information about contexts into word vectors.
What comes next: different ways to do this
Count-Based Methods
Count-Based Methods: Idea
Let’s remember our main idea: We have to put information about contexts into word vectors.
How: Put this information manually based on global corpus statistics.
Count-based methods take this idea quite literally :)
Count-Based Methods: The General Pipeline
Need to define: the context, and how to compute each matrix element.
Simple: Co-Occurrence Counts
… I saw a cute grey cat playing in the garden …
Context: words in a 2-sized window around the word (contexts for cat: cute, grey, playing, in)
Matrix element: N(w, c), the number of times word w appears in context c
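The counts N(w, c) for a 2-sized window can be collected as sketched below. Whitespace tokenization and the function name `cooccurrence_counts` are assumptions made for this example.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs where context is within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "I saw a cute grey cat playing in the garden".split()
N = cooccurrence_counts(tokens, window=2)

cat_contexts = sorted(c for (w, c) in N if w == "cat")
print(cat_contexts)
```

For cat, this recovers exactly the four context words highlighted on the slide.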
Positive Pointwise Mutual Information (PPMI)
… I saw a cute grey cat playing in the garden …
Context: words in a 2-sized window around the word (contexts for cat: cute, grey, playing, in)
Matrix element: $\mathrm{PPMI}(w, c) = \max(0, \mathrm{PMI}(w, c))$, where
$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{N(w, c)\,|(w, c)|}{N(w)\,N(c)}$
Here |(w, c)| is the total number of word-context pairs.
PMI measures how much one variable tells about another.
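A small sketch of turning a count matrix into PPMI values follows, using the count-based form of the PMI formula. The 2x2 count matrix is made up, and setting the PMI of zero-count cells to 0 before clipping is a convention chosen here.

```python
import numpy as np

def ppmi(counts):
    """PPMI matrix from a word-by-context count matrix N(w, c)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                       # |(w, c)|: all word-context pairs
    n_w = counts.sum(axis=1, keepdims=True)    # N(w), row totals
    n_c = counts.sum(axis=0, keepdims=True)    # N(c), column totals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (n_w * n_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts give log(0); treat as 0
    return np.maximum(pmi, 0.0)                # keep only positive PMI

toy_counts = np.array([[2, 0],
                       [1, 1]])                # made-up counts for illustration
print(ppmi(toy_counts))
```

Negative PMI values (pairs that co-occur less often than chance) are clipped to zero, which is what the "positive" in PPMI refers to.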
Latent Semantic Analysis (LSA): Understanding Documents
Context: the document d (from a collection D) in which the word appears
The matrix is a word-document matrix: each element is the association between a word and a document.
Matrix element: $\text{tf-idf}(w, d, D) = N(w, d) \cdot \log \frac{|D|}{|\{d \in D : w \in d\}|}$
N(w, d) is the term frequency; the logarithm is the inverse document frequency.
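The tf-idf matrix element can be computed directly from the formula. The three toy documents and the function name `tf_idf` are assumptions for this sketch; words that appear in no document are given the value 0 here to avoid dividing by zero.

```python
import math
from collections import Counter

# Toy document collection D (an assumption for illustration).
docs = [
    "i saw a cat".split(),
    "a cat saw a dog".split(),
    "we make wine".split(),
]

def tf_idf(word, doc, docs):
    """N(w, d) * log(|D| / |{d in D : w in d}|)."""
    tf = Counter(doc)[word]                      # term frequency N(w, d)
    df = sum(1 for d in docs if word in d)       # number of documents containing w
    return tf * math.log(len(docs) / df) if df else 0.0

print(tf_idf("cat", docs[0], docs))
print(tf_idf("wine", docs[2], docs))
```

A word found in fewer documents (wine) gets a higher weight than one spread across several (cat).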
Word2Vec (Prediction-based Method)
Word2Vec: Idea
Let’s remember our main idea: We have to put information about contexts into word vectors.
Word2Vec uses this idea differently from count-based methods:
How: Learn word vectors by teaching them to predict contexts.
Word2Vec: High-Level pipeline
… I saw a cute grey cat playing in the garden …
At each position t, $w_t$ is the central word and $w_{t-2}$, $w_{t-1}$, $w_{t+1}$, $w_{t+2}$ are the context words.
For the central word, compute the probabilities of the context words: $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$.
Example: for the central word cat, the context words are cute, grey, playing, in.
The window then slides over the text, so each word in turn becomes the central word.
Lena: Visualization idea is from the Stanford CS224n course.
Word2Vec: Local contexts
Instead of entire documents, Word2Vec uses the words within k positions of each center word.
These words are called context words.
Example for k=3:
“It was a bright cold day in April, and the clocks were striking”.
On the slide, the center word (also called focus word) is shown in red and the context words (also called target words) in blue.
Word2Vec considers all words as center words, and all their context words.
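The k=3 context can be extracted as sketched below. The extraction lost the slide's coloring, so the center word "day" is picked here purely for illustration, and the regex tokenization (which drops punctuation) is an assumption.

```python
import re

def context_words(tokens, center_index, k=3):
    """Return the words within k positions of the center word, left then right."""
    left = tokens[max(0, center_index - k):center_index]
    right = tokens[center_index + 1:center_index + 1 + k]
    return left + right

sentence = "It was a bright cold day in April, and the clocks were striking"
tokens = re.findall(r"\w+", sentence)      # simple tokenization (assumption)
center = tokens.index("day")               # illustrative choice of center word

print(context_words(tokens, center, k=3))
```

Near the edges of the text the window is simply truncated, so the first word has context words only to its right.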
Word2Vec: Data generation (window size = 2)
Example: d1 = “king brave man” , d2 = “queen beautiful woman”
One row per (word, neighbor) pair:
word | Word one hot encoding | neighbor | Neighbor one hot encoding |
king | [1,0,0,0,0,0] | brave | [0,1,0,0,0,0] |
king | [1,0,0,0,0,0] | man | [0,0,1,0,0,0] |
brave | [0,1,0,0,0,0] | king | [1,0,0,0,0,0] |
brave | [0,1,0,0,0,0] | man | [0,0,1,0,0,0] |
man | [0,0,1,0,0,0] | king | [1,0,0,0,0,0] |
man | [0,0,1,0,0,0] | brave | [0,1,0,0,0,0] |
queen | [0,0,0,1,0,0] | beautiful | [0,0,0,0,1,0] |
queen | [0,0,0,1,0,0] | woman | [0,0,0,0,0,1] |
beautiful | [0,0,0,0,1,0] | queen | [0,0,0,1,0,0] |
beautiful | [0,0,0,0,1,0] | woman | [0,0,0,0,0,1] |
woman | [0,0,0,0,0,1] | queen | [0,0,0,1,0,0] |
woman | [0,0,0,0,0,1] | beautiful | [0,0,0,0,1,0] |
The same data with the neighbors of each word grouped into one row:
word | Word one hot encoding | neighbors | Neighbors encoding |
king | [1,0,0,0,0,0] | brave, man | [0,1,1,0,0,0] |
brave | [0,1,0,0,0,0] | king, man | [1,0,1,0,0,0] |
man | [0,0,1,0,0,0] | king, brave | [1,1,0,0,0,0] |
queen | [0,0,0,1,0,0] | beautiful, woman | [0,0,0,0,1,1] |
beautiful | [0,0,0,0,1,0] | queen, woman | [0,0,0,1,0,1] |
woman | [0,0,0,0,0,1] | queen, beautiful | [0,0,0,1,1,0] |
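The (word, neighbor) rows of the data generation tables can be produced with a short script. This is a sketch: it writes the last word of d2 as "woman", builds the vocabulary in order of appearance, and uses made-up helper names `one_hot` and `training_pairs`.

```python
docs = [["king", "brave", "man"], ["queen", "beautiful", "woman"]]

# Vocabulary in order of first appearance; each word occurs once in this toy corpus.
vocab = [w for doc in docs for w in doc]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot encoding of a word over the toy vocabulary."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def training_pairs(doc, window=2):
    """All (word, neighbor) pairs where the neighbor is within the window."""
    pairs = []
    for i, word in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                pairs.append((word, doc[j]))
    return pairs

pairs = [p for doc in docs for p in training_pairs(doc)]
for word, neighbor in pairs:
    print(word, one_hot(word), neighbor, one_hot(neighbor))
```

Summing the neighbor one-hot vectors of a word gives the grouped encoding, for example [0,1,1,0,0,0] for king.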
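Word2Vec: main context representation models
Continuous Bag of Words (CBOW): the input is the context words w-2, w-1, w1, w2; they are summed and projected, and the output is the center word w0.
Skip-gram: the input is the center word w0; it is projected, and the output is the context words w-2, w-1, w1, w2.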
How does Word2Vec work?
Represent each word as a d-dimensional vector.
Represent each context as a d-dimensional vector.
Initialize all vectors to random weights.
Arrange vectors in two matrices, W and C.
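The setup above can be sketched in NumPy. The dimension d = 3, the initialization scale and the toy vocabulary are illustrative assumptions; scoring a word against every context with a sigmoid of the dot product is one common formulation, shown here without any training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "brave", "man", "queen", "beautiful", "woman"]
d = 3                                              # embedding dimension (illustrative)

# Random initialization: one d-dimensional row per word (W) and per context (C).
W = rng.normal(scale=0.1, size=(len(vocab), d))
C = rng.normal(scale=0.1, size=(len(vocab), d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_scores(word):
    """Sigmoid of the dot product between a word vector and every context vector."""
    w = W[vocab.index(word)]
    return sigmoid(C @ w)

scores = context_scores("king")
print(scores.shape)
```

Before training, all scores are close to 0.5 because the random vectors are small; learning adjusts W and C so that observed contexts score higher.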
Word2Vec: Neural Network representation
Input layer (|Vw| units, one hot encoding of the word) → weights w1 → Hidden layer → weights w2 → Output (sigmoid, |Vc| units)
Training examples for d1 and d2:
input word | Input layer | context words | Output target |
king | [1,0,0,0,0,0] | brave, man | [0,1,1,0,0,0] |
brave | [0,1,0,0,0,0] | king, man | [1,0,1,0,0,0] |
man | [0,0,1,0,0,0] | king, brave | [1,1,0,0,0,0] |
queen | [0,0,0,1,0,0] | beautiful, woman | [0,0,0,0,1,1] |
beautiful | [0,0,0,0,1,0] | queen, woman | [0,0,0,1,0,1] |
woman | [0,0,0,0,0,1] | queen, beautiful | [0,0,0,1,1,0] |
Skip-gram: Training method
Skip-gram: Negative sampling
Skip-gram: Example
Skip-gram: How to select negative samples?
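The content of these slides did not survive extraction, so the following is only a sketch of negative sampling as it is commonly described for word2vec, not a reconstruction of the slides. It assumes made-up word counts, a noise distribution proportional to counts raised to the 3/4 power, and two negative samples per positive pair.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "brave", "man", "queen", "beautiful", "woman"]
counts = np.array([5, 2, 4, 5, 2, 4], dtype=float)   # made-up word frequencies

# Noise distribution for negatives: counts ** 0.75, normalized to sum to 1.
noise = counts ** 0.75
noise /= noise.sum()

d = 3
W = rng.normal(scale=0.1, size=(len(vocab), d))      # word vectors
C = rng.normal(scale=0.1, size=(len(vocab), d))      # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(word_id, context_id, negative_ids):
    """Loss for one positive (word, context) pair and its sampled negatives."""
    w = W[word_id]
    positive = -np.log(sigmoid(C[context_id] @ w))
    negative = -np.sum(np.log(sigmoid(-(C[negative_ids] @ w))))
    return float(positive + negative)

negatives = rng.choice(len(vocab), size=2, p=noise)
loss = negative_sampling_loss(vocab.index("king"), vocab.index("brave"), negatives)
print(loss)
```

This sketch does not filter out negatives that happen to equal the true context word; practical implementations usually resample in that case.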
Relations Learned by Word2Vec
A relation is defined by the vector displacement in the first column. For each start word in the other column, the closest displaced word is shown.
“Efficient Estimation of Word Representations in Vector Space”, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, arXiv 2013