Features Engineering and Text Representation Learning
Natural Language Processing
Prof. Jebran Khan
Text Representation Learning
NLP Pipeline- Feature Engineering
Bag of Words
m1: A word of text.
m2: A word is a token.
m3: Tokens and features.
m4: Few features of text.
Training data
Tokens: a, word, of, text, a, word, is, a, token, tokens, and, features, few, features, of, text
Bag of words
Features (one feature per unique token): a, word, of, text, is, token, tokens, and, features, few
Bag of Words: Example
m1: A word of text.
m2: A word is a token.
m3: Tokens and features.
m4: Few features of text.
test1: Some features for a text example.

Selected features, Training X (m1–m4) and Test X (test1):

| Feature | m1 | m2 | m3 | m4 | test1 |
| a | 1 | 1 | 0 | 0 | 1 |
| word | 1 | 1 | 0 | 0 | 0 |
| of | 1 | 0 | 0 | 1 | 0 |
| text | 1 | 0 | 0 | 1 | 1 |
| is | 0 | 1 | 0 | 0 | 0 |
| token | 0 | 1 | 0 | 0 | 0 |
| tokens | 0 | 0 | 1 | 0 | 0 |
| and | 0 | 0 | 1 | 0 | 0 |
| features | 0 | 0 | 1 | 1 | 1 |
| few | 0 | 0 | 0 | 1 | 0 |

Words of test1 that are not in the training vocabulary (“some”, “for”, “example”) are out of vocabulary and produce no feature.

Use bag of words when you have a lot of data and can use many features.
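The construction above can be sketched in plain Python (binary presence features, vocabulary in order of first occurrence; the helper names are illustrative, not from any particular library):

```python
def tokenize(text):
    # lowercase and strip periods before splitting on whitespace
    return text.lower().replace(".", "").split()

def build_vocab(docs):
    # one feature per unique token, in order of first occurrence
    vocab = []
    for doc in docs:
        for tok in tokenize(doc):
            if tok not in vocab:
                vocab.append(tok)
    return vocab

def vectorize(doc, vocab):
    # binary presence vector; out-of-vocabulary tokens are simply dropped
    toks = set(tokenize(doc))
    return [1 if term in toks else 0 for term in vocab]

train = ["A word of text.", "A word is a token.",
         "Tokens and features.", "Few features of text."]
vocab = build_vocab(train)
# vocab: ['a', 'word', 'of', 'text', 'is', 'token', 'tokens', 'and', 'features', 'few']
X_train = [vectorize(d, vocab) for d in train]
x_test = vectorize("Some features for a text example.", vocab)
# x_test: [1, 0, 0, 1, 0, 0, 0, 0, 1, 0] -- "some", "for", "example" are OOV
```

Note that "some", "for", and "example" leave no trace in x_test: a test-time vector can only express the features seen in training.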
BoW feature extraction steps:
1. Normalize the corpus
2. Extract BoW features
3. Record feature positions
4. Build the corpus representation
5. Produce BoW feature vectors
Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”
| | Nah I | I don’t | don’t think | think he | he goes | goes to | to usf | … | Text FA | FA to | 87121 to | to receive | receive entry |
| Message 2: | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 1 | 1 | 1 | 1 |
Use when you have a LOT of data and can use MANY features
N-Grams: Characters
Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”
| | Na | ah | h<space> | <space>I | I<space> | <space>d | do | … | <space>e | en | nt | tr | ry |
| Message 2: | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 1 | 1 | 1 | 1 |
Helps with out-of-dictionary words and spelling errors
Fixed number of features for a given N (but the feature space can be very large)
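A minimal n-gram extractor matching the bigram tables above (word and character variants; plain Python, no library assumed):

```python
def word_ngrams(text, n=2):
    # consecutive runs of n whitespace-separated tokens
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def char_ngrams(text, n=2):
    # consecutive runs of n characters, spaces included
    return [text[i:i + n] for i in range(len(text) - n + 1)]

msg2 = "Text FA to 87121 to receive entry"
bigrams = word_ngrams(msg2)
# ['Text FA', 'FA to', 'to 87121', '87121 to', 'to receive', 'receive entry']
char_bigrams = char_ngrams("Nah I")
# ['Na', 'ah', 'h ', ' I']
```

Because spaces count as characters, character n-grams still capture word boundaries, which is what makes them robust to spelling errors.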
Bag-of-n-grams pipeline: normalize the corpus, then extract bag-of-n-grams feature vectors.
TF-IDF: Term Frequency – Inverse Document Frequency

TermFrequency(<term>, <document>) = % of the words in <document> that are <term>

InverseDocumentFrequency(<term>, <documents>) = log( # documents / # documents that contain <term> )
Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”

Features for Message 2:

| | Nah | I | don't | think | he | goes | to | usf | Text | FA | 87121 | receive | entry |
| BoW | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| TF-IDF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .099 | .099 | .099 | .099 | .099 |

TF captures importance to the document; IDF captures novelty across the corpus.
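The .099 entries in the table can be checked directly from the two formulas (natural log assumed for the IDF):

```python
import math

msg2 = "Text FA to 87121 to receive entry".split()  # 7 tokens
n_docs = 2  # the corpus is just Message 1 and Message 2

# "Text" occurs once in Message 2 and in 1 of the 2 documents
tf_text = msg2.count("Text") / len(msg2)   # 1/7
idf_text = math.log(n_docs / 1)            # log(2/1)
# tf_text * idf_text rounds to 0.099

# "to" occurs twice in Message 2 but in both documents
tf_to = msg2.count("to") / len(msg2)       # 2/7
idf_to = math.log(n_docs / 2)              # log(2/2) = 0
# tf_to * idf_to == 0: a term in every document carries no novelty
```

This is why “to” scores 0 in the TF-IDF row despite a nonzero BoW count.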
TF-IDF pipeline: normalize corpus → TF-IDF feature vectors

Steps:
1. Collect words
2. Create bag of words for term frequency
3. Create document frequency matrix
4. Calculate IDF
5. Compute TF-IDF
6. Compute normalized TF-IDF
7. Compute TF-IDF for a new example
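The steps above can be sketched end-to-end in plain Python (natural-log IDF and L2 normalization assumed; function names are illustrative):

```python
import math

def tfidf_fit(docs):
    # steps 1-4: collect words, document frequencies, and IDF
    toks = [d.lower().split() for d in docs]
    vocab = sorted({t for d in toks for t in d})
    df = {t: sum(t in d for d in toks) for t in vocab}     # document frequency
    idf = {t: math.log(len(docs) / df[t]) for t in vocab}  # inverse document frequency
    return vocab, idf

def tfidf_transform(doc, vocab, idf):
    # steps 5-7: TF-IDF for any document, then L2-normalize
    toks = doc.lower().split()
    vec = [toks.count(t) / len(toks) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

docs = ["nah i dont think he goes to usf",
        "text fa to 87121 to receive entry"]
vocab, idf = tfidf_fit(docs)
vec = tfidf_transform(docs[1], vocab, idf)
# "to" occurs in both documents, so its IDF (and TF-IDF weight) is 0
```

A new example is transformed with the same `vocab` and `idf` fitted on the training corpus; its unseen words contribute nothing.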
How to represent words?
N-gram language models
It is 76 F and ____ .
P(w ∣ it is 76 F and) = [0.0001, 0.1, 0, 0, 0.002, …, 0.3, …, 0]
(e.g., a small probability for “red”, a larger one for “sunny”)

Text classification
“I like this movie.” → w(1) = [0, 1, 0, 0, 0, …, 1, …, 1]
“I don’t like this movie.” → w(2) = [0, 1, 0, 1, 0, …, 1, …, 1]
(the extra 1 is the feature for “don’t”)
P(y = 1 ∣ x) = σ(θ⊺w + b)
Representing words as discrete symbols
In traditional NLP, we regard words as discrete symbols: hotel, conference, motel — a localist representation
Words can be represented by one-hot vectors (a single 1, the rest 0’s):
hotel = [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
motel = [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
Vector dimension = number of words in vocabulary (e.g., 500,000)
Challenge: how do we compute the similarity of two words? Any two distinct one-hot vectors are orthogonal, so this representation encodes no notion of similarity.
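A quick check of that orthogonality (vocabulary size and the 1-positions are illustrative, taken from the vectors above):

```python
V = 16  # toy vocabulary size
hotel = [0] * V; hotel[11] = 1  # "hotel" is the 12th vocabulary entry
motel = [0] * V; motel[3] = 1   # "motel" is the 4th vocabulary entry

# dot product of two distinct one-hot vectors is always 0:
dot = sum(h * m for h, m in zip(hotel, motel))
# so cosine similarity is 0, even though hotel and motel are near-synonyms
```

The same holds for any pair of distinct words, which is exactly why one-hot vectors cannot support similarity computations.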
Representing words by their context
Distributional hypothesis: words that occur in similar contexts tend to have similar meanings.

“You shall know a word by the company it keeps” (J.R. Firth, 1957)

One of the most successful ideas of modern statistical NLP!

When a word such as banking appears in text, its surrounding context words will represent banking.
Distributional hypothesis
Suppose you encounter the unknown word “tejuino” in these contexts:
C1: A bottle of ___ is on the table.
C2: Everybody likes ___.
C3: Don’t have ___ before you drive.
C4: We make ___ out of corn.
Distributional hypothesis
| word | C1 | C2 | C3 | C4 |
| tejuino | 1 | 1 | 1 | 1 |
| loud | 0 | 0 | 0 | 0 |
| motor-oil | 1 | 0 | 0 | 0 |
| tortillas | 0 | 1 | 0 | 1 |
| choices | 0 | 1 | 0 | 0 |
| wine | 1 | 1 | 1 | 0 |
“words that occur in similar contexts tend to have similar meanings”
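Treating each row of the context table as a vector, cosine similarity makes that intuition computable; “wine”, which shares three of four contexts with “tejuino”, comes out most similar:

```python
import math

# context-count rows from the table above (columns C1-C4)
rows = {
    "tejuino":   [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor-oil": [1, 0, 0, 0],
    "tortillas": [0, 1, 0, 1],
    "choices":   [0, 1, 0, 0],
    "wine":      [1, 1, 1, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0  # all-zero rows get 0

sims = {w: cosine(rows["tejuino"], v) for w, v in rows.items() if w != "tejuino"}
best = max(sims, key=sims.get)
# best == "wine": tejuino is probably a drink, like wine
```

The zero-vector guard matters: “loud” occurs in none of the contexts, so its similarity is defined as 0 rather than dividing by zero.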
Words as vectors
Sparse vs dense vectors
Dense vectors
Why dense vectors?
• Short vectors are easier to use as features in ML systems
• Dense vectors may generalize better than storing explicit counts
• They do better at capturing synonymy

Different methods for getting dense vectors:
• Singular value decomposition (SVD)
• word2vec and friends: “learn” the vectors!
NLP Pipeline- Word Embedding
Example: “Yesterday The President called a press conference.” The expressions POTUS, The President, and Joe Biden can all refer to the same entity, so their representations should be similar.
airplane =[0.7, 0.9, 0.9, 0.01, 0.35]
kite =[0.7, 0.9, 0.2, 0.01, 0.2]
| Dimension | airplane | kite |
| Sky | 0.7 | 0.7 |
| Fly | 0.9 | 0.9 |
| Transport | 0.9 | 0.2 |
| Animal | 0.01 | 0.01 |
| Eat | 0.35 | 0.2 |
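Cosine similarity over these dense vectors behaves sensibly: airplane and kite agree on the Sky/Fly dimensions and differ mainly on Transport, so their similarity is high but not 1:

```python
import math

# dense vectors from the table: [Sky, Fly, Transport, Animal, Eat]
airplane = [0.7, 0.9, 0.9, 0.01, 0.35]
kite     = [0.7, 0.9, 0.2, 0.01, 0.2]

dot = sum(a * k for a, k in zip(airplane, kite))
cos_sim = dot / (math.sqrt(sum(a * a for a in airplane)) *
                 math.sqrt(sum(k * k for k in kite)))
# cos_sim is roughly 0.88
```

Contrast this with one-hot vectors, where the same pair would score exactly 0.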
Word Embedding – A distributed representation
Distributional Representation
Distributional Representation: Illustration
Word embeddings: properties
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Analogy Test: e.g., vec(king) − vec(man) + vec(woman) ≈ vec(queen)
Word embeddings: relationships
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Sentence: I am learning Natural Language Processing from GFG.

CBOW predicts the target word from its context:
I am learning Natural _____?_____ Processing from GFG.

Skip-gram predicts the context words from the target word:
I am __?___ _____?_____ Language ___?___ ____?____ GFG.

Skip-gram architecture:
Input Layer: receives the target word.
Hidden Layer: contains the word embedding.
Output Layer: predicts context words within a certain window.
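The (target, context) training pairs that skip-gram learns from can be generated with a few lines of plain Python (window size 2 assumed; the model is then trained to predict each context word from its target):

```python
def skipgram_pairs(tokens, window=2):
    # for each position, pair the target with every word
    # at most `window` positions to its left or right
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sent = "i am learning natural language processing from gfg".split()
pairs = skipgram_pairs(sent)
# e.g. ("language", "natural") and ("language", "processing") are training pairs
```

Note the symmetry: CBOW would train on the same pairs with the roles of target and context reversed.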
Evaluating Word Embeddings
Extrinsic vs intrinsic evaluation
Extrinsic evaluation
Example: each word of “I don’t like this movie” is mapped to its embedding and the sequence is fed to an ML model:
I → [0.31, −0.28]
don’t → [0.01, −0.91]
like → [1.87, 0.03]
this → [−3.17, −0.18]
movie → [1.23, 1.59]
ML model → 👎
Intrinsic evaluation
• Evaluate on a specific/intermediate subtask
• Fast to compute
• Not clear if it really helps the downstream task
Intrinsic evaluation
Word similarity
Example dataset: wordsim-353
353 word pairs with human similarity judgments
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖)
Metric: Spearman rank correlation
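The Spearman metric compares only the *ranking* of the model's cosine similarities against the human judgments, not their absolute values. A minimal sketch with hypothetical scores for four word pairs (assumes no tied scores; the wordsim-353 protocol averages ranks over ties):

```python
def ranks(xs):
    # rank 1 for the largest value; assumes all values are distinct
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank differences
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# hypothetical similarity scores for four word pairs
human = [9.8, 8.5, 3.0, 1.2]      # human judgments (0-10 scale)
model = [0.91, 0.72, 0.35, 0.10]  # model cosine similarities
rho = spearman(human, model)
# rho == 1.0: the model ranks the pairs exactly as the humans do
```

A perfectly inverted ranking would instead give rho = −1, and unrelated rankings fall near 0.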