1 of 56

Features Engineering and Text Representation Learning

Natural Language Processing

Prof. Jebran Khan

2 of 56

Text Representation Learning

  • Machines do not understand text
    • We need a numeric representation
  • Unlike images (an RGB matrix), there is no obvious numeric representation for text
  • Representation learning is a set of techniques that learn features automatically
    • A transformation of the raw data into a representation that can be effectively exploited in machine learning tasks
  • Part of feature engineering/learning
  • The goal is to get rid of “hand-designed” features and representations

3 of 56

NLP Pipeline- Feature Engineering

  • The main task is to represent text as a numeric vector in such a way that an ML algorithm can understand the text attribute
  • There are two common approaches to text representation
    • Classical or Traditional Approach
      • In the traditional approach, we create a vocabulary of unique words, assign a unique id (integer value) to each word, and then replace each word of a sentence with its unique id
      • Each word of the vocabulary is treated as a feature, so when the vocabulary is large, the feature size becomes very large
      • One-Hot Encoding, Bag of Words (BoW), Bag of n-grams, TF-IDF, etc.
    • Neural Approach (Word embedding)
      • The classical techniques are not very good for complex tasks like text generation, text summarization, etc.
      • because they cannot capture the contextual meaning of words
      • Neural approaches try to incorporate the contextual meaning of words

4 of 56

NLP Pipeline- Feature Engineering

5 of 56

NLP Pipeline- Feature Engineering

  • Limitations (of one-hot encoding)
    • Sparsity
      • A single sentence creates a vector of size n × m, where n is the length of the sentence and m is the number of unique words in the document
      • The vast majority of values in the vector are zero
    • No fixed size
      • Each document has a different length, which creates vectors of different sizes
      • These cannot be fed to the model
    • Does not capture semantics
      • The core idea is to convert text into numbers while preserving the actual meaning of the sentence in those numbers
      • which one-hot encoding fails to do

6 of 56

NLP Pipeline- Feature Engineering

7 of 56

Bag of Words

Training data:

  m1: A word of text.
  m2: A word is a token.
  m3: Tokens and features.
  m4: Few features of text.

Tokens (after tokenization and lowercasing):

  m1: a, word, of, text
  m2: a, word, is, a, token
  m3: tokens, and, features
  m4: few, features, of, text

Bag of words (one feature per unique token):

  a, word, of, text, is, token, tokens, and, features, few

8 of 56

Bag of Words: Example

Training data:

  m1: A word of text.
  m2: A word is a token.
  m3: Tokens and features.
  m4: Few features of text.

Selected features (one per unique token): a, word, of, text, is, token, tokens, and, features, few

Training X:

        a   word  of  text  is  token  tokens  and  features  few
  m1    1    1    1    1    0    0      0       0      0       0
  m2    1    1    0    0    1    1      0       0      0       0
  m3    0    0    0    0    0    0      1       1      1       0
  m4    0    0    1    1    0    0      0       0      1       1

Test X:

  test1: Some features for a text example.

         a   word  of  text  is  token  tokens  and  features  few
  test1  1    0    0    1    0    0      0       0      1       0

  (“some”, “for”, and “example” are out of vocabulary and are ignored)

Use bag of words when you have a lot of data and can use many features.
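The matrix above can be reproduced with a small pure-Python bag-of-words vectorizer (a sketch of the idea; real pipelines typically use a library vectorizer):

```python
# Build a binary bag-of-words representation for the four training messages.
docs = ["A word of text.", "A word is a token.",
        "Tokens and features.", "Few features of text."]

def tokenize(text):
    return text.lower().replace(".", "").split()

# Vocabulary: one feature per unique token, in order of first appearance.
vocab = []
for d in docs:
    for t in tokenize(d):
        if t not in vocab:
            vocab.append(t)

def bow(text):
    """Binary bag-of-words vector; out-of-vocabulary tokens are ignored."""
    tokens = set(tokenize(text))
    return [1 if w in tokens else 0 for w in vocab]

train_X = [bow(d) for d in docs]
test_x = bow("Some features for a text example.")
print(vocab)
print(test_x)  # [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

The test vector matches the table: only “a”, “text”, and “features” are in the vocabulary, and the remaining words are silently dropped.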

9 of 56

NLP Pipeline- Feature Engineering

10 of 56

NLP Pipeline- Feature Engineering

Figure: BoW pipeline: normalize corpus → extract BoW features → record feature positions → represent the corpus as BoW feature vectors.

11 of 56

NLP Pipeline- Feature Engineering

  • Advantages
    • Simple and intuitive
      • The technique is easy to implement
    • Fixed-size vectors
      • Unlike one-hot encoding, it ignores new words and considers only the vocabulary words, creating vectors of a fixed size
  • Limitations
    • Out-of-vocabulary words
      • It keeps counts only of vocabulary words, so if a new word appears in a sentence it is simply ignored
    • Sparsity
      • With a large vocabulary, and documents containing only a few of its terms, it creates sparse arrays
    • Ignoring word order is a problem
      • It makes it difficult to estimate the semantics of the document

12 of 56

NLP Pipeline- Feature Engineering

  • N-grams
    • Instead of using single tokens as features, use sequences of N tokens
    • This distinguishes contexts: “down the bank” vs “from the bank”

Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”

Bigram features and the resulting vector for Message 2:

  Bigram          Message 2
  Nah I               0
  I don’t             0
  don’t think         0
  think he            0
  he goes             0
  goes to             0
  to usf              0
  Text FA             1
  FA to               1
  to 87121            1
  87121 to            1
  to receive          1
  receive entry       1

Use when you have a LOT of data, can use MANY features
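A short sketch of bigram extraction and the resulting bag-of-bigrams vector (lowercasing and whitespace tokenization are illustrative choices):

```python
# Extract word n-grams (here n = 2) from the two example messages.
def ngrams(text, n=2):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

m1 = "Nah I don't think he goes to usf"
m2 = "Text FA to 87121 to receive entry"

# Feature list: every unique bigram, in order of first appearance.
features = []
for msg in (m1, m2):
    for g in ngrams(msg):
        if g not in features:
            features.append(g)

# Binary bag-of-bigrams vector for Message 2.
present = set(ngrams(m2))
vec2 = [1 if f in present else 0 for f in features]
print(vec2)  # 0s for Message-1 bigrams, 1s for Message-2 bigrams
```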

13 of 56

N-Grams: Characters

  • Instead of using sequences of tokens, use sequences of characters

Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”

Character bigram features and the resulting vector for Message 2 (abridged):

  Bigram        Message 2
  Na                0
  ah                0
  h <space>         0
  <space> I         0
  I <space>         0
  <space> d         0
  do                0
  ...              ...
  <space> e         1
  en                1
  nt                1
  tr                1
  ry                1

Helps with out-of-dictionary words & spelling errors

Fixed number of features for a given N (but can be very large)
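Character n-gram extraction is a one-liner; this sketch reproduces the first few bigrams shown above (spaces are kept as characters):

```python
# Extract character bigrams (n = 2), including spaces, from a message.
def char_ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

m1 = "Nah I don't think he goes to usf"
bigrams = char_ngrams(m1)
print(bigrams[:6])  # ['Na', 'ah', 'h ', ' I', 'I ', ' d']
```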

14 of 56

NLP Pipeline- Feature Engineering

  • Bag of n-grams
    • In Bag of Words, there is no consideration of phrases or word order
    • Bag of n-grams tries to solve this problem by breaking text into chunks of n contiguous words

15 of 56

NLP Pipeline- Feature Engineering

Figure: Bag-of-n-grams pipeline: normalize corpus → extract bag-of-n-grams feature vectors.

16 of 56

NLP Pipeline- Feature Engineering

  • Advantages
    • Able to capture some semantic meaning of the sentence
      • Bigrams and trigrams capture word relationships by exploiting word order within sentences
    • Intuitive and easy to implement
      • Implementing n-grams is straightforward with a small modification of Bag of Words
  • Disadvantages
    • Increased vocabulary size
      • As we move from unigrams to n-grams, the dimension of the vectors (the vocabulary) increases
      • so computation and prediction take more time
    • No solution for out-of-vocabulary terms
      • We have no option other than ignoring new words in a new sentence

17 of 56

NLP Pipeline- Feature Engineering

  • TF-IDF (Term Frequency – Inverse Document Frequency)
    • In the previous techniques, each word is treated equally
    • TF-IDF tries to quantify the importance of a given word relative to the other words in the corpus
    • Mainly used in information retrieval
    • Term Frequency (TF):
      • TF measures how often a word occurs in a given document
      • It is the ratio of the number of occurrences of a term (t) in a given document (d) to the total number of terms in that document
    • Inverse Document Frequency (IDF):
      • IDF measures the importance of the word across the corpus
      • It reduces the weight of the most frequent terms and increases the weight of rare terms
    • The TF-IDF score is the product of TF and IDF
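The definitions above translate directly into code. A minimal sketch (the natural log is assumed here; the log base varies between implementations, and the toy corpus is invented):

```python
import math

# Term frequency: occurrences of t in d / total terms in d.
def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# Inverse document frequency: log(#docs / #docs containing t).
def idf(term, corpus_tokens):
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

# TF-IDF score: the product of the two.
def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
# "the" occurs in every document, so its IDF (and TF-IDF) is 0.
print(tf_idf("the", corpus[0], corpus))            # 0.0
print(round(tf_idf("cat", corpus[0], corpus), 3))  # 0.231
```

Note how the ubiquitous word “the” is zeroed out while the rarer “cat” keeps a positive weight, which is exactly the re-weighting the bullets describe.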

18 of 56

TF-IDF: Term Frequency – Inverse Document Frequency

  • Instead of using binary: ContainsWord(<term>)
  • Use a numeric importance score, TF-IDF:

TermFrequency(<term>, <document>) =
  % of the words in <document> that are <term>

InverseDocumentFrequency(<term>, <documents>) =
  log ( # documents / # documents that contain <term> )

Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”

BOW and TF-IDF vectors for Message 2:

  Term      BOW   TF-IDF
  Nah        0     0
  I          0     0
  don't      0     0
  think      0     0
  he         0     0
  goes       0     0
  to         1     0
  usf        0     0
  Text       1     .099
  FA         1     .099
  87121      1     .099
  receive    1     .099
  entry      1     .099

TF measures importance to the document; IDF measures novelty across the corpus. (“to” appears in both messages, so its IDF, and hence its TF-IDF, is 0.)
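The .099 entries in the table can be verified directly (assuming the natural log, which reproduces the slide's numbers):

```python
import math

# Reproduce the TF-IDF values for Message 2 from the two-message corpus.
m1 = "Nah I don't think he goes to usf".split()
m2 = "Text FA to 87121 to receive entry".split()
corpus = [m1, m2]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

# "Text" appears once among the 7 tokens of Message 2 and in 1 of 2 docs:
# (1/7) * ln(2/1) ~= 0.099
print(round(tf_idf("Text", m2), 3))  # 0.099
# "to" appears in both messages, so its IDF is ln(2/2) = 0:
print(tf_idf("to", m2))  # 0.0
```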

19 of 56

NLP Pipeline- Feature Engineering

20 of 56

NLP Pipeline- Feature Engineering

Figure: TF-IDF pipeline: normalize corpus → compute TF-IDF feature vectors.

21 of 56

NLP Pipeline- Feature Engineering

Figure: collect words → create bag of words for term frequency.

22 of 56

NLP Pipeline- Feature Engineering

Figure: create document frequency matrix → calculate IDF.

23 of 56

NLP Pipeline- Feature Engineering

Figure: compute TF-IDF → compute normalized TF-IDF.

24 of 56

NLP Pipeline- Feature Engineering

TF-IDF for new example

25 of 56

NLP Pipeline- Feature Engineering

  • Advantages
    • Simple implementation, easy to understand & interpret
    • Builds on CountVectorizer, penalizing words that are frequent across the corpus
    • IDF helps reduce noise in the matrix
  • Disadvantages
    • Positional information of words is still not captured in this representation
    • TF-IDF is highly corpus-dependent

26 of 56

How to represent words?

N-gram language models:

  “It is 76 F and ___ .”
  P(w | it is 76 F and) = [0.0001, 0.1, 0, 0, 0.002, …, 0.3, …, 0]
  (low probability for implausible words like “red”, high for plausible ones like “sunny”)

Text classification:

  “I like this movie.”        →  w(1) = [0, 1, 0, 0, 0, …, 1, …, 1]
  “I don’t like this movie.”  →  w(2) = [0, 1, 0, 1, 0, …, 1, …, 1]  (the extra 1 marks “don’t”)

  P(y = 1 | x) = σ(θ·w + b)

27 of 56

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel — a localist representation

Words can be represented by one-hot vectors (one 1, the rest 0’s):

hotel = [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
motel = [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000)

Challenge: How to compute the similarity of two words? Any two one-hot vectors are orthogonal, so they give no notion of similarity.

28 of 56

Representing words by their context

Distributional hypothesis: words that occur in similar contexts tend to have similar meanings

“You shall know a word by the company it keeps” (J. R. Firth, 1957)

One of the most successful ideas of modern statistical NLP!

The context words surrounding a word such as “banking” will represent it.

29 of 56

Distributional hypothesis

“tejuino”

C1: A bottle of ___ is on the table.
C2: Everybody likes ___.
C3: Don’t have ___ before you drive.
C4: We make ___ out of corn.

30 of 56

Distributional hypothesis

             C1   C2   C3   C4
  tejuino     1    1    1    1
  loud        0    0    0    0
  motor-oil   1    0    0    0
  tortillas   0    1    0    1
  choices     0    1    0    0
  wine        1    1    1    0

C1: A bottle of ___ is on the table.
C2: Everybody likes ___.
C3: Don’t have ___ before you drive.
C4: We make ___ out of corn.

“words that occur in similar contexts tend to have similar meanings”
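The distributional intuition can be checked numerically: treat each row of the context table as a vector and compare rows with cosine similarity. A sketch using the table's values:

```python
import math

# Context vectors taken from the table above; similar rows => similar meanings.
vectors = {
    "tejuino":   [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor-oil": [1, 0, 0, 0],
    "tortillas": [0, 1, 0, 1],
    "choices":   [0, 1, 0, 0],
    "wine":      [1, 1, 1, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

t = vectors["tejuino"]
sims = {w: cosine(t, v) for w, v in vectors.items() if w != "tejuino"}
best = max(sims, key=sims.get)
print(best, round(sims[best], 3))  # wine 0.866
```

“wine” comes out as the closest match to “tejuino”: both are drinks that fit the same contexts, which is exactly what the distributional hypothesis predicts.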

31 of 56

Words as vectors

  • We’ll build a new model of meaning focusing on similarity
    • Each word is a vector
    • Similar words are “nearby in space”

  • A first solution: we can just use context vectors to represent the meaning of words!
    • word-word co-occurrence matrix:

32 of 56

Words as vectors

33 of 56

Sparse vs dense vectors

  • Still, the vectors we get from a word-word co-occurrence matrix are sparse (mostly 0’s) & long (vocabulary-sized)

  • Alternative: we want to represent words as short (50-300 dimensional) & dense (real-valued) vectors
    • The focus of this lecture
    • The basis of modern NLP systems

34 of 56

Dense vectors

35 of 56

Why dense vectors?

Short vectors are easier to use as features in ML systems

Dense vectors may generalize better than storing explicit counts

They do better at capturing synonymy
  • w1 co-occurs with “car”, w2 co-occurs with “automobile”

Different methods for getting dense vectors:
  • Singular value decomposition (SVD)
  • word2vec and friends: “learn” the vectors!

36 of 56

NLP Pipeline- Word Embedding

  • Idea: learn an embedding from words into vectors
  • Word embeddings depend on a notion of word similarity
    • Similarity is computed using cosine similarity
  • A very useful definition is paradigmatic similarity:
    • Similar words occur in similar contexts; they are exchangeable
    • Example: in “Yesterday the President called a press conference”, “the President” is exchangeable with “POTUS” (President of the United States) or “Joe Biden”

37 of 56

  • Hope to have similar words nearby

38 of 56

NLP Pipeline- Word Embedding

  • Neural Approach (Word embedding)
    • Each word is represented by real values, as a vector of fixed dimension

    • Each value in the vector measures some feature or quality of the word
      • which is decided by the model after training on text data
    • These dimensions are not interpretable by humans; the labels are just for illustration
    • We can understand this with the help of the given table
  • Now, the problem is how to obtain these word embedding vectors
    • Train our own embedding layer
      • CBOW (Continuous Bag of Words), SkipGram
    • Pre-trained word embeddings
      • These models are trained on a very large corpus
      • Word2vec, GloVe, fastText

airplane = [0.7, 0.9, 0.9, 0.01, 0.35]
kite     = [0.7, 0.9, 0.2, 0.01, 0.2]

              airplane   kite
  Sky           0.7      0.7
  Fly           0.9      0.9
  Transport     0.9      0.2
  Animal        0.01     0.01
  Eat           0.35     0.2
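With the illustrative vectors above, cosine similarity makes the intuition concrete: airplane and kite agree on “sky” and “fly” but differ on “transport”, so they are similar without being identical. A sketch:

```python
import math

# Cosine similarity between the illustrative embedding vectors above.
airplane = [0.7, 0.9, 0.9, 0.01, 0.35]
kite     = [0.7, 0.9, 0.2, 0.01, 0.2]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

sim = cosine(airplane, kite)
print(round(sim, 3))  # 0.883: high, but not 1.0
```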

39 of 56

NLP Pipeline- Word Embedding

  • Traditional Methods
  • Either use one-hot encoding:
    • Each word in the vocabulary is represented by one bit position in a HUGE vector.
    • For example, with a vocabulary of 10,000 words, if “Hello” is the 4th word in the dictionary, it is represented by: 0 0 0 1 0 0 . . . . . . . 0 0 0
  • Or use a document representation:
    • Each word in the vocabulary is represented by its presence in documents.
    • For example, with a corpus of 1M documents, if “Hello” occurs in the 1st, 3rd and 5th documents only, it is represented by: 1 0 1 0 1 0 . . . . . . . 0 0 0
  • Context information is not utilized.
  • Word Embeddings
  • Store each word as a point in space, represented by a dense vector with a fixed number of dimensions (generally 300).
  • Unsupervised, built just by reading a huge corpus.
  • For example, “Hello” might be represented as: [0.4, -0.11, 0.55, 0.3 . . . 0.1, 0.02].
  • Dimensions are basically projections along different axes, more of a mathematical concept.

40 of 56

Word Embedding – A distributed representation

  • Distributional representation – word embedding
    • Any word wi in the corpus is given a distributional representation by an embedding
      • wi ∈ R^d, i.e., a d-dimensional vector that is learnt
  • For example:

41 of 56

Distributional Representation

  • Take a vector with several hundred dimensions (say 1000).
  • Each word is represented by a distribution of weights across those elements.
  • So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.

42 of 56

Distributional Representation: Illustration

  • If we label the dimensions in a hypothetical word vector (there are no such pre-assigned labels in the algorithm of course), it might look a bit like this:

43 of 56

Word embeddings: properties

  • Need to have a function W(word) that returns a vector encoding that word.

  • Similarity of words corresponds to nearby vectors.
    • Director – chairman, scratched – scraped

  • Relationships between words correspond to difference between vectors.
    • Big – bigger, small – smaller
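The offset idea (“relationships correspond to vector differences”) can be illustrated with hypothetical 2-D vectors; the values below are invented so that the big→bigger and small→smaller pairs share the same offset:

```python
# Toy 2-D embeddings (invented for illustration): the "comparative"
# relation corresponds to the same vector offset in both pairs.
emb = {
    "big":     (1.0, 1.0),
    "bigger":  (1.0, 2.0),
    "small":   (3.0, 1.0),
    "smaller": (3.0, 2.0),
    "cat":     (0.0, 5.0),   # unrelated distractor word
}

def add(u, v): return (u[0] + v[0], u[1] + v[1])
def sub(u, v): return (u[0] - v[0], u[1] - v[1])

# Analogy: big : bigger :: small : ?
predicted = add(emb["small"], sub(emb["bigger"], emb["big"]))

# Nearest word to the predicted point (excluding the query words):
def dist2(u, v):
    return (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2

answer = min((w for w in emb if w not in ("big", "bigger", "small")),
             key=lambda w: dist2(emb[w], predicted))
print(answer)  # smaller
```

Real embeddings are high-dimensional and the analogy only holds approximately, but the mechanism (add the offset, then search for the nearest vector) is the same.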

44 of 56

Word embeddings: properties

  • Relationships between words correspond to difference between vectors.

http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

45 of 56

NLP Pipeline- Word Embedding

Analogy Test

46 of 56

NLP Pipeline- Word Embedding

47 of 56

Word embeddings: relationships

  • Hope to preserve some language structure (relationships between words).

http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

48 of 56

NLP Pipeline- Word Embedding

  • Instead of capturing co-occurrence counts directly, predict the surrounding words of every word
    • Two variations: CBOW and Skip-gram

49 of 56

NLP Pipeline- Word Embedding

  • CBOW (Continuous Bag of Words)
    • CBOW is a Word2Vec architecture that predicts a target word based on its context
    • Unlike traditional bag-of-words models, CBOW takes into account a ‘continuous’ window of context words
    • We predict the center word from a given set of context words, i.e., the words before and after the center word
    • The model averages or sums the context word vectors and uses the result to predict the target word

I am learning Natural Language Processing from GFG.

I am learning Natural _____?_____ Processing from GFG.

  • The CBOW neural network architecture includes input, projection, and output layers.
  • The input layer receives the context words, which are then projected and averaged in the projection layer.
  • The output layer is a softmax layer predicting the probability of the target word.
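The training data for CBOW consists of (context window, target) pairs. A sketch of how such pairs are generated (the window size of 2 and whitespace tokenization are illustrative choices):

```python
# Generate (context, target) training pairs for CBOW with window size 2.
sentence = "I am learning Natural Language Processing from GFG".split()

def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Context: up to `window` words on each side of the target.
        context = [tokens[j] for j in range(max(0, i - window),
                                            min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

pairs = cbow_pairs(sentence)
# The pair whose target is "Language", matching the blank in the slide:
print([p for p in pairs if p[1] == "Language"][0])
```

The model would then average the embeddings of the context words and train a softmax layer to predict the target.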

50 of 56

NLP Pipeline- Word Embedding

  • SkipGram
    • The Skip-Gram model is designed to predict the context given a target word
    • It produces a distributed representation of words in which similar words have similar encodings
    • Particularly useful for learning representations of rare words in the corpus

I am learning Natural Language Processing from GFG.

I am __?___ _____?_____ Language ___?___ ____?____ GFG.

Input Layer: Receives the target word.
Hidden Layer: Contains the word embedding.
Output Layer: Predicts context words within a certain window.
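Skip-Gram inverts the CBOW pairing: each training example is a (target, one context word) pair. A sketch with the same illustrative window size of 2:

```python
# Generate (target, context) training pairs for Skip-Gram with window size 2.
sentence = "I am learning Natural Language Processing from GFG".split()

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(sentence)
# The target "Natural" must predict each word in its window:
print([c for t, c in pairs if t == "Natural"])
```

Note the asymmetry with CBOW: one center word now yields several training examples, which is part of why Skip-Gram tends to work better for rare words.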

51 of 56

NLP Pipeline- Word Embedding

  • Word2vec by Google
    • Word2Vec is a group of models that produce word embeddings, developed by a team of researchers at Google
    • It uses neural networks to learn word associations from a large corpus of text
    • Two main architectures: Continuous Bag of Words (CBOW) and Skip-Gram
    • CBOW predicts target words (e.g., ‘muffin’) from source context words (‘blueberry’, ‘eat’)
    • Skip-Gram does the inverse, predicting source context words from the target words

52 of 56

NLP Pipeline- Word Embedding

  • GloVe by Stanford
    • GloVe stands for Global Vectors.
    • It's an unsupervised learning algorithm for obtaining vector representations for words, developed by Stanford researchers
    • It is based on matrix factorization techniques on the word-context matrix
    • It combines the benefits of Word2Vec and latent semantic analysis (LSA) by looking at global word-word co-occurrence
    • The model is trained to learn vectors such that their dot product equals the logarithm of the words' probability of co-occurrence

53 of 56

NLP Pipeline- Word Embedding

  • FastText - Advanced Word Representations by Facebook
    • FastText is an extension of Word2Vec proposed by Facebook Research
    • It treats each word as composed of character n-grams
    • For example, the word "apple" with n=3 would be represented as ["<ap", "app", "ppl", "ple", "le>"] plus the special sequence "<apple>" to denote the whole word.
    • This allows FastText to generate better word embeddings for rare words, or even for words not seen during training
    • It can also help understand suffixes and prefixes and is thus stronger for morphologically rich languages
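The subword decomposition described above is easy to sketch: pad the word with boundary markers and slide a window of size n over it (the padding convention follows the "apple" example on this slide):

```python
# Character n-gram decomposition in the fastText style: pad the word with
# boundary markers '<' and '>', then take all n-grams of the padded word.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("apple")
print(grams)  # ['<ap', 'app', 'ppl', 'ple', 'le>']
```

A word's embedding is then the sum of its n-gram embeddings, which is why an unseen word can still get a reasonable vector from the n-grams it shares with known words.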

54 of 56

Evaluating Word Embeddings

55 of 56

Extrinsic vs intrinsic evaluation

Extrinsic evaluation

  • Let’s plug these word embeddings into a real NLP system and see whether this improves performance
  • Could take a long time but still the most important evaluation metric

(Figure: a sentiment example, “I don’t like this movie”, where each word’s embedding is fed into an ML model that predicts 👎.)

Intrinsic evaluation

  • Evaluate on a specific/intermediate subtask
  • Fast to compute
  • Not clear if it really helps the downstream task

56 of 56

Intrinsic evaluation

Word similarity

Example dataset: wordsim-353
  • 353 pairs of words with human judgements
  • http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖)

Metric: Spearman rank correlation between the human scores and the model’s cosine similarities
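A sketch of the intrinsic evaluation metric, assuming no tied scores (the human judgements and model similarities below are invented toy values, not wordsim-353 data):

```python
# Spearman rank correlation between human similarity judgements and
# model cosine similarities.
def ranks(xs):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (no ties)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [9.8, 8.5, 6.3, 3.1, 1.0]   # human scores for 5 word pairs (toy)
model = [0.9, 0.7, 0.8, 0.2, 0.1]   # model cosine similarities (toy)
print(spearman(human, model))  # one pair of ranks is swapped, so rho < 1
```

Because only the ranks matter, the metric rewards embeddings that order word pairs the way humans do, regardless of the absolute similarity values.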