1 of 56

Features Engineering and Text Representation Learning

Natural Language Processing

Prof. Jebran Khan

2 of 56

Text Representation Learning

  • Machines do not understand text
    • We need a numeric representation
  • Unlike images (an RGB matrix), there is no obvious numeric representation for text
  • Representation learning is a set of techniques that learn features automatically
    • A transformation of the raw data into a representation that can be effectively exploited in machine learning tasks
  • Part of feature engineering/learning
  • The goal is to get rid of “hand-designed” features and representations

3 of 56

NLP Pipeline- Feature Engineering

  • The main task is to represent text as a numeric vector in such a way that an ML algorithm can understand the text attribute
  • There are two common approaches to text representation
    • Classical or Traditional Approach
      • In the traditional approach, we create a vocabulary of unique words, assign a unique id (integer value) to each word, and then replace each word of a sentence with its unique id
      • Each word of the vocabulary is treated as a feature, so when the vocabulary is large, the feature size becomes very large
      • One-Hot Encoding, Bag of Words (BoW), Bag of n-grams, TF-IDF, etc.
    • Neural Approach (Word embedding)
      • The classical techniques are not very good for complex tasks like text generation, text summarization, etc.
      • because they cannot capture the contextual meaning of words
      • Neural approaches try to incorporate the contextual meaning of words

4 of 56

NLP Pipeline- Feature Engineering

5 of 56

NLP Pipeline- Feature Engineering

  • Limitations (of one-hot encoding)
    • Sparsity
      • A single sentence creates a vector of size n × m, where n is the length of the sentence and m is the number of unique words in the document
      • The vast majority of values in the vector are zero
    • No fixed size
      • Each document has a different length, which creates vectors of different sizes
      • These cannot be fed to the model
    • Does not capture semantics
      • The core idea is to convert text into numbers while preserving the actual meaning of the sentence in those numbers
      • which one-hot encoding fails to do

6 of 56

NLP Pipeline- Feature Engineering

7 of 56

Bag of Words

Training data:

  m1: A word of text.
  m2: A word is a token.
  m3: Tokens and features.
  m4: Few features of text.

Tokens (after tokenization and lowercasing):

  m1: a, word, of, text
  m2: a, word, is, a, token
  m3: tokens, and, features
  m4: few, features, of, text

Bag of words (one feature per unique token):

  a, word, of, text, is, token, tokens, and, features, few

8 of 56

Bag of Words: Example

Training data:

  m1: A word of text.
  m2: A word is a token.
  m3: Tokens and features.
  m4: Few features of text.

Selected features (one per unique token): a, word, of, text, is, token, tokens, and, features, few

Training X:

        a   word  of  text  is  token  tokens  and  features  few
  m1    1    1    1    1    0    0      0       0      0       0
  m2    1    1    0    0    1    1      0       0      0       0
  m3    0    0    0    0    0    0      1       1      1       0
  m4    0    0    1    1    0    0      0       0      1       1

Test X:

  test1: Some features for a text example.

         a   word  of  text  is  token  tokens  and  features  few
  test1  1    0    0    1    0    0      0       0      1       0

  (“some”, “for”, and “example” are out of vocabulary and are ignored)

Use bag of words when you have a lot of data and can use many features.
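The matrix above can be reproduced with a small pure-Python bag-of-words vectorizer (a sketch of the idea; real pipelines typically use a library vectorizer):

```python
# Build a binary bag-of-words representation for the four training messages.
docs = ["A word of text.", "A word is a token.",
        "Tokens and features.", "Few features of text."]

def tokenize(text):
    return text.lower().replace(".", "").split()

# Vocabulary: one feature per unique token, in order of first appearance.
vocab = []
for d in docs:
    for t in tokenize(d):
        if t not in vocab:
            vocab.append(t)

def bow(text):
    """Binary bag-of-words vector; out-of-vocabulary tokens are ignored."""
    tokens = set(tokenize(text))
    return [1 if w in tokens else 0 for w in vocab]

train_X = [bow(d) for d in docs]
test_x = bow("Some features for a text example.")
print(vocab)
print(test_x)  # [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

The test vector matches the table: only “a”, “text”, and “features” are in the vocabulary, and the remaining words are silently dropped.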

9 of 56

NLP Pipeline- Feature Engineering

10 of 56

NLP Pipeline- Feature Engineering

Figure: BoW pipeline: normalize corpus → extract BoW features → record feature positions → represent the corpus as BoW feature vectors.

11 of 56

NLP Pipeline- Feature Engineering

  • Advantages
    • Simple and intuitive
      • The technique is easy to implement
    • Fixed-size vectors
      • Unlike one-hot encoding, it ignores new words and considers only the vocabulary words, creating vectors of a fixed size
  • Limitations
    • Out-of-vocabulary words
      • It keeps counts only of vocabulary words, so if a new word appears in a sentence it is simply ignored
    • Sparsity
      • With a large vocabulary, and documents containing only a few of its terms, it creates sparse arrays
    • Ignoring word order is a problem
      • It makes it difficult to estimate the semantics of the document

12 of 56

NLP Pipeline- Feature Engineering

  • N-grams
    • Instead of using single tokens as features, use sequences of N tokens
    • This distinguishes contexts: “down the bank” vs “from the bank”

Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”

Bigram features and the resulting vector for Message 2:

  Bigram          Message 2
  Nah I               0
  I don’t             0
  don’t think         0
  think he            0
  he goes             0
  goes to             0
  to usf              0
  Text FA             1
  FA to               1
  to 87121            1
  87121 to            1
  to receive          1
  receive entry       1

Use when you have a LOT of data, can use MANY features
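A short sketch of bigram extraction and the resulting bag-of-bigrams vector (lowercasing and whitespace tokenization are illustrative choices):

```python
# Extract word n-grams (here n = 2) from the two example messages.
def ngrams(text, n=2):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

m1 = "Nah I don't think he goes to usf"
m2 = "Text FA to 87121 to receive entry"

# Feature list: every unique bigram, in order of first appearance.
features = []
for msg in (m1, m2):
    for g in ngrams(msg):
        if g not in features:
            features.append(g)

# Binary bag-of-bigrams vector for Message 2.
present = set(ngrams(m2))
vec2 = [1 if f in present else 0 for f in features]
print(vec2)  # 0s for Message-1 bigrams, 1s for Message-2 bigrams
```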

13 of 56

N-Grams: Characters

  • Instead of using sequences of tokens, use sequences of characters

Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”

Character bigram features and the resulting vector for Message 2 (abridged):

  Bigram        Message 2
  Na                0
  ah                0
  h <space>         0
  <space> I         0
  I <space>         0
  <space> d         0
  do                0
  ...              ...
  <space> e         1
  en                1
  nt                1
  tr                1
  ry                1

Helps with out-of-dictionary words & spelling errors

Fixed number of features for a given N (but can be very large)
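Character n-gram extraction is a one-liner; this sketch reproduces the first few bigrams shown above (spaces are kept as characters):

```python
# Extract character bigrams (n = 2), including spaces, from a message.
def char_ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

m1 = "Nah I don't think he goes to usf"
bigrams = char_ngrams(m1)
print(bigrams[:6])  # ['Na', 'ah', 'h ', ' I', 'I ', ' d']
```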

14 of 56

NLP Pipeline- Feature Engineering

  • Bag of n-grams
    • In Bag of Words, there is no consideration of phrases or word order
    • Bag of n-grams tries to solve this problem by breaking text into chunks of n contiguous words

15 of 56

NLP Pipeline- Feature Engineering

Figure: Bag-of-n-grams pipeline: normalize corpus → extract bag-of-n-grams feature vectors.

16 of 56

NLP Pipeline- Feature Engineering

  • Advantages
    • Able to capture some semantic meaning of the sentence
      • Bigrams and trigrams capture word relationships by exploiting word order within sentences
    • Intuitive and easy to implement
      • Implementing n-grams is straightforward with a small modification of Bag of Words
  • Disadvantages
    • Increased vocabulary size
      • As we move from unigrams to n-grams, the dimension of the vectors (the vocabulary) increases
      • so computation and prediction take more time
    • No solution for out-of-vocabulary terms
      • We have no option other than ignoring new words in a new sentence

17 of 56

NLP Pipeline- Feature Engineering

  • TF-IDF (Term Frequency – Inverse Document Frequency)
    • In the previous techniques, each word is treated equally
    • TF-IDF tries to quantify the importance of a given word relative to the other words in the corpus
    • Mainly used in information retrieval
    • Term Frequency (TF):
      • TF measures how often a word occurs in a given document
      • It is the ratio of the number of occurrences of a term (t) in a given document (d) to the total number of terms in that document
    • Inverse Document Frequency (IDF):
      • IDF measures the importance of the word across the corpus
      • It reduces the weight of the most frequent terms and increases the weight of rare terms
    • The TF-IDF score is the product of TF and IDF
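The definitions above translate directly into code. A minimal sketch (the natural log is assumed here; the log base varies between implementations, and the toy corpus is invented):

```python
import math

# Term frequency: occurrences of t in d / total terms in d.
def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# Inverse document frequency: log(#docs / #docs containing t).
def idf(term, corpus_tokens):
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

# TF-IDF score: the product of the two.
def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
# "the" occurs in every document, so its IDF (and TF-IDF) is 0.
print(tf_idf("the", corpus[0], corpus))            # 0.0
print(round(tf_idf("cat", corpus[0], corpus), 3))  # 0.231
```

Note how the ubiquitous word “the” is zeroed out while the rarer “cat” keeps a positive weight, which is exactly the re-weighting the bullets describe.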

18 of 56

TF-IDF: Term Frequency – Inverse Document Frequency

  • Instead of using binary: ContainsWord(<term>)
  • Use a numeric importance score, TF-IDF:

TermFrequency(<term>, <document>) =
  % of the words in <document> that are <term>

InverseDocumentFrequency(<term>, <documents>) =
  log ( # documents / # documents that contain <term> )

Message 1: “Nah I don't think he goes to usf”
Message 2: “Text FA to 87121 to receive entry”

BOW and TF-IDF vectors for Message 2:

  Term      BOW   TF-IDF
  Nah        0     0
  I          0     0
  don't      0     0
  think      0     0
  he         0     0
  goes       0     0
  to         1     0
  usf        0     0
  Text       1     .099
  FA         1     .099
  87121      1     .099
  receive    1     .099
  entry      1     .099

TF measures importance to the document; IDF measures novelty across the corpus. (“to” appears in both messages, so its IDF, and hence its TF-IDF, is 0.)
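The .099 entries in the table can be verified directly (assuming the natural log, which reproduces the slide's numbers):

```python
import math

# Reproduce the TF-IDF values for Message 2 from the two-message corpus.
m1 = "Nah I don't think he goes to usf".split()
m2 = "Text FA to 87121 to receive entry".split()
corpus = [m1, m2]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

# "Text" appears once among the 7 tokens of Message 2 and in 1 of 2 docs:
# (1/7) * ln(2/1) ~= 0.099
print(round(tf_idf("Text", m2), 3))  # 0.099
# "to" appears in both messages, so its IDF is ln(2/2) = 0:
print(tf_idf("to", m2))  # 0.0
```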

19 of 56

NLP Pipeline- Feature Engineering

20 of 56

NLP Pipeline- Feature Engineering

Figure: TF-IDF pipeline: normalize corpus → compute TF-IDF feature vectors.

21 of 56

NLP Pipeline- Feature Engineering

Figure: collect words → create bag of words for term frequency.

22 of 56

NLP Pipeline- Feature Engineering

Figure: create document frequency matrix → calculate IDF.

23 of 56

NLP Pipeline- Feature Engineering

Figure: compute TF-IDF → compute normalized TF-IDF.

24 of 56

NLP Pipeline- Feature Engineering

TF-IDF for new example

25 of 56

NLP Pipeline- Feature Engineering

  • Advantages
    • Simple implementation, easy to understand & interpret
    • Builds on CountVectorizer, penalizing words that are frequent across the corpus
    • IDF helps reduce noise in the matrix
  • Disadvantages
    • Positional information of words is still not captured in this representation
    • TF-IDF is highly corpus-dependent

26 of 56

How to represent words?

N-gram language models:

  “It is 76 F and ___ .”
  P(w | it is 76 F and) = [0.0001, 0.1, 0, 0, 0.002, …, 0.3, …, 0]
  (low probability for implausible words like “red”, high for plausible ones like “sunny”)

Text classification:

  “I like this movie.”        →  w(1) = [0, 1, 0, 0, 0, …, 1, …, 1]
  “I don’t like this movie.”  →  w(2) = [0, 1, 0, 1, 0, …, 1, …, 1]  (the extra 1 marks “don’t”)

  P(y = 1 | x) = σ(θ·w + b)

27 of 56

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel — a localist representation

Words can be represented by one-hot vectors (one 1, the rest 0’s):

hotel = [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
motel = [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g., 500,000)

Challenge: How to compute the similarity of two words? Any two one-hot vectors are orthogonal, so they give no notion of similarity.

28 of 56

Representing words by their context

Distributional hypothesis: words that occur in similar contexts tend to have similar meanings

“You shall know a word by the company it keeps” (J. R. Firth, 1957)

One of the most successful ideas of modern statistical NLP!

The context words surrounding a word such as “banking” will represent it.

29 of 56

Distributional hypothesis

“tejuino”

C1: A bottle of ___ is on the table.
C2: Everybody likes ___.
C3: Don’t have ___ before you drive.
C4: We make ___ out of corn.

30 of 56

Distributional hypothesis

             C1   C2   C3   C4
  tejuino     1    1    1    1
  loud        0    0    0    0
  motor-oil   1    0    0    0
  tortillas   0    1    0    1
  choices     0    1    0    0
  wine        1    1    1    0

C1: A bottle of ___ is on the table.
C2: Everybody likes ___.
C3: Don’t have ___ before you drive.
C4: We make ___ out of corn.

“words that occur in similar contexts tend to have similar meanings”
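The distributional intuition can be checked numerically: treat each row of the context table as a vector and compare rows with cosine similarity. A sketch using the table's values:

```python
import math

# Context vectors taken from the table above; similar rows => similar meanings.
vectors = {
    "tejuino":   [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor-oil": [1, 0, 0, 0],
    "tortillas": [0, 1, 0, 1],
    "choices":   [0, 1, 0, 0],
    "wine":      [1, 1, 1, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

t = vectors["tejuino"]
sims = {w: cosine(t, v) for w, v in vectors.items() if w != "tejuino"}
best = max(sims, key=sims.get)
print(best, round(sims[best], 3))  # wine 0.866
```

“wine” comes out as the closest match to “tejuino”: both are drinks that fit the same contexts, which is exactly what the distributional hypothesis predicts.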

31 of 56

Words as vectors

  • We’ll build a new model of meaning focusing on similarity
    • Each word is a vector
    • Similar words are “nearby in space”

  • A first solution: we can just use context vectors to represent the meaning of words!
    • word-word co-occurrence matrix:

32 of 56

Words as vectors

33 of 56

Sparse vs dense vectors

  • Still, the vectors we get from a word-word co-occurrence matrix are sparse (mostly 0’s) & long (vocabulary-sized)

  • Alternative: we want to represent words as short (50-300 dimensional) & dense (real-valued) vectors
    • The focus of this lecture
    • The basis of modern NLP systems

34 of 56

Dense vectors

35 of 56

Why dense vectors?

Short vectors are easier to use as features in ML systems

Dense vectors may generalize better than storing explicit counts

They do better at capturing synonymy
  • w1 co-occurs with “car”, w2 co-occurs with “automobile”

Different methods for getting dense vectors:
  • Singular value decomposition (SVD)
  • word2vec and friends: “learn” the vectors!

36 of 56

NLP Pipeline- Word Embedding

  • Idea: learn an embedding from words into vectors
  • Word embeddings depend on a notion of word similarity
    • Similarity is computed using cosine similarity
  • A very useful definition is paradigmatic similarity:
    • Similar words occur in similar contexts; they are exchangeable
    • Example: in “Yesterday the President called a press conference”, “the President” is exchangeable with “POTUS” (President of the United States) or “Joe Biden”

37 of 56

  • Hope to have similar words nearby

38 of 56

NLP Pipeline- Word Embedding

  • Neural Approach (Word embedding)
    • Each word is represented by real values, as a vector of fixed dimension

    • Each value in the vector measures some feature or quality of the word
      • which is decided by the model after training on text data
    • These dimensions are not interpretable by humans; the labels are just for illustration
    • We can understand this with the help of the given table
  • Now, the problem is how to obtain these word embedding vectors
    • Train our own embedding layer
      • CBOW (Continuous Bag of Words), SkipGram
    • Pre-trained word embeddings
      • These models are trained on a very large corpus
      • Word2vec, GloVe, fastText

airplane = [0.7, 0.9, 0.9, 0.01, 0.35]
kite     = [0.7, 0.9, 0.2, 0.01, 0.2]

              airplane   kite
  Sky           0.7      0.7
  Fly           0.9      0.9
  Transport     0.9      0.2
  Animal        0.01     0.01
  Eat           0.35     0.2
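With the illustrative vectors above, cosine similarity makes the intuition concrete: airplane and kite agree on “sky” and “fly” but differ on “transport”, so they are similar without being identical. A sketch:

```python
import math

# Cosine similarity between the illustrative embedding vectors above.
airplane = [0.7, 0.9, 0.9, 0.01, 0.35]
kite     = [0.7, 0.9, 0.2, 0.01, 0.2]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

sim = cosine(airplane, kite)
print(round(sim, 3))  # 0.883: high, but not 1.0
```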

39 of 56

NLP Pipeline- Word Embedding

  • Traditional Methods
  • Either use one-hot encoding:
    • Each word in the vocabulary is represented by one bit position in a HUGE vector.
    • For example, with a vocabulary of 10,000 words, if “Hello” is the 4th word in the dictionary, it is represented by: 0 0 0 1 0 0 . . . . . . . 0 0 0
  • Or use a document representation:
    • Each word in the vocabulary is represented by its presence in documents.
    • For example, with a corpus of 1M documents, if “Hello” occurs in the 1st, 3rd and 5th documents only, it is represented by: 1 0 1 0 1 0 . . . . . . . 0 0 0
  • Context information is not utilized.
  • Word Embeddings
  • Store each word as a point in space, represented by a dense vector with a fixed number of dimensions (generally 300).
  • Unsupervised, built just by reading a huge corpus.
  • For example, “Hello” might be represented as: [0.4, -0.11, 0.55, 0.3 . . . 0.1, 0.02].
  • Dimensions are basically projections along different axes, more of a mathematical concept.

40 of 56

Word Embedding – A distributed representation

  • Distributional representation – word embedding
    • Any word wi in the corpus is given a distributional representation by an embedding
      • wi ∈ R^d, i.e., a d-dimensional vector that is learnt
  • For example:

41 of 56

Distributional Representation

  • Take a vector with several hundred dimensions (say 1000).
  • Each word is represented by a distribution of weights across those elements.
  • So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.

42 of 56

Distributional Representation: Illustration

  • If we label the dimensions in a hypothetical word vector (there are no such pre-assigned labels in the algorithm of course), it might look a bit like this:

43 of 56

Word embeddings: properties

  • Need to have a function W(word) that returns a vector encoding that word.

  • Similarity of words corresponds to nearby vectors.
    • Director – chairman, scratched – scraped

  • Relationships between words correspond to difference between vectors.
    • Big – bigger, small – smaller
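The offset idea (“relationships correspond to vector differences”) can be illustrated with hypothetical 2-D vectors; the values below are invented so that the big→bigger and small→smaller pairs share the same offset:

```python
# Toy 2-D embeddings (invented for illustration): the "comparative"
# relation corresponds to the same vector offset in both pairs.
emb = {
    "big":     (1.0, 1.0),
    "bigger":  (1.0, 2.0),
    "small":   (3.0, 1.0),
    "smaller": (3.0, 2.0),
    "cat":     (0.0, 5.0),   # unrelated distractor word
}

def add(u, v): return (u[0] + v[0], u[1] + v[1])
def sub(u, v): return (u[0] - v[0], u[1] - v[1])

# Analogy: big : bigger :: small : ?
predicted = add(emb["small"], sub(emb["bigger"], emb["big"]))

# Nearest word to the predicted point (excluding the query words):
def dist2(u, v):
    return (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2

answer = min((w for w in emb if w not in ("big", "bigger", "small")),
             key=lambda w: dist2(emb[w], predicted))
print(answer)  # smaller
```

Real embeddings are high-dimensional and the analogy only holds approximately, but the mechanism (add the offset, then search for the nearest vector) is the same.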

44 of 56

Word embeddings: properties

  • Relationships between words correspond to difference between vectors.

http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

45 of 56

NLP Pipeline- Word Embedding

Analogy Test

46 of 56

NLP Pipeline- Word Embedding

47 of 56

Word embeddings: relationships

  • Hope to preserve some language structure (relationships between words).

http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

48 of 56

NLP Pipeline- Word Embedding

  • Instead of capturing co-occurrence counts directly, predict the surrounding words of every word
    • Two variations: CBOW and Skip-gram

49 of 56

NLP Pipeline- Word Embedding

  • CBOW (Continuous Bag of Words)
    • CBOW is a Word2Vec architecture that predicts a target word based on its context
    • Unlike traditional bag-of-words models, CBOW takes into account a ‘continuous’ window of context words
    • We predict the center word from a given set of context words, i.e., the words before and after the center word
    • The model averages or sums the context word vectors and uses the result to predict the target word

I am learning Natural Language Processing from GFG.

I am learning Natural _____?_____ Processing from GFG.

  • The CBOW neural network architecture includes input, projection, and output layers.
  • The input layer receives the context words, which are then projected and averaged in the projection layer.
  • The output layer is a softmax layer predicting the probability of the target word.
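The training data for CBOW consists of (context window, target) pairs. A sketch of how such pairs are generated (the window size of 2 and whitespace tokenization are illustrative choices):

```python
# Generate (context, target) training pairs for CBOW with window size 2.
sentence = "I am learning Natural Language Processing from GFG".split()

def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Context: up to `window` words on each side of the target.
        context = [tokens[j] for j in range(max(0, i - window),
                                            min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

pairs = cbow_pairs(sentence)
# The pair whose target is "Language", matching the blank in the slide:
print([p for p in pairs if p[1] == "Language"][0])
```

The model would then average the embeddings of the context words and train a softmax layer to predict the target.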

50 of 56

NLP Pipeline- Word Embedding

  • SkipGram
    • The Skip-Gram model is designed to predict the context given a target word
    • It produces a distributed representation of words in which similar words have similar encodings
    • Particularly useful for learning representations of rare words in the corpus

I am learning Natural Language Processing from GFG.

I am __?___ _____?_____ Language ___?___ ____?____ GFG.

Input Layer: Receives the target word.
Hidden Layer: Contains the word embedding.
Output Layer: Predicts context words within a certain window.
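Skip-Gram inverts the CBOW pairing: each training example is a (target, one context word) pair. A sketch with the same illustrative window size of 2:

```python
# Generate (target, context) training pairs for Skip-Gram with window size 2.
sentence = "I am learning Natural Language Processing from GFG".split()

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(sentence)
# The target "Natural" must predict each word in its window:
print([c for t, c in pairs if t == "Natural"])
```

Note the asymmetry with CBOW: one center word now yields several training examples, which is part of why Skip-Gram tends to work better for rare words.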

51 of 56

NLP Pipeline- Word Embedding

  • Word2vec by Google
    • Word2Vec is a group of models that produce word embeddings, developed by a team of researchers at Google
    • It uses neural networks to learn word associations from a large corpus of text
    • Two main architectures: Continuous Bag of Words (CBOW) and Skip-Gram
    • CBOW predicts target words (e.g., ‘muffin’) from source context words (‘blueberry’, ‘eat’)
    • Skip-Gram does the inverse, predicting source context words from the target words

52 of 56

NLP Pipeline- Word Embedding

  • GloVe by Stanford
    • GloVe stands for Global Vectors.
    • It's an unsupervised learning algorithm for obtaining vector representations for words, developed by Stanford researchers
    • It is based on matrix factorization techniques on the word-context matrix
    • It combines the benefits of Word2Vec and latent semantic analysis (LSA) by looking at global word-word co-occurrence
    • The model is trained to learn vectors such that their dot product equals the logarithm of the words' probability of co-occurrence

53 of 56

NLP Pipeline- Word Embedding

  • FastText - Advanced Word Representations by Facebook
    • FastText is an extension of Word2Vec proposed by Facebook Research
    • It treats each word as composed of character n-grams
    • For example, the word "apple" with n=3 would be represented as ["<ap", "app", "ppl", "ple", "le>"] plus the special sequence "<apple>" to denote the whole word.
    • This allows FastText to generate better word embeddings for rare words, or even for words not seen during training
    • It can also help understand suffixes and prefixes and is thus stronger for morphologically rich languages
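The subword decomposition described above is easy to sketch: pad the word with boundary markers and slide a window of size n over it (the padding convention follows the "apple" example on this slide):

```python
# Character n-gram decomposition in the fastText style: pad the word with
# boundary markers '<' and '>', then take all n-grams of the padded word.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("apple")
print(grams)  # ['<ap', 'app', 'ppl', 'ple', 'le>']
```

A word's embedding is then the sum of its n-gram embeddings, which is why an unseen word can still get a reasonable vector from the n-grams it shares with known words.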

54 of 56

Evaluating Word Embeddings

55 of 56

Extrinsic vs intrinsic evaluation

Extrinsic evaluation

  • Let’s plug these word embeddings into a real NLP system and see whether this improves performance
  • Could take a long time but still the most important evaluation metric

(Figure: a sentiment example, “I don’t like this movie”, where each word’s embedding is fed into an ML model that predicts 👎.)

Intrinsic evaluation

  • Evaluate on a specific/intermediate subtask
  • Fast to compute
  • Not clear if it really helps the downstream task

56 of 56

Intrinsic evaluation

Word similarity

Example dataset: wordsim-353
  • 353 pairs of words with human judgements
  • http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖)

Metric: Spearman rank correlation between the human scores and the model’s cosine similarities
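A sketch of the intrinsic evaluation metric, assuming no tied scores (the human judgements and model similarities below are invented toy values, not wordsim-353 data):

```python
# Spearman rank correlation between human similarity judgements and
# model cosine similarities.
def ranks(xs):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    """Spearman's rho via the rank-difference formula (no ties)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [9.8, 8.5, 6.3, 3.1, 1.0]   # human scores for 5 word pairs (toy)
model = [0.9, 0.7, 0.8, 0.2, 0.1]   # model cosine similarities (toy)
print(spearman(human, model))  # one pair of ranks is swapped, so rho < 1
```

Because only the ranks matter, the metric rewards embeddings that order word pairs the way humans do, regardless of the absolute similarity values.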