NLP- Text Processing Pipeline

Natural Language Processing

Prof. Jebran Khan

NLP Pipeline

  • Data Acquisition
  • Text Cleaning
  • Text Preprocessing
  • Feature Engineering
  • Model Building
  • Evaluation
  • Deployment

NLP pipeline refers to the sequence of processes involved in analyzing and understanding human language

Deep learning Pipeline

NLP Pipeline-Data Acquisition

  • Data Acquisition
    • Collect data
      • Public Dataset: We can search for publicly available datasets matching our problem statement
      • Web Scraping: Web scraping is a technique for extracting data from websites
    • Extract data
      • Image to Text: We can also extract data from image files with the help of Optical Character Recognition (OCR)
      • PDF to Text: Several Python packages can convert PDF content into text
    • Generate data
      • Data augmentation: if the acquired data is not sufficient for our problem statement, we can generate synthetic data from the existing data
        • Synonym replacement
        • Back Translation
        • Bigram flipping
        • Adding noise to the data
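Two of the augmentation strategies above, synonym replacement and noise injection, can be sketched in a few lines of Python. The synonym table and helper names here are made-up illustrations, not a real resource:

```python
import random

# Toy synonym table; a real system would use WordNet or word embeddings.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"]}

def synonym_replacement(tokens, rng):
    # Replace each token with a random synonym, when one is known.
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def add_noise(tokens, rng, p=0.1):
    # Randomly drop tokens with probability p to simulate noisy input;
    # fall back to the original tokens if everything was dropped.
    return [t for t in tokens if rng.random() > p] or tokens

rng = random.Random(0)
print(synonym_replacement(["a", "good", "movie"], rng))
```

Each call yields a slightly different sentence with the same meaning, which is the point of augmentation.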

NLP Pipeline-Text Cleaning

  • Text may contain HTML tags, spelling mistakes, or special characters
  • So, we need to clean the text by removing or standardizing erroneous data tokens
  • Unicode Normalization
    • Text data may contain symbols, emojis, graphic characters, or special characters
    • We can remove these characters, or convert them to machine-readable text
  • Regex or Regular Expression
    • A regular expression is a tool for searching strings for specific patterns
    • e.g., phone numbers, email IDs, and URLs
    • we can keep or remove such text patterns as per requirements
  • Spelling corrections
    • Spelling mistakes are very common in online text
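The cleaning steps above can be combined into a single pass with the standard `re` and `unicodedata` modules. The `clean_text` helper below is an illustrative sketch, and its choice to drop URLs and emails is one of the "keep or remove per requirements" decisions:

```python
import re
import unicodedata

def clean_text(text):
    # Normalize Unicode so visually identical strings compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Strip HTML tags left over from scraping.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs and email addresses (kept or dropped per requirements).
    text = re.sub(r"https?://\S+|\S+@\S+\.\S+", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("<p>Contact  me at john@mail.com  or https://example.com</p>"))
# -> Contact me at or
```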

NLP Pipeline-Text Preprocessing

  • Once all the text is extracted and cleaned from the raw data, we can perform additional processing on it

NLP Pipeline- Text Preprocessing

  • Expanding Contractions
    • Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe
    • we’re = we are; we’ve = we have; I’d = I would
    • A contraction is an abbreviation for a sequence of words
    • A computer does not know that contractions are abbreviations for sequences of words
    • It considers we’re and we are to be two completely different things and does not recognize that the two have exactly the same meaning
    • Contractions increase the dimensionality of the document-term matrix
    • Expand contractions
      • Rule-based Approach:
        • Use a predefined set of rules to expand contractions.
        • Maps each contraction to its corresponding expanded form
        • Lack coverage for less common or ambiguous contractions
      • Statistical Language Models:
        • Utilize large corpora of text to learn the likelihood of word sequences
        • Can capture the context and predict the most probable expansion
        • Struggle with out-of-vocabulary contractions or cases where the context is insufficient
      • Neural Networks
        • Utilize deep learning models to expand contractions
        • Can learn complex patterns and relationships between words, improving their ability to handle ambiguous contractions
        • Trained on large datasets and can adapt to various contexts
        • require substantial computational resources and training data
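The rule-based approach is the simplest of the three: a predefined mapping from each contraction to its expanded form. The sketch below uses a tiny illustrative table (real lists, such as the one in the `contractions` package, are much larger) and lowercases matches, so original capitalization is not preserved:

```python
import re

# Toy rule table mapping contractions to expansions.
CONTRACTIONS = {"we're": "we are", "we've": "we have",
                "i'd": "i would", "can't": "cannot"}

def expand_contractions(text):
    # Word boundaries keep us from rewriting substrings of longer words.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
        re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("We're sure we've seen it"))
# -> we are sure we have seen it
```

As the slide notes, this approach lacks coverage: any contraction missing from the table is left untouched.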

NLP Pipeline- Text Preprocessing

  • Removing accented characters
    • Some characters are written with specific accents or symbols
      • to either imply a different pronunciation
      • or to signify that words containing such accented texts have a different meaning
      • résumé: a document that highlights your professional skills and achievements
      • resume: to continue a previous task or action
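When the accents are not meaning-bearing for the task at hand, they can be stripped with the standard `unicodedata` module; NFKD decomposition separates base characters from their combining accent marks, which are then dropped. The `strip_accents` name is our own:

```python
import unicodedata

def strip_accents(text):
    # NFKD splits 'é' into 'e' + a combining accent mark;
    # dropping the combining marks leaves plain ASCII letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))  # -> resume
```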

NLP Pipeline-Text Preprocessing

  • Chunking
    • Combining related tokens into a single token
      • creating related noun groups, related verb groups, etc.
    • For example, “New York City” could be treated as a single token/chunk instead of as three separate tokens
    • Chunking combines similar tokens together, making the overall process of analyzing the text a bit easier to perform
  • Lowercasing
    • This step is used to convert all the text to lowercase letters
    • This is useful in various NLP tasks such as text classification, information retrieval, and sentiment analysis
  • Stop words removal
    • Stop words are commonly occurring words in a language such as “the”, “and”, “a”, etc.
    • They are usually removed from the text during preprocessing because they do not carry much meaning and can cause noise in the data
    • Removal of stop words is not always beneficial, it depends on the problem
    • This step is used in various NLP tasks such as text classification, information retrieval, and topic modeling
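Lowercasing and stop-word removal fit in a few lines. The stop-word list below is a tiny illustration; NLTK and spaCy ship much fuller ones:

```python
# Tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "of"}

def preprocess(tokens):
    # Lowercase first so the stop-word lookup is case-insensitive.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

print(preprocess(["The", "cat", "is", "in", "the", "garden"]))
# -> ['cat', 'garden']
```

As the slide warns, removal is not always beneficial: for tasks like sentiment analysis, words such as "not" can be decisive and should stay.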

NLP Pipeline-Text Preprocessing

  • Stemming and lemmatization
    • Both are used to reduce words to their base form
    • This can help reduce the vocabulary size and simplify the text
  • Stemming
    • Stemming involves stripping suffixes from words to obtain their stem
  • Lemmatization
    • Lemmatization involves reducing words to their base form based on their part of speech
  • Commonly used in various NLP tasks such as text classification, information retrieval, and topic modeling
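To make the suffix-stripping idea concrete, here is a deliberately naive toy stemmer; real systems use the Porter or Snowball algorithms (available, for example, via `nltk.stem`), which handle many more rules and exceptions:

```python
def naive_stem(word):
    # Strip the first matching suffix, longest candidates first,
    # keeping at least three characters of stem.
    for suffix in ("ization", "ational", "ing", "ers", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "players", "normalization"]])
# -> ['runn', 'play', 'normal']
```

Note that stems need not be real words ("runn"), which is exactly the difference from lemmatization, whose output is always a dictionary form.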

NLP Pipeline-Text Preprocessing

  • Stemming vs Lemmatization
    • Stemming is useful in the context of search queries and information retrieval, as it lets us match more documents in the corpus and find relevant results.
    • Lemmatization makes different forms of the same words consistent with each other. This is useful in word vectorization.

NLP Pipeline-Text Preprocessing

  • Removing digits and punctuations
    • Remove digits and punctuation from the text
    • This is useful in various NLP tasks such as text classification, sentiment analysis, and topic modeling
  • POS tagging
    • POS tagging involves assigning a part of speech tag to each word in a text
    • This step is commonly used in various NLP tasks such as named entity recognition, sentiment analysis, and machine translation
  • Named Entity Recognition (NER)
    • NER involves identifying and classifying named entities in text, such as people, organizations, and locations
    • This step is commonly used in various NLP tasks such as information extraction, machine translation, and question-answering

Language identification

  • Language identification is the task of detecting the source language for the input text.
    • This is preliminary to spell checking, tokenization, acronym expansion, etc.
  • Several statistical techniques exist for this task: function-word frequency, N-gram language models (covered in a later lecture), distance measures based on mutual information, etc.
  • Explore the following libraries
    • Python langdetect
    • Apache OpenNLP LanguageDetector
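The function-word frequency technique mentioned above can be sketched without any library: score each candidate language by how many of its characteristic function words appear in the input. The tiny word profiles below are illustrative, not complete:

```python
# Illustrative function-word profiles per language.
PROFILES = {
    "en": {"the", "and", "of", "is", "to"},
    "de": {"der", "und", "die", "ist", "zu"},
    "es": {"el", "y", "de", "es", "la"},
}

def identify_language(text):
    words = set(text.lower().split())
    # Score each language by overlap with its function-word set.
    scores = {lang: len(words & fn_words) for lang, fn_words in PROFILES.items()}
    return max(scores, key=scores.get)

print(identify_language("der Hund und die Katze"))  # -> de
```

Real identifiers such as langdetect use character N-gram models, which are far more robust on short inputs.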

Spell checker

  • Spell checkers correct spelling mistakes in text.
    • Especially useful for quickly written text, such as Twitter and Amazon reviews.
  • Spell checkers use approximate string-matching algorithms, such as Levenshtein distance, to find correct spellings.
  • Difficult cases: a misspelled word might still be in the language (English: than vs. then, their vs. there).
    • To deal with these cases, more sophisticated algorithms analyze the context formed by the surrounding words.
  • Explore the following libraries
    • Python TextBlob, based on the Natural Language Toolkit (NLTK) library
    • Resources for fuzzy string matching

https://github.com/seatgeek/fuzzywuzzy

https://pypi.org/project/fuzzywuzzy/
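Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The classic dynamic-programming computation fits in a few lines:

```python
def levenshtein(a, b):
    # Row-by-row dynamic programming over the edit-distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```

A simple spell checker suggests the dictionary word with the smallest distance to the misspelled token; note that `levenshtein("than", "then")` is 1, which is why such confusions need context-aware methods.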

Textual Variations in SM Text

OOV Words Handling

NLP Pipeline-Text Preprocessing

  • Tokenization
    • Tokenization is the process of segmenting the text into a list of meaningful chunks (tokens)
    • In sentence tokenization each token is a sentence; in word tokenization each token is a word
    • It is a good idea to first complete sentence tokenization and then word tokenization
    • the output will then be a list of lists
    • Tokenization is performed in every NLP pipeline
    • Tokens can be words, phrases, characters etc. depending on the application
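The sentence-then-word order described above produces a list of lists, as in this regex-based sketch (real tokenizers, e.g. in NLTK or spaCy, handle abbreviations, quotes, and many edge cases this ignores):

```python
import re

def tokenize(text):
    # Split into sentences at whitespace that follows ., ! or ?,
    # then split each sentence into word tokens.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+", s) for s in sentences if s]

print(tokenize("NLP is fun. Tokenize me!"))
# -> [['NLP', 'is', 'fun'], ['Tokenize', 'me']]
```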

Word tokenization

  • German writes compound nouns without spaces.
    • Example : Computerlinguistik, ‘computational linguistics’.
    • Several compound-splitter tools available.
  • Italian and Spanish incorporate verbs and clitics, which are a special type of pronoun.
    • Example : comprarlo > comprare + lo, ‘to buy it’.
    • This process can be iterated on the same word.

Word tokenization

  • There are certain language-independent tokens that require specialized processing
    • phone numbers: (800) 234-2333
    • dates: Mar 11, 1983
    • https://dateparser.readthedocs.io/en/latest/
    • email addresses: jblack@mail.yahoo.com
    • web URLs: http://stuff.big.com/new/specials.html
    • hashtags: #nlproc
  • Use of regular expressions is recommended in these cases.
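Following that recommendation, here are illustrative regular expressions for the token types listed on this slide; production-grade patterns handle many more formats (international phone numbers, quoted local parts in emails, etc.):

```python
import re

# Illustrative patterns, matching the slide's example formats only.
PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")

text = ("Call (800) 234-2333, mail jblack@mail.yahoo.com, "
        "see http://stuff.big.com #nlproc")
print(PHONE.findall(text), EMAIL.findall(text),
      URL.findall(text), HASHTAG.findall(text))
```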

Character tokenization

  • Major east Asian languages (e.g., Chinese, Japanese, Korean, and Thai) write text without any spaces between words.
  • For most Chinese NLP tasks, character tokenization works better than word tokenization
    • each character generally represents a single unit of meaning
    • word tokenization results in huge vocabulary, with large number of very rare words

Subword tokenization

  • Many NLP systems need to deal with unknown words, that is, words that are not in the vocabulary of the system.
  • Example :
  • If the training corpus contains the words foot and ball, but not the word football, then if football appears in the test set the system does not know what to do.
  • Example :
  • If the training corpus contains the words low, new, newer but not lower, then if lower appears in the test set the system does not know what to do.

Subword tokenization

  • To deal with the problem of unknown words, modern tokenizers automatically induce sets of tokens that include tokens smaller than words, called subwords.
  • Subword tokenization reduces vocabulary size, and has become the most common tokenization method for large language modelling and neural models in general (see future lectures).
  • Subword tokenization is inspired by algorithms originally developed in information theory as a simple and fast form of data compression alternative to Lempel-Ziv-Welch.
  • Interestingly, the subword units induced by these compression-style algorithms often work better in practice than linguistically defined morphemes.

Subword tokenization

  • Subword tokenization schemes consist of three different algorithms
  • the token learner takes a raw training corpus and induces a set of tokens, called vocabulary
  • the token segmenter (encoder) takes a vocabulary and a raw test sentence, and segments the sentence into the tokens in the vocabulary
  • the token merger (decoder) takes a token sequence and reconstructs the original sentence

Subword tokenization

  • Example :
  • Given the sample sentence ‘GPT-3 can be used for linguistics’
  • learner constructs the vocabulary:
  • -, 3, be, can, for, G, istics, lingu, PT, used
  • encoder translates sample sentence into token sequence:
  • G, PT, -, 3, can, be, used, for, lingu, istics
  • decoder translates back to the original sentence, including white spaces:
  • GPT-3 can be used for linguistics

Subword tokenization

  • Three algorithms are widely used for subword tokenization
    • byte-pair encoding (BPE) tokenization
    • unigram tokenization
    • WordPiece tokenization
  • Explore the following library
    • SentencePiece: includes implementations of BPE and unigram tokenization

BPE: learner

  • The BPE token learner is usually run inside words, not merging across word boundaries. To this end, use a special end-of-word marker.
  • The algorithm iterates through the following steps
    • begin with a vocabulary composed of all individual characters
    • choose the two symbols A, B that are most frequently adjacent
    • add a new merged symbol AB to the vocabulary
    • replace every adjacent A, B in the corpus with AB
  • The algorithm follows a greedy approach.
  • Stop when the vocabulary reaches size k, a hyperparameter.
  • Stopping criterion can alternatively be the number of iterations (merges).
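The token learner above can be sketched directly from those steps. This is a minimal illustrative implementation (function and variable names are our own), using a number of merges as the stopping criterion:

```python
from collections import Counter

END = "_"  # end-of-word marker, so merges never cross word boundaries

def bpe_learn(words, num_merges):
    # Each word starts as a sequence of characters plus the marker.
    corpus = Counter(tuple(w) + (END,) for w in words)
    vocab = {ch for seq in corpus for ch in seq}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for seq, freq in corpus.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)  # greedy: most frequent pair
        merges.append((a, b))
        vocab.add(a + b)
        # Replace every adjacent (a, b) in the corpus with the merged symbol.
        new_corpus = Counter()
        for seq, freq in corpus.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = bpe_learn(["low"] * 5 + ["lower"] * 2 + ["newer"] * 6, 4)
print(merges)
```

On this toy corpus the learner quickly induces a word-final token ending in the marker, mirroring the er/wer_ behaviour the next slides walk through.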

BPE: learner

  • Example : Underscore is the end-of-word marker
  • Most frequent pair is e, r with a total of 9 occurrences (we arbitrarily break ties).

BPE: learner

  • The algorithm now learns the word-final token er
  • The next merge produces token ne

BPE: learner

  • If we continue, the next merges
  • After several iterations, BPE
    • learns entire words
    • most frequent units, useful for tokenizing unknown words

BPE: learner

  • There are two versions of the BPE token segmenter (encoder)
    • apply the merge rules in frequency order over the whole data set
    • for each word, left-to-right, match the longest token from the vocabulary (eager)
  • It is not clear whether the two algorithms always produce the same encoding.
  • Example :
  • Assume training corpus contained words newer, low, but not lower. Typically, the test word [lower] will be encoded by means of tokens [low, er_].

BPE: encoder

  • Encoding is computationally expensive.
  • Many systems use some form of caching:
  • pre-tokenize all the words and save how a word should be tokenized in a dictionary
  • when an unknown word (not in dictionary) is seen
  • apply the encoder to tokenize the word
  • add the tokenization to the dictionary for future reference
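The eager (longest-match) segmenter and the caching scheme can be sketched together; both function names below are illustrative:

```python
def bpe_encode(word, vocab, end="_"):
    # Eager left-to-right segmentation: at each position, take the
    # longest vocabulary token that matches; fall back to one character.
    seq = word + end
    tokens, i = [], 0
    while i < len(seq):
        for j in range(len(seq), i, -1):
            if seq[i:j] in vocab or j == i + 1:
                tokens.append(seq[i:j])
                i = j
                break
    return tokens

cache = {}

def encode_cached(word, vocab):
    # Dictionary cache: tokenize each distinct word only once.
    if word not in cache:
        cache[word] = bpe_encode(word, vocab)
    return cache[word]

vocab = {"low", "er_", "new"}
print(encode_cached("lower", vocab))  # -> ['low', 'er_']
```

This reproduces the slide's example: the unseen word lower is encoded with the learned tokens [low, er_].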

BPE: decoder

  • BPE token merger: to decode, we have to
    • concatenate all the tokens together to get back the words
    • use the end-of-word marker to resolve possible ambiguities
  • Example :
  • The encoded sequence
  • [the_, high, est_, range_, in_, Seattle_]
  • will be decoded as
  • [the, highest, range, in, Seattle]
  • as opposed to
  • [the, high, estrange, in, Seattle]
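Decoding is the easy direction: concatenate, then split words on the end-of-word marker. A one-function sketch:

```python
def bpe_decode(tokens, end="_"):
    # Concatenate tokens, then let the end-of-word markers
    # tell us where the word boundaries were.
    return "".join(tokens).replace(end, " ").split()

print(bpe_decode(["the_", "high", "est_", "range_", "in_", "Seattle_"]))
# -> ['the', 'highest', 'range', 'in', 'Seattle']
```

Without the marker, the same token sequence could wrongly decode to [the, high, estrange, in, Seattle], as the slide shows.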

WordPiece

  • WordPiece is a subword tokenization algorithm used by the large language model BERT.
  • BERT will be presented in a later lecture.
  • Like BPE, WordPiece starts from the initial alphabet and learns merge rules.
  • The main difference is the way the pair A, B is selected for merging (f(X) denotes the frequency of token X): score(A, B) = f(AB) / (f(A) · f(B))

  • The algorithm prioritizes the merging of pairs where the individual parts are less frequent in the vocabulary.
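The standard WordPiece score, f(AB) / (f(A) · f(B)), makes this prioritization explicit; a pair of rare symbols outranks a pair of individually frequent ones even when the latter co-occurs more often. The frequencies below are made-up numbers for illustration:

```python
def wordpiece_score(pair_freq, freq_a, freq_b):
    # score(A, B) = f(AB) / (f(A) * f(B)): dividing by the part
    # frequencies favours merging pairs whose parts are rare.
    return pair_freq / (freq_a * freq_b)

# A rare pair seen 10 times beats a frequent pair seen 50 times,
# because its parts (frequencies 12 and 11) are themselves rare.
print(wordpiece_score(10, 12, 11) > wordpiece_score(50, 400, 300))  # -> True
```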

NLP Pipeline- Feature Engineering

  • The main task is to represent the text as a numeric vector in such a way that the ML algorithm can understand the text attributes
  • There are two most common approaches for Text Representation
    • Classical or Traditional Approach
      • In the traditional approach, we create a vocabulary of unique words, assign a unique id (an integer value) to each word, and then replace each word of a sentence with its unique id
      • Each word of the vocabulary is treated as a feature, so when the vocabulary is large, the feature space becomes very large
      • One Hot Encoding, Bag of Words (BoW), Bag of n-grams, TF-IDF, etc.
    • Neural Approach (Word embedding)
      • The above techniques are not very good for complex tasks like text generation and text summarization
      • Because they cannot capture the contextual meaning of words
      • Word embeddings try to incorporate the contextual meaning of words
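The classical approach described above (vocabulary of unique words with integer ids, count-based vectors) can be sketched in pure Python; libraries such as scikit-learn's `CountVectorizer` do the same with many more options:

```python
from collections import Counter

def build_vocab(corpus):
    # Assign each unique word an integer id, in order of appearance.
    vocab = {}
    for doc in corpus:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def bow_vector(doc, vocab):
    # Bag-of-words: one count per vocabulary word; out-of-vocabulary
    # words in the document are simply dropped.
    counts = Counter(doc.lower().split())
    return [counts.get(w, 0) for w in vocab]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
print(bow_vector("the cat sat on the mat", vocab))  # -> [2, 1, 1, 0, 0]
```

The vector has one dimension per vocabulary word, which is exactly why large vocabularies blow up the feature size.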

NLP Pipeline- Model Building

  • Heuristic-Based Model
    • Lexicon-Based Sentiment Analysis
      • Works by counting Positive and Negative words in sentences
    • Wordnet
      • It has a database of words with synonyms, hyponyms, and meronyms
      • It uses this database for solving rule-based NLP tasks
    • When we have little or no data, we can use a heuristic approach
  • Machine Learning Model
    • Naive Bayes
      • It is a group of classification algorithms based on Bayes’ Theorem
      • It assumes that each feature has an equal and independent contribution to the outcomes
      • Often used for document classification tasks, such as sentiment analysis or spam filtering
    • Support Vector Machine (SVM)
      • It is a popular supervised learning algorithm used for classification and regression analysis
      • It attempts to find the best hyperplane that separates the data points into different classes while maximizing the margin between the hyperplane and the closest data points
      • In the context of NLP, SVM is often used for text classification tasks, such as sentiment analysis or topic classification
  • Deep Learning Model
    • Recurrent neural networks, LSTM, GRU etc.
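The lexicon-based heuristic, counting positive and negative words, needs no training data at all, which is why it suits the low-data setting. The word lists below are toy illustrations; real lexicons (e.g. VADER or SentiWordNet) are far larger:

```python
# Toy sentiment lexicons for illustration.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def lexicon_sentiment(text):
    # Count positive and negative words and compare the totals.
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("a great movie with excellent acting"))  # -> positive
```

Its obvious blind spot, negation ("not great" counts as positive), is one reason to move to the ML and deep learning models above once data is available.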

NLP Pipeline- Evaluation

  • The evaluation metric depends on the type of NLP task or problem
  • Some of the popular methods for evaluation according to the NLP tasks are
    • Classification
      • Accuracy, Precision, Recall, F1-score, AUC
    • Sequence Labelling
      • F1-Score
    • Information Retrieval
      • Mean Reciprocal Rank (MRR), Mean Average Precision (MAP)
    • Text summarization
      • ROUGE- Recall-Oriented Understudy for Gisting Evaluation
    • Regression [Stock Market Price predictions, Temperature Predictions]
      • Root Mean Square Error, Mean Absolute Percentage Error
    • Text Generation
      • BLEU (Bilingual Evaluation Understudy), Perplexity
    • Machine Translation
      • BLEU (Bilingual Evaluation Understudy), METEOR
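The classification metrics at the top of the list follow directly from the counts of true positives, false positives, and false negatives; a self-contained sketch for the binary case:

```python
def classification_scores(y_true, y_pred, positive=1):
    # Precision, recall and F1 for a binary classifier.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(classification_scores([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```

F1 is the harmonic mean of precision and recall, which is why it is also the standard summary metric for sequence labelling.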

NLP Pipeline- Deployment

  • Making a trained NLP model usable in a production setting is known as deployment
  • The precise deployment process can vary based on the platform and use case
    1. Export the trained model
    2. Prepare the input pipeline
    3. Set up the inference service
    4. Monitor performance and scale
    5. Continuous improvement