NLP- Text Processing Pipeline

Natural Language Processing

Prof. Jebran Khan

NLP Pipeline

  • Data Acquisition
  • Text Cleaning
  • Text Preprocessing
  • Feature Engineering
  • Model Building
  • Evaluation
  • Deployment

NLP pipeline refers to the sequence of processes involved in analyzing and understanding human language

Deep learning Pipeline

NLP Pipeline-Data Acquisition

  • Data Acquisition
    • Collect data
      • Public Dataset: We can search for publicly available datasets matching our problem statement
      • Web Scraping: Web scraping is a technique for extracting data from websites
    • Extract data
      • Image to Text: We can also extract data from image files with the help of Optical Character Recognition (OCR)
      • PDF to Text: Several Python packages can convert PDF content into text
    • Generate data
      • Data augmentation: if the acquired data is not sufficient for our problem statement, we can generate synthetic data from the existing data
        • Synonym replacement
        • Back Translation
        • Bigram flipping
        • Adding noise to the data
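Two of the augmentation strategies above, synonym replacement and noise injection, can be sketched in a few lines of Python. The synonym table and helper names here are made-up illustrations, not a real resource:

```python
import random

# Toy synonym table; a real system would use WordNet or word embeddings.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"]}

def synonym_replacement(tokens, rng):
    # Replace each token with a random synonym, when one is known.
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def add_noise(tokens, rng, p=0.1):
    # Randomly drop tokens with probability p to simulate noisy input;
    # fall back to the original tokens if everything was dropped.
    return [t for t in tokens if rng.random() > p] or tokens

rng = random.Random(0)
print(synonym_replacement(["a", "good", "movie"], rng))
```

Each call yields a slightly different sentence with the same meaning, which is the point of augmentation.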

NLP Pipeline-Text Cleaning

  • Text may contain HTML tags, spelling mistakes, or special characters
  • So, we need to clean the text by removing or standardizing erroneous data tokens
  • Unicode Normalization
    • Text data may contain symbols, emojis, graphic characters, or special characters
    • We can remove these characters, or convert them to machine-readable text
  • Regex or Regular Expression
    • A regular expression is a tool for searching strings for specific patterns
    • e.g., phone numbers, email IDs, and URLs
    • we can keep or remove such text patterns as per requirements
  • Spelling corrections
    • Spelling mistakes are very common in online text
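The cleaning steps above can be combined into a single pass with the standard `re` and `unicodedata` modules. The `clean_text` helper below is an illustrative sketch, and its choice to drop URLs and emails is one of the "keep or remove per requirements" decisions:

```python
import re
import unicodedata

def clean_text(text):
    # Normalize Unicode so visually identical strings compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Strip HTML tags left over from scraping.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove URLs and email addresses (kept or dropped per requirements).
    text = re.sub(r"https?://\S+|\S+@\S+\.\S+", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("<p>Contact  me at john@mail.com  or https://example.com</p>"))
# -> Contact me at or
```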

NLP Pipeline-Text Preprocessing

  • Once all the text is extracted and cleaned from the raw data, we can perform additional processing on it

NLP Pipeline- Text Preprocessing

  • Expanding Contractions
    • Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe
    • we’re = we are; we’ve = we have; I’d = I would
    • A contraction is an abbreviation for a sequence of words
    • A computer does not know that contractions are abbreviations for sequences of words
    • It considers we’re and we are to be two completely different things and does not recognize that the two have exactly the same meaning
    • Contractions increase the dimensionality of the document-term matrix
    • Expand contractions
      • Rule-based Approach:
        • Use a predefined set of rules to expand contractions.
        • Maps each contraction to its corresponding expanded form
        • Lack coverage for less common or ambiguous contractions
      • Statistical Language Models:
        • Utilize large corpora of text to learn the likelihood of word sequences
        • Can capture the context and predict the most probable expansion
        • Struggle with out-of-vocabulary contractions or cases where the context is insufficient
      • Neural Networks
        • Utilize deep learning models to expand contractions
        • Can learn complex patterns and relationships between words, improving their ability to handle ambiguous contractions
        • Trained on large datasets and can adapt to various contexts
        • require substantial computational resources and training data
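The rule-based approach is the simplest of the three: a predefined mapping from each contraction to its expanded form. The sketch below uses a tiny illustrative table (real lists, such as the one in the `contractions` package, are much larger) and lowercases matches, so original capitalization is not preserved:

```python
import re

# Toy rule table mapping contractions to expansions.
CONTRACTIONS = {"we're": "we are", "we've": "we have",
                "i'd": "i would", "can't": "cannot"}

def expand_contractions(text):
    # Word boundaries keep us from rewriting substrings of longer words.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
        re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("We're sure we've seen it"))
# -> we are sure we have seen it
```

As the slide notes, this approach lacks coverage: any contraction missing from the table is left untouched.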

NLP Pipeline- Text Preprocessing

  • Removing accented characters
    • Some characters are written with specific accents or symbols
      • to either imply a different pronunciation
      • or to signify that words containing such accented texts have a different meaning
      • résumé: a document that highlights your professional skills and achievements
      • resume: to continue a previous task or action
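When the accents are not meaning-bearing for the task at hand, they can be stripped with the standard `unicodedata` module; NFKD decomposition separates base characters from their combining accent marks, which are then dropped. The `strip_accents` name is our own:

```python
import unicodedata

def strip_accents(text):
    # NFKD splits 'é' into 'e' + a combining accent mark;
    # dropping the combining marks leaves plain ASCII letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))  # -> resume
```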

NLP Pipeline-Text Preprocessing

  • Chunking
    • Combining related tokens into a single token
      • creating related noun groups, related verb groups, etc.
    • For example, “New York City” could be treated as a single token/chunk instead of as three separate tokens
    • Chunking combines similar tokens together, making the overall process of analyzing the text a bit easier to perform
  • Lowercasing
    • This step is used to convert all the text to lowercase letters
    • This is useful in various NLP tasks such as text classification, information retrieval, and sentiment analysis
  • Stop words removal
    • Stop words are commonly occurring words in a language such as “the”, “and”, “a”, etc.
    • They are usually removed from the text during preprocessing because they do not carry much meaning and can cause noise in the data
    • Removal of stop words is not always beneficial, it depends on the problem
    • This step is used in various NLP tasks such as text classification, information retrieval, and topic modeling
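Lowercasing and stop-word removal fit in a few lines. The stop-word list below is a tiny illustration; NLTK and spaCy ship much fuller ones:

```python
# Tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "of"}

def preprocess(tokens):
    # Lowercase first so the stop-word lookup is case-insensitive.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

print(preprocess(["The", "cat", "is", "in", "the", "garden"]))
# -> ['cat', 'garden']
```

As the slide warns, removal is not always beneficial: for tasks like sentiment analysis, words such as "not" can be decisive and should stay.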

NLP Pipeline-Text Preprocessing

  • Stemming and lemmatization
    • Both are used to reduce words to their base form
    • This can help reduce the vocabulary size and simplify the text
  • Stemming
    • Stemming involves stripping suffixes from words to obtain their stem
  • Lemmatization
    • Lemmatization involves reducing words to their base form based on their part of speech
  • Commonly used in various NLP tasks such as text classification, information retrieval, and topic modeling
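To make the suffix-stripping idea concrete, here is a deliberately naive toy stemmer; real systems use the Porter or Snowball algorithms (available, for example, via `nltk.stem`), which handle many more rules and exceptions:

```python
def naive_stem(word):
    # Strip the first matching suffix, longest candidates first,
    # keeping at least three characters of stem.
    for suffix in ("ization", "ational", "ing", "ers", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "players", "normalization"]])
# -> ['runn', 'play', 'normal']
```

Note that stems need not be real words ("runn"), which is exactly the difference from lemmatization, whose output is always a dictionary form.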

NLP Pipeline-Text Preprocessing

  • Stemming vs Lemmatization
    • Stemming is useful in the context of search queries and information retrieval, as it lets us match more documents in the corpus and find relevant results.
    • Lemmatization makes different forms of the same words consistent with each other. This is useful in word vectorization.

NLP Pipeline-Text Preprocessing

  • Removing digits and punctuations
    • Remove digits and punctuation from the text
    • This is useful in various NLP tasks such as text classification, sentiment analysis, and topic modeling
  • POS tagging
    • POS tagging involves assigning a part of speech tag to each word in a text
    • This step is commonly used in various NLP tasks such as named entity recognition, sentiment analysis, and machine translation
  • Named Entity Recognition (NER)
    • NER involves identifying and classifying named entities in text, such as people, organizations, and locations
    • This step is commonly used in various NLP tasks such as information extraction, machine translation, and question-answering

Language identification

  • Language identification is the task of detecting the source language for the input text.
    • This is preliminary to spell checking, tokenization, acronym expansion, etc.
  • Several statistical techniques exist for this task: function-word frequency, N-gram language models (covered in a later lecture), distance measures based on mutual information, etc.
  • Explore the following libraries
    • Python langdetect
    • Apache OpenNLP LanguageDetector
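The function-word frequency technique mentioned above can be sketched without any library: score each candidate language by how many of its characteristic function words appear in the input. The tiny word profiles below are illustrative, not complete:

```python
# Illustrative function-word profiles per language.
PROFILES = {
    "en": {"the", "and", "of", "is", "to"},
    "de": {"der", "und", "die", "ist", "zu"},
    "es": {"el", "y", "de", "es", "la"},
}

def identify_language(text):
    words = set(text.lower().split())
    # Score each language by overlap with its function-word set.
    scores = {lang: len(words & fn_words) for lang, fn_words in PROFILES.items()}
    return max(scores, key=scores.get)

print(identify_language("der Hund und die Katze"))  # -> de
```

Real identifiers such as langdetect use character N-gram models, which are far more robust on short inputs.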

Spell checker

  • Spell checkers correct spelling mistakes in text.
    • Especially useful for quickly written text, such as Twitter and Amazon reviews.
  • Spell checkers use approximate string-matching algorithms, such as Levenshtein distance, to find correct spellings.
  • Difficult cases: a misspelled word might still be in the language (English: than vs. then, their vs. there).
    • To deal with these cases, more sophisticated algorithms analyze the context formed by the surrounding words.
  • Explore the following libraries
    • Python TextBlob, based on the Natural Language Toolkit (NLTK) library
    • Resources for fuzzy string matching

https://github.com/seatgeek/fuzzywuzzy

https://pypi.org/project/fuzzywuzzy/
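Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The classic dynamic-programming computation fits in a few lines:

```python
def levenshtein(a, b):
    # Row-by-row dynamic programming over the edit-distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```

A simple spell checker suggests the dictionary word with the smallest distance to the misspelled token; note that `levenshtein("than", "then")` is 1, which is why such confusions need context-aware methods.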

Textual Variations in SM Text

OOV Words Handling

NLP Pipeline-Text Preprocessing

  • Tokenization
    • Tokenization is the process of segmenting the text into a list of meaningful chunks (tokens)
    • In sentence tokenization each token is a sentence; in word tokenization each token is a word
    • It is a good idea to first complete sentence tokenization and then word tokenization
    • the output will then be a list of lists
    • Tokenization is performed in every NLP pipeline
    • Tokens can be words, phrases, characters etc. depending on the application
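The sentence-then-word order described above produces a list of lists, as in this regex-based sketch (real tokenizers, e.g. in NLTK or spaCy, handle abbreviations, quotes, and many edge cases this ignores):

```python
import re

def tokenize(text):
    # Split into sentences at whitespace that follows ., ! or ?,
    # then split each sentence into word tokens.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+", s) for s in sentences if s]

print(tokenize("NLP is fun. Tokenize me!"))
# -> [['NLP', 'is', 'fun'], ['Tokenize', 'me']]
```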

Word tokenization

  • German writes compound nouns without spaces.
    • Example : Computerlinguistik, ‘computational linguistics’.
    • Several compound-splitter tools available.
  • Italian and Spanish incorporate verbs and clitics, which are a special type of pronoun.
    • Example : comprarlo > comprare + lo, ‘to buy it’.
    • This process can be iterated on the same word.

Word tokenization

  • There are certain language-independent tokens that require specialized processing
    • phone numbers: (800) 234-2333
    • dates: Mar 11, 1983
    • https://dateparser.readthedocs.io/en/latest/
    • email addresses: jblack@mail.yahoo.com
    • web URLs: http://stuff.big.com/new/specials.html
    • hashtags: #nlproc
  • Use of regular expressions is recommended in these cases.
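Following that recommendation, here are illustrative regular expressions for the token types listed on this slide; production-grade patterns handle many more formats (international phone numbers, quoted local parts in emails, etc.):

```python
import re

# Illustrative patterns, matching the slide's example formats only.
PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")

text = ("Call (800) 234-2333, mail jblack@mail.yahoo.com, "
        "see http://stuff.big.com #nlproc")
print(PHONE.findall(text), EMAIL.findall(text),
      URL.findall(text), HASHTAG.findall(text))
```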

Character tokenization

  • Major east Asian languages (e.g., Chinese, Japanese, Korean, and Thai) write text without any spaces between words.
  • For most Chinese NLP tasks, character tokenization works better than word tokenization
    • each character generally represents a single unit of meaning
    • word tokenization results in huge vocabulary, with large number of very rare words

Subword tokenization

  • Many NLP systems need to deal with unknown words, that is, words that are not in the vocabulary of the system.
  • Example :
  • If the training corpus contains the words foot and ball, but not the word football, then if football appears in the test set the system does not know what to do.
  • Example :
  • If the training corpus contains the words low, new, newer but not lower, then if lower appears in the test set the system does not know what to do.

Subword tokenization

  • To deal with the problem of unknown words, modern tokenizers automatically induce sets of tokens that include tokens smaller than words, called subwords.
  • Subword tokenization reduces vocabulary size, and has become the most common tokenization method for large language modelling and neural models in general (see future lectures).
  • Subword tokenization is inspired by algorithms originally developed in information theory as a simple and fast form of data compression alternative to Lempel-Ziv-Welch.
  • Interestingly, the subword units induced by these compression-style algorithms often work better in practice than linguistically defined morphemes.

Subword tokenization

  • Subword tokenization schemes consist of three different algorithms
  • the token learner takes a raw training corpus and induces a set of tokens, called vocabulary
  • the token segmenter (encoder) takes a vocabulary and a raw test sentence, and segments the sentence into the tokens in the vocabulary
  • the token merger (decoder) takes a token sequence and reconstructs the original sentence

Subword tokenization

  • Example :
  • Given the sample sentence ‘GPT-3 can be used for linguistics’
  • learner constructs the vocabulary:
  • -, 3, be, can, for, G, istics, lingu, PT, used
  • encoder translates sample sentence into token sequence:
  • G, PT, -, 3, can, be, used, for, lingu, istics
  • decoder translates back to the original sentence, including white spaces:
  • GPT-3 can be used for linguistics

Subword tokenization

  • Three algorithms are widely used for subword tokenization
    • byte-pair encoding (BPE) tokenization
    • unigram tokenization
    • WordPiece tokenization
  • Explore the following library
    • SentencePiece: includes implementations of BPE and unigram tokenization

BPE: learner

  • The BPE token learner is usually run inside words, not merging across word boundaries. To this end, use a special end-of-word marker.
  • The algorithm iterates through the following steps
    • begin with a vocabulary composed of all individual characters
    • choose the two symbols A, B that are most frequently adjacent
    • add a new merged symbol AB to the vocabulary
    • replace every adjacent A, B in the corpus with AB
  • The algorithm follows a greedy approach.
  • Stop when the vocabulary reaches size k, a hyperparameter.
  • Stopping criterion can alternatively be the number of iterations (merges).
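The token learner above can be sketched directly from those steps. This is a minimal illustrative implementation (function and variable names are our own), using a number of merges as the stopping criterion:

```python
from collections import Counter

END = "_"  # end-of-word marker, so merges never cross word boundaries

def bpe_learn(words, num_merges):
    # Each word starts as a sequence of characters plus the marker.
    corpus = Counter(tuple(w) + (END,) for w in words)
    vocab = {ch for seq in corpus for ch in seq}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for seq, freq in corpus.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)  # greedy: most frequent pair
        merges.append((a, b))
        vocab.add(a + b)
        # Replace every adjacent (a, b) in the corpus with the merged symbol.
        new_corpus = Counter()
        for seq, freq in corpus.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = bpe_learn(["low"] * 5 + ["lower"] * 2 + ["newer"] * 6, 4)
print(merges)
```

On this toy corpus the learner quickly induces a word-final token ending in the marker, mirroring the er/wer_ behaviour the next slides walk through.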

BPE: learner

  • Example : Underscore is the end-of-word marker
  • Most frequent pair is e, r with a total of 9 occurrences (we arbitrarily break ties).

BPE: learner

  • The algorithm now learns the word-final token er
  • The next merge produces token ne

BPE: learner

  • If we continue, the next merges
  • After several iterations, BPE
    • learns entire words
    • most frequent units, useful for tokenizing unknown words

BPE: learner

  • There are two versions of the BPE token segmenter (encoder)
    • apply the merge rules in frequency order over the whole data set
    • for each word, left-to-right, match the longest token from the vocabulary (eager)
  • It is not clear whether the two algorithms always produce the same encoding.
  • Example :
  • Assume training corpus contained words newer, low, but not lower. Typically, the test word [lower] will be encoded by means of tokens [low, er_].

BPE: encoder

  • Encoding is computationally expensive.
  • Many systems use some form of caching:
  • pre-tokenize all the words and save how a word should be tokenized in a dictionary
  • when an unknown word (not in dictionary) is seen
  • apply the encoder to tokenize the word
  • add the tokenization to the dictionary for future reference
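The eager (longest-match) segmenter and the caching scheme can be sketched together; both function names below are illustrative:

```python
def bpe_encode(word, vocab, end="_"):
    # Eager left-to-right segmentation: at each position, take the
    # longest vocabulary token that matches; fall back to one character.
    seq = word + end
    tokens, i = [], 0
    while i < len(seq):
        for j in range(len(seq), i, -1):
            if seq[i:j] in vocab or j == i + 1:
                tokens.append(seq[i:j])
                i = j
                break
    return tokens

cache = {}

def encode_cached(word, vocab):
    # Dictionary cache: tokenize each distinct word only once.
    if word not in cache:
        cache[word] = bpe_encode(word, vocab)
    return cache[word]

vocab = {"low", "er_", "new"}
print(encode_cached("lower", vocab))  # -> ['low', 'er_']
```

This reproduces the slide's example: the unseen word lower is encoded with the learned tokens [low, er_].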

BPE: decoder

  • BPE token merger: to decode, we have to
    • concatenate all the tokens together to get back the words
    • use the end-of-word marker to resolve possible ambiguities
  • Example :
  • The encoded sequence
  • [the_, high, est_, range_, in_, Seattle_]
  • will be decoded as
  • [the, highest, range, in, Seattle]
  • as opposed to
  • [the, high, estrange, in, Seattle]
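Decoding is the easy direction: concatenate, then split words on the end-of-word marker. A one-function sketch:

```python
def bpe_decode(tokens, end="_"):
    # Concatenate tokens, then let the end-of-word markers
    # tell us where the word boundaries were.
    return "".join(tokens).replace(end, " ").split()

print(bpe_decode(["the_", "high", "est_", "range_", "in_", "Seattle_"]))
# -> ['the', 'highest', 'range', 'in', 'Seattle']
```

Without the marker, the same token sequence could wrongly decode to [the, high, estrange, in, Seattle], as the slide shows.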

WordPiece

  • WordPiece is a subword tokenization algorithm used by the large language model BERT.
  • BERT will be presented in a later lecture.
  • Like BPE, WordPiece starts from the initial alphabet and learns merge rules.
  • The main difference is the way the pair A, B is selected for merging (f(X) denotes the frequency of token X): score(A, B) = f(AB) / (f(A) · f(B))

  • The algorithm prioritizes the merging of pairs where the individual parts are less frequent in the vocabulary.
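The standard WordPiece score, f(AB) / (f(A) · f(B)), makes this prioritization explicit; a pair of rare symbols outranks a pair of individually frequent ones even when the latter co-occurs more often. The frequencies below are made-up numbers for illustration:

```python
def wordpiece_score(pair_freq, freq_a, freq_b):
    # score(A, B) = f(AB) / (f(A) * f(B)): dividing by the part
    # frequencies favours merging pairs whose parts are rare.
    return pair_freq / (freq_a * freq_b)

# A rare pair seen 10 times beats a frequent pair seen 50 times,
# because its parts (frequencies 12 and 11) are themselves rare.
print(wordpiece_score(10, 12, 11) > wordpiece_score(50, 400, 300))  # -> True
```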

NLP Pipeline- Feature Engineering

  • The main task is to represent the text as a numeric vector in such a way that the ML algorithm can understand the text attributes
  • There are two most common approaches for Text Representation
    • Classical or Traditional Approach
      • In the traditional approach, we create a vocabulary of unique words, assign a unique id (an integer value) to each word, and then replace each word of a sentence with its unique id
      • Each word of the vocabulary is treated as a feature, so when the vocabulary is large, the feature space becomes very large
      • One Hot Encoding, Bag of Words (BoW), Bag of n-grams, TF-IDF, etc.
    • Neural Approach (Word embedding)
      • The above techniques are not very good for complex tasks like text generation and text summarization
      • Because they cannot capture the contextual meaning of words
      • Word embeddings try to incorporate the contextual meaning of words
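The classical approach described above (vocabulary of unique words with integer ids, count-based vectors) can be sketched in pure Python; libraries such as scikit-learn's `CountVectorizer` do the same with many more options:

```python
from collections import Counter

def build_vocab(corpus):
    # Assign each unique word an integer id, in order of appearance.
    vocab = {}
    for doc in corpus:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def bow_vector(doc, vocab):
    # Bag-of-words: one count per vocabulary word; out-of-vocabulary
    # words in the document are simply dropped.
    counts = Counter(doc.lower().split())
    return [counts.get(w, 0) for w in vocab]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
print(bow_vector("the cat sat on the mat", vocab))  # -> [2, 1, 1, 0, 0]
```

The vector has one dimension per vocabulary word, which is exactly why large vocabularies blow up the feature size.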

NLP Pipeline- Model Building

  • Heuristic-Based Model
    • Lexicon-Based Sentiment Analysis
      • Works by counting Positive and Negative words in sentences
    • Wordnet
      • It has a database of words with synonyms, hyponyms, and meronyms
      • It uses this database for solving rule-based NLP tasks
    • When we have little or no data, we can use a heuristic approach
  • Machine Learning Model
    • Naive Bayes
      • It is a group of classification algorithms based on Bayes’ Theorem
      • It assumes that each feature has an equal and independent contribution to the outcomes
      • Often used for document classification tasks, such as sentiment analysis or spam filtering
    • Support Vector Machine (SVM)
      • It is a popular supervised learning algorithm used for classification and regression analysis
      • It attempts to find the best hyperplane that separates the data points into different classes while maximizing the margin between the hyperplane and the closest data points
      • In the context of NLP, SVM is often used for text classification tasks, such as sentiment analysis or topic classification
  • Deep Learning Model
    • Recurrent neural networks, LSTM, GRU etc.
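The lexicon-based heuristic, counting positive and negative words, needs no training data at all, which is why it suits the low-data setting. The word lists below are toy illustrations; real lexicons (e.g. VADER or SentiWordNet) are far larger:

```python
# Toy sentiment lexicons for illustration.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def lexicon_sentiment(text):
    # Count positive and negative words and compare the totals.
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("a great movie with excellent acting"))  # -> positive
```

Its obvious blind spot, negation ("not great" counts as positive), is one reason to move to the ML and deep learning models above once data is available.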

NLP Pipeline- Evaluation

  • The evaluation metric depends on the type of NLP task or problem
  • Some of the popular methods for evaluation according to the NLP tasks are
    • Classification
      • Accuracy, Precision, Recall, F1-score, AUC
    • Sequence Labelling
      • F1-Score
    • Information Retrieval
      • Mean Reciprocal Rank (MRR), Mean Average Precision (MAP)
    • Text summarization
      • ROUGE- Recall-Oriented Understudy for Gisting Evaluation
    • Regression [Stock Market Price predictions, Temperature Predictions]
      • Root Mean Square Error, Mean Absolute Percentage Error
    • Text Generation
      • BLEU (Bilingual Evaluation Understudy), Perplexity
    • Machine Translation
      • BLEU (Bilingual Evaluation Understudy), METEOR
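The classification metrics at the top of the list follow directly from the counts of true positives, false positives, and false negatives; a self-contained sketch for the binary case:

```python
def classification_scores(y_true, y_pred, positive=1):
    # Precision, recall and F1 for a binary classifier.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(classification_scores([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```

F1 is the harmonic mean of precision and recall, which is why it is also the standard summary metric for sequence labelling.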

NLP Pipeline- Deployment

  • Making a trained NLP model usable in a production setting is known as deployment
  • The precise deployment process can vary based on the platform and use case
    1. Export the trained model
    2. Prepare the input pipeline
    3. Set up the inference service
    4. Monitor performance and scale
    5. Continuous improvement