1 of 30

Introduction to

Natural Language Processing

Jian Tao

jtao@tamu.edu

09/19/2023

2 of 30

Part I. Basics of Natural Language Processing

Credit: practicalnlp.ai

3 of 30

Introduction

  • Natural language processing (NLP) is about utilizing computers to process and analyze natural language data.
  • History
    • Symbolic NLP (1950s – early 1990s): rule-based methods
    • Statistical NLP (1990s-2010s): statistical methods
    • Neural NLP (2010s - present): neural networks
      • Large Language Models (2018 - present)

4 of 30

A Brief History of NLP

5 of 30

Applications of NLP

NLP is commonly used in:

  • Text and speech processing
    • Optical Character Recognition (OCR), speech recognition, text-to-speech, word segmentation, translation, text generation
  • Morphological analysis
    • Lemmatization, part-of-speech tagging
  • Syntactic analysis
    • Grammar induction, sentence breaking, parsing
  • Semantics
    • Named entity recognition, semantic parsing
  • Pragmatics
    • Sentiment analysis, topic analysis, machine translation

Blocks of language and example applications:

  • Speech & Sounds (phonemes): speech to text, speaker identification, text to speech
  • Words (morphemes & lexemes): tokenization, word embeddings, POS tagging
  • Phrases & Sentences (syntax): parsing, entity extraction, relation extraction
  • Meaning (context): summarization, topic modeling, sentiment analysis

6 of 30

Applications of NLP - Continued

Natural Language Processing (NLP) has a broad range of applications across various domains and industries.

  • Search Engines: NLP powers search engines to improve the relevance of search results.
  • Speech Recognition: Google's Voice Search and Apple's Siri use NLP to convert spoken language into text.
  • Machine Translation: Tools like Google Translate use NLP to translate text between different languages in real time.
  • Chatbots and Virtual Assistants: Chatbots on websites and virtual assistants like Siri, Alexa, and Cortana use NLP to support human-computer interaction.

7 of 30

Applications of NLP - Continued

Some more applications:

  • Spell and Grammar Check: NLP is behind tools like Grammarly that check and correct the user's grammar and spelling.
  • Information Extraction: Extracting structured information, such as names, dates, organizations, or places, from unstructured text documents.
  • Recommendation Systems: NLP helps content platforms (like Netflix or news websites) understand, categorize, and recommend content.
  • Medical and Legal: Extracting and analyzing data from health or legal records.
  • E-learning and Tutoring: Personalized tutoring systems can provide feedback and assistance based on student-written content.

8 of 30

NLP - Classical Approach

The classical pipeline moves from raw text to predictions in stages:

  • Pre-processing: tokenization, POS tagging, stopword removal
  • Feature Extraction: turning text into numeric features
  • Modeling and Inference: training a statistical model and applying it to new text
  • Output: sentiment classification, entity extraction, translation, topic modeling, .......

9 of 30

NLP - Deep Learning-Based Approach

In the deep learning-based approach, hand-crafted feature extraction is replaced by learned representations:

  • Documents → Pre-Processing → Dense Embeddings → Hidden Layers → Output Units → Output
  • Outputs: sentiment classification, entity extraction, translation, topic modeling, .......

10 of 30

Pipeline of NLP Applications

Data Acquisition → Text Cleaning → Pre-Processing → Feature Engineering → Modeling → Evaluation → Deployment → Monitoring & Model Updating (feeding back into improving the model)

11 of 30

Pre-processing for NLP

A basic pre-processing pipeline:

  • Text → sentence segmentation → sentences
  • Each sentence → tokenization → tokens
  • Tokens → lowercasing → removal of punctuation → stemming / lemmatization
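The steps above can be sketched in plain Python. The hands-on session uses NLTK; here, a crude regex sentence splitter and a naive suffix stemmer (both simplified stand-ins, not NLTK's algorithms) keep the sketch self-contained:

```python
import re
import string

def preprocess(text):
    """Minimal pre-processing: sentence segmentation, lowercasing,
    punctuation removal, tokenization, and naive suffix stemming."""
    # Crude sentence segmentation: split after ., !, or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sent in sentences:
        sent = sent.lower()                                               # lowercasing
        sent = sent.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
        tokens = sent.split()                                             # tokenization
        # Naive stemming: strip a few common English suffixes from longer words
        stems = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
        processed.append(stems)
    return processed

print(preprocess("The cats were running. They stopped!"))
```

A real pipeline would use NLTK's `sent_tokenize`, `word_tokenize`, and `PorterStemmer` (or a lemmatizer) instead of these toy rules.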

12 of 30

Pre-processing - an Example

Credit: practicalnlp.ai

* POS: part-of-speech

13 of 30

Text Processing - Bag of Words

Bag of Words

  • The text is represented as a bag (collection) of words while ignoring the order and context. The primary assumption is that the text belonging to a given class in the dataset is characterized by a unique set of words.

Bag of N-Grams

  • The text is broken into chunks of n contiguous words (or tokens). Each chunk is called an n-gram.
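Both representations reduce to counting. A minimal sketch using only the standard library (a real workflow would typically use scikit-learn's `CountVectorizer`):

```python
from collections import Counter

def bag_of_words(tokens):
    """Unordered word counts: order and context are discarded."""
    return Counter(tokens)

def bag_of_ngrams(tokens, n=2):
    """Counts of each chunk of n contiguous tokens (n-grams)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the dog bit the man".split()
print(bag_of_words(tokens))      # 'the' occurs twice; word order is lost
print(bag_of_ngrams(tokens, 2))  # bigrams like ('the', 'dog') keep local order
```

Note that the bag of words cannot distinguish "the dog bit the man" from "the man bit the dog", while the bags of bigrams differ.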

14 of 30

Text Processing - TF-IDF

Term Frequency–Inverse Document Frequency (TF-IDF)

  • TF (term frequency) measures how often a term or word occurs in a given document.
  • IDF (inverse document frequency) measures the importance of the term across a corpus.
  • TF-IDF aims to quantify the importance of a given term t relative to other words in the document d and in the corpus.
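One common (unsmoothed) variant can be computed by hand: tf(t, d) = count of t in d divided by |d|, and idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A minimal sketch (libraries such as scikit-learn use smoothed formulas, so exact values differ):

```python
import math

def tf(term, doc):
    """Term frequency: how often the term occurs in this document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: terms that are rare across the corpus score higher."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
# "the" appears in every document, so idf = log(3/3) = 0 and its tf-idf is 0
assert tf_idf("the", corpus[0], corpus) == 0.0
print(tf_idf("cat", corpus[0], corpus))  # positive: "cat" is informative here
```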

15 of 30

Text Classification - Common Classifiers

  • Naive Bayes Classifier:

A probabilistic classifier that uses Bayes’ theorem to classify texts based on the evidence seen in training data.

  • Logistic Regression:

A machine learning method to train a linear separator between classes in the training data with the aim of maximizing the probability of the data.

  • Support Vector Machine:

A support vector machine (SVM) is a discriminative classifier like logistic regression. It looks for an optimal hyperplane in a higher dimensional space, to separate the classes by a maximum possible margin.

  • Deep Neural Network:

A family of machine learning algorithms utilizing different types of multilayered neural networks, such as CNNs, RNNs, and LSTMs.
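To make the first of these concrete, here is a from-scratch multinomial Naive Bayes classifier with add-one (Laplace) smoothing, trained on a tiny hypothetical sentiment dataset. This is an illustrative sketch; in practice one would use scikit-learn's `MultinomialNB` or NLTK's `NaiveBayesClassifier`:

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        # Log prior: P(class) estimated from label frequencies
        self.priors = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
            self.vocab.update(doc)
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}

    def predict(self, doc):
        def score(c):
            s = self.priors[c]
            for w in doc:
                # P(w|c) with Laplace smoothing; unseen words get a small probability
                s += math.log((self.word_counts[c][w] + 1) /
                              (self.totals[c] + len(self.vocab)))
            return s
        return max(self.classes, key=score)

# Hypothetical toy training data (tokenized documents and sentiment labels)
docs = [["great", "movie"], ["awful", "movie"], ["great", "fun"], ["boring", "awful"]]
labels = ["pos", "neg", "pos", "neg"]
clf = NaiveBayesTextClassifier()
clf.fit(docs, labels)
print(clf.predict(["great", "movie"]))
```

The classifier picks the class maximizing log P(class) + Σ log P(word|class), the "evidence seen in training data" mentioned above.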

16 of 30

Text Classification - a Typical Workflow

  1. Collect or create a labeled dataset suitable for the task.

  2. Split the dataset into two (training and test) or three parts: training, validation (i.e., development), and test sets, then decide on evaluation metric(s).

  3. Transform raw text into feature vectors.

  4. Train a classifier using the feature vectors and the corresponding labels from the training set.

  5. Using the evaluation metric(s) from Step 2, benchmark the model performance on the test set.

  6. Deploy the model to serve the real-world use case and monitor its performance.
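The split-and-evaluate part of this workflow can be sketched in a few lines of standard-library Python (a seeded shuffle for the split, plain accuracy as the metric; real projects typically use scikit-learn's `train_test_split` and richer metrics):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle and split a labeled dataset (a list of (text, label) pairs)."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

def accuracy(predictions, gold):
    """A simple evaluation metric: fraction of correct predictions."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical labeled dataset of ten documents
dataset = [(f"doc {i}", i % 2) for i in range(10)]
train, test = train_test_split(dataset)
print(len(train), len(test))
```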

17 of 30

Large Language Models

Image credit: Hugging Face

Large language models (LLMs) are a category of machine learning models specifically designed to handle, generate, and understand text data at a vast scale.

18 of 30

Large Language Models - Landscape

19 of 30

Llama 2 by Meta

20 of 30

Llama 2 by Meta

21 of 30

Supervised Fine-tuning for LLMs

From Llama to Code Llama:

  • A base LLM is pretrained on an Internet-scale dataset, then fine-tuned with supervised learning on a specific knowledge base.
  • Base LLM: Meta Llama 2, trained on documents from the Internet.
  • Fine-tuned LLM: trained on code-specific datasets, yielding a model that can generate code in popular programming languages like Python, C++, Java, PHP, Typescript (Javascript), C#, Bash and more.

22 of 30

Supervised Fine-tuning for LLMs

From Code Llama to a Customized Llama:

  • The same recipe applies again: start from a base LLM and fine-tune it with supervised learning on a new specific knowledge base.
  • Base LLM: Meta Code Llama, trained on code from GitHub.
  • Fine-tuned LLM: trained on code in "My New Language X", yielding a model that can generate code in X.

23 of 30

Using LLMs

24 of 30

Part II. Hands-on Session:

Basic NLP with NLTK

25 of 30

Appendix

26 of 30

LICENSE AND DISCLAIMER

MIT License

Copyright (c) 2022 Jian Tao

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

27 of 30

GitHub Repository for the Webinars

28 of 30

Jupyter Notebook and JupyterLab

Jupyter Notebook

JupyterLab

29 of 30

Google Colaboratory

30 of 30

Google Colaboratory

Search GitHub for the repository jtao/dswebinar