1 of 28

Using Natural Language Processing and Machine Learning Techniques to Analyze Library Virtual Reference Data

Yongming Wang
The College of New Jersey

ENUG 2021
Online, Oct 2021

2 of 28

Presentation Outline

  • Project Description
  • Fundamentals of Natural Language Processing (NLP) and Machine Learning (ML)
    1. Definitions
    2. Applications in real life
    3. Two types of ML
    4. General steps of NLP and ML

  • The Project
    • Data gathering and preparation
    • Data preprocessing
    • Text Vectorization
    • ML model building and evaluation
    • Conclusion and discussion

3 of 28

Project Description

  • Data --- Transcripts of eight years of chat reference transactions (2014–2021)
  • Research Question --- Chat users ask many types of questions, but they fall into two big categories: reference questions and non-reference questions (including spam). Is it possible to build a smart filter that differentiates reference questions from non-reference questions?
  • Method and Goal --- Using the data available, I want to build a classification model with NLP and ML techniques. The model could then be used to predict the category (ref. Q or non-ref. Q) of new questions asked by library patrons, potentially improving the efficiency and effectiveness of the library's virtual reference service.
  • Language used --- Python
  • Development tool used --- Jupyter Notebook

4 of 28

Background

  • AI and Machine learning are everywhere

  • My knowledge and skills in computer science, and certificate in data science

  • How to apply all these to the library setting?

  • Project started in 2019

5 of 28

Definitions of NLP and ML (from Wikipedia)

    • Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

    • Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions …

    • Relationship between NLP and ML?
      • In practice, modern NLP usually relies on ML, but ML doesn't have to involve NLP.

6 of 28

Top Applications of NLP and ML

  • Sentiment analysis (e.g. social media)
  • Topic modeling or Text summarization (e.g. DH)
  • Classification/categorization (e.g. email spam filter)
  • Speech recognition (e.g. Google Assistant)
  • Voice assistants and Chat bots (e.g. Amazon Alexa)
  • Auto correct and auto completion (e.g. Primo search box)

7 of 28

Two Types of ML

    • Supervised learning (e.g. email spam filter)

    • Unsupervised learning (e.g. text summarization)

8 of 28

General Steps of NLP and ML – the pipeline

  1. Data Gathering
  2. Data Preprocessing
    1. Removing punctuation
    2. Changing all text to lowercase
    3. Tokenization
    4. Removing stop words
    5. Stemming or lemmatization
    6. Feature engineering (Optional)
  3. Text Vectorization
  4. Model Building, Evaluation, Optimization
  5. Implementation

9 of 28

1. Data Gathering and Preparation

  • Downloaded the initial questions from the chat transcripts into an Excel file (about 8,000 questions)
  • Applied and got the college IRB approval
  • Removed blank rows and repeated questions, etc.
  • Manually labelled each question as a reference question or a non-reference question
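As a sketch, the cleanup step above might look like this in pandas; the column names and sample rows here are hypothetical, not the project's actual schema:

```python
import pandas as pd

# Hypothetical miniature of the raw export: blank rows and repeated
# questions mixed in with usable, labelled questions.
raw = pd.DataFrame({
    "question": [
        "What are the library's summer hours?",
        "What are the library's summer hours?",   # repeated question
        None,                                     # blank row
        "Where can I find peer-reviewed articles on climate change?",
    ],
    "label": ["No", "No", None, "Yes"],
})

# Remove blank rows, then repeated questions
data = raw.dropna(subset=["question"]).drop_duplicates(subset=["question"])
print(len(data))  # 2 unique, non-blank questions remain
```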

10 of 28

Sample Questions with Yes or No Label (what I consider non-reference questions: spam, greetings only, noise complaints, printing problems, library services including ILL, reserves, and renewal, and non-questions such as "Sorry I was disconnected before," etc.)

11 of 28

2. Data Preprocessing

In NLP, data preprocessing is an essential step in building a Machine Learning model. It transforms the raw text into a more digestible form so that ML algorithms can perform better and achieve the results we want.

    • Removing punctuation
    • Changing all text to lowercase
    • Tokenization (splitting each sentence into a list of individual words, typically on whitespace)
    • Removing stop words (e.g. remove "the," "a," "and," …; keep only the pivotal words for the model)
    • Stemming or lemmatizing (we want just the semantic core of a group of related words; in other words, this explicitly correlates words with similar meanings, e.g. run, running, runner → run; library, libraries → librari; goose, geese → goose)
    • Feature engineering (creating a new feature or transforming a current one. I presume that the lengths of reference questions and non-reference questions differ somehow, so I will create a new question-length feature and hope it helps the model distinguish between ref. Qs and non-ref. Qs.)
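The preprocessing steps above can be sketched in plain Python. The tiny stop-word set and the crude suffix stripper below are illustrative stand-ins for NLTK's 179-word stop-word list and a real stemmer such as NLTK's PorterStemmer:

```python
import re
import string

# Illustrative stand-ins (NOT the project's actual resources):
# a small stop-word subset and a naive suffix-stripping "stemmer".
STOPWORDS = {"the", "a", "an", "and", "i", "can", "in", "are", "what"}

def crude_stem(word):
    # Strip a few common suffixes; a real stemmer has many more rules.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    tokens = re.split(r"\s+", text.strip())                            # tokenize on whitespace
    tokens = [t for t in tokens if t and t not in STOPWORDS]           # remove stop words
    return [crude_stem(t) for t in tokens]                             # stem

print(preprocess("Can I print in color in the Library?"))  # ['print', 'color', 'library']
```

Even this crude stripper reproduces the slide's example: "libraries" becomes "librari".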

12 of 28

Histogram of question length distributions

from matplotlib import pyplot
import numpy as np

%matplotlib inline

# 'data' is the labelled DataFrame; 'question_len' is the engineered length feature
bins = np.linspace(0, 300, 50)

pyplot.hist(data[data['label']=='Yes']['question_len'], bins, alpha=0.5, density=True, label='Yes')
pyplot.hist(data[data['label']=='No']['question_len'], bins, alpha=0.5, density=True, label='No')
pyplot.legend(loc='upper right')
pyplot.show()

Feature engineering evaluation by overlay histograms

13 of 28

Python Code – Import and preprocess data

14 of 28

Result of preprocessing and feature engineering

15 of 28

NLTK stop words list (179)

  • ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

16 of 28

3. Text Vectorization

  • Purpose: to transform the text data into numeric data so that the ML algorithms (and Python) can understand and use it to build the model.
  • How is it conducted?
    • Each text becomes an n-dimensional vector of numerical features that represents the object.
    • For example:
      • 1. Hello, What are the Library's Summer Hours?
      • 2. Can I print in color in the library? and if so, how?

  • There are many types of text vectorization:
    • 3 basic and popular ones are: Count, n-Grams, TF-IDF
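The idea behind count vectorization can be shown by hand on the two example questions (lowercased, punctuation stripped, and with preprocessing otherwise omitted for brevity):

```python
from collections import Counter

docs = [
    "hello what are the librarys summer hours",
    "can i print in color in the library and if so how",
]

# The shared vocabulary forms the columns of the document-term matrix
vocab = sorted(set(" ".join(docs).split()))

# Each document becomes a vector of word counts over that vocabulary
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
print(vocab)
print(vectors)
```

Note that "librarys" and "library" count as different tokens here, which is exactly why stemming or lemmatizing matters before vectorization. In practice a library such as scikit-learn does this step for you.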

17 of 28

Count Vectorization Example

18 of 28

TF-IDF vectorizer is used in this project. (TF-IDF stands for "Term Frequency – Inverse Document Frequency." Because terms are weighted, it is generally more accurate than a count vectorizer.)

Result: 7,263 rows (questions), 6,872 columns (unique words)
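A minimal by-hand sketch of the TF-IDF weighting on a toy two-document corpus. Real vectorizers (e.g. scikit-learn's TfidfVectorizer) add smoothing and normalization, so their exact weights differ; the idea is the same.

```python
import math

# Toy corpus of already-preprocessed token lists
docs = [
    ["summer", "hours", "library"],
    ["print", "color", "library"],
]
N = len(docs)

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)         # term frequency within the document
    df = sum(1 for d in docs if word in d)  # number of documents containing the word
    idf = math.log(N / df)                  # inverse document frequency
    return tf * idf

# "library" occurs in every document, so it carries no weight;
# "summer" is specific to one document, so it gets a positive weight.
print(tfidf("library", docs[0]))  # 0.0
print(tfidf("summer", docs[0]))
```

This is why TF-IDF beats raw counts for classification: words that appear everywhere (and so distinguish nothing) are down-weighted automatically.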

19 of 28

4. Model Building and Evaluation

Two popular models:

  • Random Forest
    • Builds many decision trees at the same time
    • Parallel computing
    • Decides by majority vote

  • Gradient Boosting
    • Builds decision trees one at a time
    • Each new tree helps to correct errors made by the previously trained trees
    • Boosting (optimization) by reward or penalty
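Assuming scikit-learn (which the parameter names on the results slide suggest, though this is an inference), building both models might look like the sketch below. The synthetic data stands in for the project's TF-IDF matrix and Yes/No labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; in the project, X would be the 7,263 x 6,872 TF-IDF matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random forest: many trees built in parallel, prediction by majority vote
rf = RandomForestClassifier(n_estimators=150, max_depth=None, random_state=42)
rf.fit(X_train, y_train)

# Gradient boosting: trees built sequentially, each correcting the last
gb = GradientBoostingClassifier(n_estimators=150, random_state=42)
gb.fit(X_train, y_train)

print(rf.score(X_test, y_test), gb.score(X_test, y_test))
```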

20 of 28

Metrics of model evaluation

  • The confusion matrix
    • True Positive (TP), True Negative (TN),
    • False Positive (FP), False Negative(FN)

    • Accuracy = (TP + TN) / (TP + TN + FP + FN)
    • Precision = TP / (TP + FP)
    • Recall = TP / (TP + FN)

  • The key is: what do you consider to be positive, what to be negative

21 of 28

Metrics of model evaluation: in our case

    • In our case: ref. question (Yes label) = positive; non-ref. question (No label) = negative

    • Accuracy
      • Correctly predicted questions (both ref. and non-ref.) / total questions
    • Precision
      • Correctly predicted ref. / all questions predicted as ref. (i.e., correct ref. predictions + non-ref. questions incorrectly predicted as ref.)
    • Recall
      • Correctly predicted ref. / all actual ref. questions (i.e., correct ref. predictions + ref. questions incorrectly predicted as non-ref.)

    • Now suppose we have 100 questions in total. We predicted 90 correctly: 60 ref. questions and 30 non-ref. questions. Of the 10 we predicted wrong, 2 are ref. questions we predicted as non-ref. (false negatives) and 8 are non-ref. questions we predicted as ref. (false positives).
      • Accuracy = 90 / 100 = 0.90
      • Precision = 60 / (60 + 8) ≈ 0.88
      • Recall = 60 / (60 + 2) ≈ 0.97
    • Which one is more important, precision or recall? That depends on the real problem. In our case, precision is more important: we want the highest precision possible, that is, the fewest non-reference questions incorrectly predicted as reference questions.
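The metrics for this worked example can be checked directly from the four confusion-matrix counts:

```python
# TP = ref. questions correctly predicted as ref., TN = non-ref. correct,
# FP = non-ref. questions predicted as ref., FN = ref. questions predicted as non-ref.
TP, TN, FP, FN = 60, 30, 8, 2

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # 0.9 0.88 0.97
```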

22 of 28

Random forest model building

23 of 28

Gradient boosting model building

24 of 28

Result

  • Model selected:
    • Random Forest Classifier
    • n_estimators = 150
    • max_depth = None (unlimited depth)
  • Result:
    • Precision: 91.4%
    • Recall: 96.4%
    • Accuracy: 91.2%

(Note: cross-validation, grid search)
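A hypothetical sketch of the cross-validation and grid search noted above, assuming scikit-learn's GridSearchCV; the parameter grid and scoring choice here are assumptions, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data in place of the project's TF-IDF matrix and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical grid; precision is scored because it matters most here
param_grid = {"n_estimators": [50, 150], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="precision")
search.fit(X, y)

print(search.best_params_)
```

GridSearchCV tries every parameter combination, cross-validates each one, and keeps the best, which is how a choice like n_estimators = 150 gets justified.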

25 of 28

What’s next?

    • Implementation?

    • Possible next step --- Build a multi-class model to classify questions into multiple categories, for example article search, book search, noise complaint, library services, spam, etc.

26 of 28

Recap

  • Fundamentals of NLP and ML
    • Types of ML: Supervised learning vs. unsupervised learning
    • Applications of NLP and ML: sentiment analysis, classification, topic modeling, …
  • ML pipeline
    • Data gathering
    • Data Preprocessing (punctuation, lower case, stop words, tokenize, stemming or lemmatizing)
    • Feature engineering (optional)
    • Text Vectorization (3 popular types: count, n-grams, TF-IDF)
    • Model building: training, testing, and evaluating
  • The Project
    • Two models: random forest model, gradient boosting model
    • Evaluation metrics: accuracy, precision, recall

27 of 28

Question for everyone to think about:

Will chatbots someday supplement, or even replace, real people in providing the library's virtual reference service?

28 of 28

Thank you

Questions?

  • Yongming Wang, Systems Librarian, The College of New Jersey (TCNJ)
  • wangyo@tcnj.edu