Using Natural Language Processing and Machine Learning Techniques to Analyze Library Virtual Reference Data
Yongming Wang
The College of New Jersey
ENUG 2021
Online, Oct 2021
Presentation Outline
Project Description
Background
Definitions of NLP and ML on Wikipedia.
Top Applications of NLP and ML
…
Two Types of ML
General Steps of NLP and ML – the pipeline
1. Data Gathering and Preparation
Sample Questions with Yes or No Label
(Questions I consider non-reference: spam; greetings only; noise complaints; printing problems; library services including ILL, reserves, and renewals; non-questions such as “Sorry I was disconnected before”; etc.)
2. Data Preprocessing
In NLP, data preprocessing is an essential step in building a Machine Learning model. It transforms the raw text into a more digestible form so that ML algorithms can perform better and achieve the results we want.
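A minimal sketch of what such a preprocessing step can look like: strip punctuation, lowercase, tokenize, and drop stop words. The function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for illustration; the project itself refers to the NLTK stop words list, which is much longer.

```python
import re
import string

# Hypothetical, hand-picked subset of stop words so the sketch is self-contained.
# The slides reference the NLTK stop words list instead.
STOP_WORDS = {"i", "a", "the", "is", "to", "do", "how", "for", "in", "of", "and", "my"}

def clean_text(text):
    """Remove punctuation, lowercase, tokenize, and drop stop words."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct.lower())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

tokens = clean_text("How do I find articles for my psychology paper?")
print(tokens)  # ['find', 'articles', 'psychology', 'paper']
```

The cleaned token list, rather than the raw question, is what a vectorizer would then turn into numeric features.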
Histogram of question length distributions
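# Assumes `data` is a pandas DataFrame with a 'label' column ('Yes'/'No')
# and a numeric 'question_len' column.
from matplotlib import pyplot
import numpy as np
# Jupyter notebook magic: render plots inline (omit when running as a script)
%matplotlib inline
# 50 bin edges evenly spaced between 0 and 300
bins = np.linspace(0, 300, 50)
# Overlay the two normalized, semi-transparent histograms so the classes can be compared
pyplot.hist(data[data['label']=='Yes']['question_len'], bins, alpha=0.5, density=True, label='Yes')
pyplot.hist(data[data['label']=='No']['question_len'], bins, alpha=0.5, density=True, label='No')
pyplot.legend(loc='upper right')
pyplot.show()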
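The plotting code above reads a `question_len` column that is not defined in the snippet itself. One plausible way to derive it, sketched here as an assumption rather than the project's actual definition, is to count the characters in each question excluding spaces. The column name `question` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real data set is not reproduced here.
data = pd.DataFrame({
    "label": ["No", "Yes"],
    "question": ["Hello", "Where is the library"],
})

# Assumed feature definition: number of characters, not counting spaces.
data["question_len"] = data["question"].apply(lambda q: len(q) - q.count(" "))

print(data[["label", "question_len"]])
```

If the two label groups show clearly different length distributions in the overlay histogram, the feature is likely worth keeping.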
Feature engineering evaluation by overlay histograms
Python Code – Import and preprocess data
Result of preprocessing and feature engineering
NLTK stop words list (179)
3. Text Vectorization
Count Vectorization Example
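As a small illustration of count vectorization, the sketch below uses scikit-learn's `CountVectorizer` on two made-up questions (the sentences are assumptions, not project data). Each row is a question, each column a unique word, and each cell the number of times that word occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up questions used only for illustration.
docs = ["where is the library", "is the library open"]

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(docs)  # sparse document-term matrix

# Column order follows the alphabetically sorted vocabulary.
print(sorted(count_vect.vocabulary_))  # ['is', 'library', 'open', 'the', 'where']
print(X_counts.toarray())
# [[1 1 0 1 1]
#  [1 1 1 1 0]]
```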
The TF-IDF vectorizer is used in this project. (TF-IDF stands for “Term Frequency – Inverse Document Frequency.” Because it weights each word by how rare it is across all questions, it usually gives more informative features than a plain count vectorizer.)
Resulting matrix: 7,263 rows (questions), 6,872 columns (unique words)
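The weighting idea can be sketched with scikit-learn's `TfidfVectorizer` on three made-up questions (assumed for illustration; the project's matrix is far larger). The word "library" appears in every question, so it receives a lower weight than a word such as "hours" that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up questions used only for illustration.
questions = [
    "library hours today",
    "library printing problem",
    "library database search",
]

tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(questions)

print(X_tfidf.shape)  # (3, 7): 3 questions, 7 unique words

# Compare the weights of a common word and a rare word in the first question.
vocab = tfidf_vect.vocabulary_
row0 = X_tfidf[0].toarray().ravel()
print(row0[vocab["library"]], row0[vocab["hours"]])
```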
4. Model Building and Evaluation
Two popular models:
Metrics of model evaluation
Metrics of model evaluation: in our case
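For reference, three metrics commonly used for a Yes/No classifier are accuracy, precision, and recall. The sketch below computes them from scratch on made-up labels; the helper name `classification_metrics` and the choice of "Yes" as the positive class are assumptions, and the slides do not state which metrics were reported.

```python
def classification_metrics(y_true, y_pred, positive="Yes"):
    """Return accuracy, precision, and recall for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted Yes, how many were Yes
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual Yes, how many were found
    return accuracy, precision, recall

# Made-up labels for illustration only.
y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
y_pred = ["Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"]

print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
```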
Random forest model building
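A minimal sketch of random forest model building with scikit-learn, assuming TF-IDF features and a simple train/test split. The twelve sample questions, their labels, and the hyperparameter values are all made up for illustration and are not the project's data or settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Made-up questions and labels (Yes/No), for illustration only.
rf_questions = [
    "how do i find peer reviewed articles on nursing",
    "where can i find statistics on climate change",
    "i need sources for my history paper",
    "which database covers psychology journals",
    "how do i cite a book in apa style",
    "can you help me find case law on privacy",
    "hello", "the printer is not working",
    "can i renew my book", "sorry i was disconnected before",
    "it is too noisy on the second floor", "what time do you close",
]
rf_labels = ["Yes"] * 6 + ["No"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    rf_questions, rf_labels, test_size=0.25, random_state=42, stratify=rf_labels
)

# Fit the vectorizer on the training set only, then reuse it for the test set.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

rf = RandomForestClassifier(n_estimators=50, max_depth=None, random_state=42)
rf.fit(X_train_vec, y_train)
y_pred = rf.predict(X_test_vec)

precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label="Yes", average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With so few samples the scores are not meaningful; the point is only the order of the steps: split, vectorize, fit, predict, evaluate.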
Gradient boosting model building
Result
(Note: cross validation, grid search)
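Cross validation and grid search can be combined in scikit-learn with `GridSearchCV`. The sketch below tunes a gradient boosting classifier inside a pipeline; the sample questions, the parameter grid, and the 3-fold setting are assumptions chosen to keep the example small, not the values used in the project.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up questions and labels (Yes/No), for illustration only.
cv_questions = [
    "how do i find peer reviewed articles on nursing",
    "where can i find statistics on climate change",
    "i need sources for my history paper",
    "which database covers psychology journals",
    "how do i cite a book in apa style",
    "can you help me find case law on privacy",
    "hello", "the printer is not working",
    "can i renew my book", "sorry i was disconnected before",
    "it is too noisy on the second floor", "what time do you close",
]
cv_labels = ["Yes"] * 6 + ["No"] * 6

# Vectorizing inside the pipeline keeps each fold's vocabulary
# restricted to that fold's training portion.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("gb", GradientBoostingClassifier(random_state=42)),
])

# Hypothetical grid: every combination is scored with 3-fold cross validation.
param_grid = {"gb__n_estimators": [20, 50], "gb__max_depth": [2, 3]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(cv_questions, cv_labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```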
What’s next?
Recap
Question for everyone to think about:
Will chatbots someday supplement, or even replace, the people who provide virtual reference service in the library?
Thank you
Questions?