1 of 28

Using Natural Language Processing and Machine Learning Techniques to Analyze Library Virtual Reference Data

Yongming Wang
The College of New Jersey

ENUG 2021
Online, Oct 2021

2 of 28

Presentation Outline

  • Project Description
  • Fundamentals of Natural Language Processing (NLP) and Machine Learning (ML)
    1. Definitions
    2. Applications in real life
    3. Two types of ML
    4. General steps of NLP and ML

  • The Project
    • Data gathering and preparation
    • Data preprocessing
    • Text Vectorization
    • ML model building and evaluation
    • Conclusion and discussion

3 of 28

Project Description

  • Data --- Transcripts of eight years of chat reference transactions (2014–2021)
  • Research Question --- Chat users ask many types of questions, but they fall into two big categories: reference questions and non-reference questions (including spam). Is it possible to build a smart filter that differentiates reference questions from non-reference questions?
  • Method and Goal --- Using the data available, I want to build a classification model with NLP and ML techniques. The model could then be used to predict the category (ref. Q or non-ref. Q) of new questions asked by library patrons, potentially improving the efficiency and effectiveness of the library's virtual reference service.
  • Language used --- Python
  • Development tool used --- Jupyter Notebook

4 of 28

Background

  • AI and Machine learning are everywhere

  • My knowledge and skills in computer science, and certificate in data science

  • How to apply all these to the library setting?

  • Project started in 2019

5 of 28

Definitions of NLP and ML (from Wikipedia)

    • Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

    • Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions …

    • Relationship between NLP and ML?
      • In practice, modern NLP usually relies on ML, but ML doesn't have to involve NLP.

6 of 28

Top Applications of NLP and ML

  • Sentiment analysis (e.g. social media)
  • Topic modeling or Text summarization (e.g. DH)
  • Classification/categorization (e.g. email spam filter)
  • Speech recognition (e.g. Google Assistant)
  • Voice assistants and Chat bots (e.g. Amazon Alexa)
  • Auto correct and auto completion (e.g. Primo search box)

7 of 28

Two Types of ML

    • Supervised learning (e.g. email spam filter)

    • Unsupervised learning (e.g. text summarization)

8 of 28

General Steps of NLP and ML – the pipeline

  1. Data Gathering
  2. Data Preprocessing
    1. Removing punctuation
    2. Changing all text to lowercase
    3. Tokenization
    4. Removing stop words
    5. Stemming or lemmatization
    6. Feature engineering (Optional)
  3. Text Vectorization
  4. Model Building, Evaluation, Optimization
  5. Implementation

9 of 28

1. Data Gathering and Preparation

  • Downloaded the initial questions from the chat transcripts into an Excel file (about 8,000 questions)
  • Applied and got the college IRB approval
  • Removed blank rows and repeated questions, etc.
  • Manually labelled each question as a reference question or a non-reference question
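As a sketch, the cleanup step above might look like this in pandas; the column names and sample rows here are hypothetical, not the project's actual schema:

```python
import pandas as pd

# Hypothetical miniature of the raw export: blank rows and repeated
# questions mixed in with usable, labelled questions.
raw = pd.DataFrame({
    "question": [
        "What are the library's summer hours?",
        "What are the library's summer hours?",   # repeated question
        None,                                     # blank row
        "Where can I find peer-reviewed articles on climate change?",
    ],
    "label": ["No", "No", None, "Yes"],
})

# Remove blank rows, then repeated questions
data = raw.dropna(subset=["question"]).drop_duplicates(subset=["question"])
print(len(data))  # 2 unique, non-blank questions remain
```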

10 of 28

Sample Questions with Yes or No Label (what I consider non-reference questions: spam, greetings only, noise complaints, printing problems, library services including ILL, reserves, and renewal, and non-questions such as "Sorry I was disconnected before," etc.)

11 of 28

2. Data Preprocessing

In NLP, data preprocessing is an essential step in building a Machine Learning model. It transforms the raw text into a more digestible form so that ML algorithms can perform better and achieve the results we want.

    • Removing punctuation
    • Changing all text to lowercase
    • Tokenization (splitting each sentence into a list of individual words, typically on whitespace)
    • Removing stop words (e.g. remove "the," "a," "and," …; keep only the pivotal words for the model)
    • Stemming or lemmatizing (we want just the semantic core of a group of related words; in other words, this explicitly correlates words with similar meanings, e.g. run, running, runner → run; library, libraries → librari; goose, geese → goose)
    • Feature engineering (creating a new feature or transforming a current one. I presume that the lengths of reference questions and non-reference questions differ somehow, so I will create a new question-length feature and hope it helps the model distinguish between ref. Qs and non-ref. Qs.)
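The preprocessing steps above can be sketched in plain Python. The tiny stop-word set and the crude suffix stripper below are illustrative stand-ins for NLTK's 179-word stop-word list and a real stemmer such as NLTK's PorterStemmer:

```python
import re
import string

# Illustrative stand-ins (NOT the project's actual resources):
# a small stop-word subset and a naive suffix-stripping "stemmer".
STOPWORDS = {"the", "a", "an", "and", "i", "can", "in", "are", "what"}

def crude_stem(word):
    # Strip a few common suffixes; a real stemmer has many more rules.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    tokens = re.split(r"\s+", text.strip())                            # tokenize on whitespace
    tokens = [t for t in tokens if t and t not in STOPWORDS]           # remove stop words
    return [crude_stem(t) for t in tokens]                             # stem

print(preprocess("Can I print in color in the Library?"))  # ['print', 'color', 'library']
```

Even this crude stripper reproduces the slide's example: "libraries" becomes "librari".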

12 of 28

Histogram of question length distributions

from matplotlib import pyplot
import numpy as np

%matplotlib inline

# 'data' is the labelled DataFrame; 'question_len' is the engineered length feature
bins = np.linspace(0, 300, 50)

pyplot.hist(data[data['label']=='Yes']['question_len'], bins, alpha=0.5, density=True, label='Yes')
pyplot.hist(data[data['label']=='No']['question_len'], bins, alpha=0.5, density=True, label='No')
pyplot.legend(loc='upper right')
pyplot.show()

Feature engineering evaluation by overlay histograms

13 of 28

Python Code – Import and preprocess data

14 of 28

Result of preprocessing and feature engineering

15 of 28

NLTK stop words list (179)

  • ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

16 of 28

3. Text Vectorization

  • Purpose: to transform the text data into numeric data so that the ML algorithms (and Python) can understand and use it to build the model.
  • How is it conducted?
    • Each text becomes an n-dimensional vector of numerical features that represents the object.
    • For example:
      • 1. Hello, What are the Library's Summer Hours?
      • 2. Can I print in color in the library? and if so, how?

  • There are many types of text vectorization:
    • 3 basic and popular ones are: Count, n-Grams, TF-IDF
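The idea behind count vectorization can be shown by hand on the two example questions (lowercased, punctuation stripped, and with preprocessing otherwise omitted for brevity):

```python
from collections import Counter

docs = [
    "hello what are the librarys summer hours",
    "can i print in color in the library and if so how",
]

# The shared vocabulary forms the columns of the document-term matrix
vocab = sorted(set(" ".join(docs).split()))

# Each document becomes a vector of word counts over that vocabulary
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
print(vocab)
print(vectors)
```

Note that "librarys" and "library" count as different tokens here, which is exactly why stemming or lemmatizing matters before vectorization. In practice a library such as scikit-learn does this step for you.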

17 of 28

Count Vectorization Example

18 of 28

TF-IDF vectorizer is used in this project. (TF-IDF stands for "Term Frequency – Inverse Document Frequency." Because terms are weighted, it is generally more accurate than a count vectorizer.)

Result: 7,263 rows (questions), 6,872 columns (unique words)
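A minimal by-hand sketch of the TF-IDF weighting on a toy two-document corpus. Real vectorizers (e.g. scikit-learn's TfidfVectorizer) add smoothing and normalization, so their exact weights differ; the idea is the same.

```python
import math

# Toy corpus of already-preprocessed token lists
docs = [
    ["summer", "hours", "library"],
    ["print", "color", "library"],
]
N = len(docs)

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)         # term frequency within the document
    df = sum(1 for d in docs if word in d)  # number of documents containing the word
    idf = math.log(N / df)                  # inverse document frequency
    return tf * idf

# "library" occurs in every document, so it carries no weight;
# "summer" is specific to one document, so it gets a positive weight.
print(tfidf("library", docs[0]))  # 0.0
print(tfidf("summer", docs[0]))
```

This is why TF-IDF beats raw counts for classification: words that appear everywhere (and so distinguish nothing) are down-weighted automatically.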

19 of 28

4. Model Building and Evaluation

Two popular models:

  • Random Forest
    • Builds many decision trees at the same time
    • Parallel computing
    • Decides by majority vote

  • Gradient Boosting
    • Builds decision trees one at a time
    • Each new tree helps to correct errors made by the previously trained trees
    • Boosting (optimization) by reward or penalty
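Assuming scikit-learn (which the parameter names on the results slide suggest, though this is an inference), building both models might look like the sketch below. The synthetic data stands in for the project's TF-IDF matrix and Yes/No labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; in the project, X would be the 7,263 x 6,872 TF-IDF matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random forest: many trees built in parallel, prediction by majority vote
rf = RandomForestClassifier(n_estimators=150, max_depth=None, random_state=42)
rf.fit(X_train, y_train)

# Gradient boosting: trees built sequentially, each correcting the last
gb = GradientBoostingClassifier(n_estimators=150, random_state=42)
gb.fit(X_train, y_train)

print(rf.score(X_test, y_test), gb.score(X_test, y_test))
```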

20 of 28

Metrics of model evaluation

  • The confusion matrix
    • True Positive (TP), True Negative (TN),
    • False Positive (FP), False Negative(FN)

    • Accuracy = (TP + TN) / (TP + TN + FP + FN)
    • Precision = TP / (TP + FP)
    • Recall = TP / (TP + FN)

  • The key is: what do you consider to be positive, what to be negative

21 of 28

Metrics of model evaluation: in our case

    • In our case: ref. question (Yes label) = positive; non-ref. question (No label) = negative

    • Accuracy
      • Correctly predicted questions (both ref. and non-ref.) / total questions
    • Precision
      • Correctly predicted ref. / all questions predicted as ref. (i.e., correct ref. predictions + non-ref. questions incorrectly predicted as ref.)
    • Recall
      • Correctly predicted ref. / all actual ref. questions (i.e., correct ref. predictions + ref. questions incorrectly predicted as non-ref.)

    • Now suppose we have 100 questions in total. We predicted 90 correctly: 60 ref. questions and 30 non-ref. questions. Of the 10 we predicted wrong, 2 are ref. questions we predicted as non-ref. (false negatives) and 8 are non-ref. questions we predicted as ref. (false positives).
      • Accuracy = 90 / 100 = 0.90
      • Precision = 60 / (60 + 8) ≈ 0.88
      • Recall = 60 / (60 + 2) ≈ 0.97
    • Which one is more important, precision or recall? That depends on the real problem. In our case, precision is more important: we want the highest precision possible, that is, the fewest non-reference questions incorrectly predicted as reference questions.
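The metrics for this worked example can be checked directly from the four confusion-matrix counts:

```python
# TP = ref. questions correctly predicted as ref., TN = non-ref. correct,
# FP = non-ref. questions predicted as ref., FN = ref. questions predicted as non-ref.
TP, TN, FP, FN = 60, 30, 8, 2

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # 0.9 0.88 0.97
```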

22 of 28

Random forest model building

23 of 28

Gradient boosting model building

24 of 28

Result

  • Model selected:
    • Random Forest Classifier
    • n_estimators = 150
    • max_depth = None (unlimited depth)
  • Result:
    • Precision: 91.4%
    • Recall: 96.4%
    • Accuracy: 91.2%

(Note: cross-validation, grid search)
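A hypothetical sketch of the cross-validation and grid search noted above, assuming scikit-learn's GridSearchCV; the parameter grid and scoring choice here are assumptions, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data in place of the project's TF-IDF matrix and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical grid; precision is scored because it matters most here
param_grid = {"n_estimators": [50, 150], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="precision")
search.fit(X, y)

print(search.best_params_)
```

GridSearchCV tries every parameter combination, cross-validates each one, and keeps the best, which is how a choice like n_estimators = 150 gets justified.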

25 of 28

What’s next?

    • Implementation?

    • Possible next step --- Build a multi-class model to classify questions into multiple categories, for example article search, book search, noise complaint, library services, spam, etc.

26 of 28

Recap

  • Fundamentals of NLP and ML
    • Types of ML: Supervised learning vs. unsupervised learning
    • Applications of NLP and ML: sentiment analysis, classification, topic modeling, …
  • ML pipeline
    • Data gathering
    • Data Preprocessing (punctuation, lower case, stop words, tokenize, stemming or lemmatizing)
    • Feature engineering (optional)
    • Text Vectorization (3 popular types: count, n-grams, TF-IDF)
    • Model building: training, testing, and evaluating
  • The Project
    • Two models: random forest model, gradient boosting model
    • Evaluation metrics: accuracy, precision, recall

27 of 28

Question for everyone to think about:

Will chatbots someday supplement, or even replace, real people in providing the library's virtual reference service?

28 of 28

Thank you

Questions?

  • Yongming Wang, Systems Librarian, The College of New Jersey (TCNJ)
  • wangyo@tcnj.edu