Bagging to BERT
A tour of Natural Language Processing
Prepared for PyData NYC
Benjamin Batorsky, PhD
Download Data (reviews.pkl.gz): shorturl.at/joMSW OR https://ai.stanford.edu/~amaas/data/sentiment/
GitHub repo: https://github.com/bpben/bagging_to_bert
Google Colaboratory (recommended): https://colab.research.google.com/github/bpben/bagging_to_bert/blob/main/tutorial_notebook_part1.ipynb
SETUP
Who am I?
Explosion of data...unstructured data, that is
What is Natural Language?
“A language that has developed naturally in use (as contrasted with an artificial language or computer code).” (Oxford Dictionary definition)
History, in short
Now we can do things like this
Write with Transformer from HuggingFace
And this:
Though also, this:
DALL-E Prompt: Flight attendant
DALL-E prompt: Lawyer
Text vs structured data
Structured data: Conforms to a particular data model, consistent order, quantifiable
What are some examples?
What is the difference between those and text?
Text vs structured data
Text doesn’t inherently have comparative values!
Sentences have long-term dependencies; changing word order changes meaning
What is the point of NLP?
Goal: Ensure accurate response to input text
Ideal world: Infinite resources, read and respond correctly to every input
Real world: Need heuristics/automation
Goal: Ensure accurate response to informative representation of input text
NLP system should contain
Stops on our tour
The IMDB review dataset
What we’ll be using
Token: “Useful semantic unit”
“I am learning Natural Language Processing (NLP)”
<split on whitespace>
Unigrams
I, am, learning, Natural, Language, Processing, (NLP)
Bigrams
I am, am learning, learning Natural...
7-grams
I am learning Natural Language Processing (NLP)
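The whitespace split and n-gram extraction above can be sketched in a few lines of plain Python (a minimal sketch; real tokenizers also handle punctuation, casing, and contractions):

```python
# Whitespace tokenization and n-gram extraction
def ngrams(text, n):
    tokens = text.split()  # "useful semantic units" via a naive whitespace split
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent = "I am learning Natural Language Processing (NLP)"
unigrams = ngrams(sent, 1)  # ['I', 'am', 'learning', ...]
bigrams = ngrams(sent, 2)   # ['I am', 'am learning', ...]
seven = ngrams(sent, 7)     # the whole 7-token sentence is the single 7-gram
```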
“Bagging”
I am a Patriots fan
I am a Giants fan
Bag for Doc1: am | a | fan | I | Patriots |
Bag for Doc2: am | a | fan | I | Giants |
| | am | a | fan | I | Patriots | Giants |
| Doc1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Doc2 | 1 | 1 | 1 | 1 | 0 | 1 |
Document-Term Matrix
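Building the document-term matrix above can be done from scratch with a counter per document (a sketch of the idea; in practice sklearn's `CountVectorizer` handles this, as in the notebook):

```python
from collections import Counter

# Count tokens per document, then line the counts up against a shared vocabulary
docs = ["I am a Patriots fan", "I am a Giants fan"]
counts = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*counts))           # shared column order
dtm = [[c[t] for t in vocab] for c in counts]  # one row per document
```

Each row has a 1 for every term the document contains and a 0 elsewhere; only the Patriots/Giants columns differ between the two rows.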
The power of the document-term matrix (word count)
| | am | a | fan | I | Patriots | Giants |
| Doc1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Doc2 | 1 | 1 | 1 | 1 | 0 | 1 |
Comedies
Histories
To the notebook - word counts
Sentiment analysis - our progress so far
Approach | Representation | F1 score |
Deterministic | Positive words - negative words | 0.57 |
Word count | Document-term matrix (DTM) | 0.87 |
Making word counts more informative
| | am | a | fan | I | Patriots | Giants |
| Doc1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Doc2 | 1 | 1 | 1 | 1 | 0 | 1 |
Term Frequency - Inverse Document Frequency (TF-IDF)
Note: TFIDF usually has some additional “smoothing” transformations
| am | a | fan | I | Patriots | Giants |
Doc1 | 1 | 1 | 1 | 1 | 1 | 0 |
Doc2 | 1 | 1 | 1 | 1 | 0 | 1 |
Term | DF | IDF (1/DF) | Doc1 TF | Doc2 TF | Doc1 TF*IDF | Doc2 TF*IDF |
Patriots | 1 | 1 | 1 | 0 | 1 | 0 |
Giants | 1 | 1 | 0 | 1 | 0 | 1 |
fan | 2 | 0.5 | 1 | 1 | 0.5 | 0.5 |
The difference between a Patriots fan and a Giants fan
Measuring similarity - “cosine similarity” measure comparing vectors
(higher = more similar)
Similarity (Doc1, Doc2) = 0.8
Similarity (TFIDF Doc1, TFIDF Doc2) = 0.5
| am | a | fan | I | Patriots | Giants |
TFIDF Doc1 | 0.5 | 0.5 | 0.5 | 0.5 | 1 | 0 |
TFIDF Doc2 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 1 |
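The similarity numbers above can be reproduced with the slide's simplified scheme (IDF = 1/DF, no smoothing or logs) and a hand-rolled cosine; this is a from-scratch sketch, whereas the notebook presumably uses sklearn's `TfidfVectorizer`:

```python
import math
from collections import Counter

docs = ["I am a Patriots fan", "I am a Giants fan"]
counts = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*counts))
df = {t: sum(t in c for c in counts) for t in vocab}       # document frequency
raw = [[c[t] for t in vocab] for c in counts]              # plain word counts
tfidf = [[c[t] * (1 / df[t]) for t in vocab] for c in counts]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

cosine(raw[0], raw[1])      # 0.8 -- raw counts make the docs look alike
cosine(tfidf[0], tfidf[1])  # 0.5 -- TF-IDF downweights the shared terms
```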
To the notebook - TF-IDF
Sentiment analysis - our progress so far
Approach | Representation | F1 score |
Deterministic | Positive words - negative words | 0.57 |
Word count | Document-term matrix (DTM) | 0.87 |
TF-IDF | Weighted DTM | 0.87 (0.88 recall) |
Curse of dimensionality with word counts
Shakespeare’s plays
884k total words
28k unique words
https://www.opensourceshakespeare.org/statistics/
Topic models
Extracting axes of variation in data
Categorizing small/mid-size businesses
“...offers classes 7 days a week. Our vegan cafe opened in July of 2013... We also have a retail store selling a limited selection of US-made yoga gear...peruse the retail, enjoy the cafe, or get a massage with one of the body workers in the Wellness Center…”
Yoga studio, cafe AND retail?!
Topic models for informative “business representation”
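One way to see what a topic model does is to factor the document-term matrix itself. Below is a from-scratch sketch of non-negative matrix factorization via multiplicative updates on a toy random matrix; the notebook presumably uses `sklearn.decomposition.NMF` on the real review DTM:

```python
import numpy as np

# Factor a (docs x terms) matrix V into W (docs x topics) @ H (topics x terms),
# keeping all entries non-negative, via Lee & Seung multiplicative updates.
rng = np.random.default_rng(0)
V = rng.random((6, 12))          # toy stand-in for a document-term matrix
k, eps = 2, 1e-9                 # number of topics; small floor to avoid /0
W = rng.random((6, k))
H = rng.random((k, 12))
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(V - W @ H)  # reconstruction error after fitting
```

Each row of W is a document's "topic loadings" (the 10-dimensional representation in the results table); each row of H weights the terms that define a topic.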
To the notebooks - topic models
Sentiment analysis - our progress so far
Approach | Representation | F1 score |
Deterministic | Positive words - negative words | 0.57 |
Word count | Document-term matrix (DTM) | 0.87 |
TF-IDF | Weighted DTM | 0.87 (0.88 recall) |
Topic model (NMF) | Topic loadings (10 dimensions) | 0.76 |
This works on your current dataset
But what about a new dataset?
Transfer learning
Source task: term co-occurrence statistics
What does this tell you about the relationship between pie and cherry vs pie and digital?
Word embeddings: Informative word-level representations
Embeddings for words in job descriptions
Just quick details here
Considerations when using embeddings
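The "average word vector" representation from the results table is just the mean of each token's embedding. A minimal sketch with tiny made-up 4-d vectors standing in for trained 200-d word2vec embeddings (the names and values here are toy assumptions, not real trained vectors):

```python
import numpy as np

# Toy embedding table; real ones come from training word2vec (e.g. via gensim)
emb = {
    "good":  np.array([0.9, 0.1, 0.0, 0.2]),
    "great": np.array([0.8, 0.2, 0.1, 0.1]),
    "movie": np.array([0.0, 0.9, 0.3, 0.4]),
}

def doc_vector(text, emb):
    # Average the vectors of known tokens, skipping out-of-vocabulary words
    vecs = [emb[t] for t in text.lower().split() if t in emb]
    dim = len(next(iter(emb.values())))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = doc_vector("good movie", emb)  # mean of the "good" and "movie" vectors
```

Note the considerations this makes concrete: out-of-vocabulary tokens simply vanish, and averaging throws away word order.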
To the notebooks - word embeddings
Sentiment analysis - our progress so far
Approach | Representation | F1 score |
Deterministic | Positive words - negative words | 0.57 |
Word count | Document-term matrix (DTM) | 0.87 |
TF-IDF | Weighted DTM | 0.87 (0.88 recall) |
Topic model (NMF) | Topic loadings (10 dimensions) | 0.76 |
Word2vec | Average word vector (200-d) | 0.84 |
Oddities of language
Why is this funny?
One method to include context
Tokens: A, good, movie
Representations for a single token, e.g. "A":
Word index: [0, 0, 1, ...]
Word features: {“word_idx”:2, “length”:1, …}
Embedding: [0.03, 0.01, -0.12 …]
Contextualized representation: token pairs (w_i, w_i+1), e.g. ("A", "good"):
Word indices: {“w_i_idx”: [0,0,1,…], “w_i_1_idx”: [0,1,0,…]}
Word features: {“w_i_idx”: [0,0,1,…], “w_i_length”:1, “w_i_1_idx”: [0,1,0,…], “w_i_1_length”:4}
Recurrent Neural Networks
First state contribution to last state
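The left-to-right recurrence can be sketched as a vanilla RNN cell unrolled over a token sequence: each hidden state mixes the current input with the previous state, so the first token's contribution has to survive every intermediate step to reach the last state. Toy random weights, not a trained model (the notebook's LSTM adds gates on top of this same loop):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 20               # input dim, hidden dim, sequence length
Wx = rng.normal(scale=0.3, size=(d_h, d_in))
Wh = rng.normal(scale=0.3, size=(d_h, d_h))
xs = rng.normal(size=(T, d_in))        # stand-in embeddings for T tokens

h = np.zeros(d_h)
states = []
for x in xs:                           # left-to-right recurrence
    h = np.tanh(Wx @ x + Wh @ h)       # new state = f(current input, old state)
    states.append(h)
last = states[-1]                      # first token reaches this only via Wh^19
```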
To the notebooks - LSTM
Sentiment analysis - our progress so far
Approach | Representation | F1 score |
Deterministic | Positive words - negative words | 0.57 |
Word count | Document-term matrix (DTM) | 0.87 |
TF-IDF | Weighted DTM | 0.87 (0.88 recall) |
Topic model (NMF) | Topic loadings (10 dimensions) | 0.76 |
Word2vec | Average word vector (200-d) | 0.84 |
LSTM (5 epoch) | LSTM output for 200 tokens (200, 512) | 0.82 |
Issues with recurrent neural networks
Long-term dependencies
Ben watched a movie today, he did not like it.
To what does “he” refer?
To what does “it” refer?
Long-term dependencies - grammar
Ben watched a movie today, he did not like it.
Long-term dependencies - “attention”
Ben watched a movie today, he did not like it.
Visual of attention weight between tokens
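Those attention weights between tokens come from scaled dot-product attention: every token scores every other token directly, so "it" can put high weight on "movie" regardless of distance. A numpy sketch with toy random matrices standing in for learned query/key/value projections:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 10, 16                                       # 10 tokens, 16-d vectors
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out, w = attention(Q, K, V)                         # w[i, j]: how much token i attends to j
```

Each row of `w` sums to 1: it is exactly the per-token distribution a weight visual displays.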
Transformer models: Attention is all you need!
Source task: Predicting a word from context
I ___ the Patriots.
What should fill in the blank?
Source task: Predicting a word from context
I ___ the Patriots, I want them to win.
What should fill in the blank?
I ___ the Patriots, I want them to lose.
What about here?
Bi-directional Encoder Representations from Transformers (BERT)
To the notebooks - BERT
Sentiment analysis - our progress so far
Approach | Representation | F1 score |
Deterministic | Positive words - negative words | 0.57 |
Word count | Document-term matrix (DTM) | 0.87 |
TF-IDF | Weighted DTM | 0.87 (0.88 recall) |
Topic model (NMF) | Topic loadings (10 dimensions) | 0.76 |
Word2vec | Average word vector (200-d) | 0.84 |
LSTM (5 epoch) | LSTM output for 200 tokens (200, 512) | 0.82 |
BERT | BERT output for 512 tokens (512, 768) | 0.84 |
My advice: Start simple, add complexity
Thank you for coming!
Some additional materials
Get in touch!
Twitter: @bpben2
GitHub: bpben
If you’d like to work with the Institute:
Jobs: https://ai.northeastern.edu/jobs/
Contact: https://ai.northeastern.edu/contact-us/
Appendix
The internals of self attention