1 of 62

Bagging to BERT

A tour of Natural Language Processing

Prepared for PyData NYC

Benjamin Batorsky, PhD

https://muppet.fandom.com

SETUP

2 of 62

Who am I?

  • PhD, Policy Analysis
  • City of Boston Analytics Team
  • ThriveHive, Marketing Data Science
  • MIT, Food Supply Chain
  • Harvard, NLP instructor
  • Ciox Health, Clinical NLP
  • Northeastern EAI, Data Science solutions
  • Building AI solutions for partners across industries
  • Bridging academia and industry
  • Tackling research questions around AI applications and ethics

3 of 62

Explosion of data...unstructured data, that is

4 of 62

What is Natural Language?

5 of 62

What is Natural Language?

“A language that has developed naturally in use (as contrasted with an artificial language or computer code).” (Oxford Dictionary definition)

6 of 62

History, in short

7 of 62

Now we can do things like this

Write with Transformer from HuggingFace

Write With Transformer distil-gpt2

8 of 62

And this:

9 of 62

Though also, this:

DALL-E Prompt: Flight attendant

DALL-E prompt: Lawyer

10 of 62

Text vs structured data

Structured data: Conforms to a particular data model, consistent order, quantifiable

What are some examples?

What is the difference between those and text?

11 of 62

Text vs structured data

  • Height/weight - Numeric values, 6’0” > 5’0”, 6’0” = 3’0” * 2, 1 lb = 16 oz

Text doesn’t inherently have comparative values!

  • Stock ticker - State information available, day 2 follows day 1, prices are numeric

Sentences have long term dependencies, order changes

12 of 62

What is the point of NLP?

Goal: Ensure accurate response to input text

Ideal world: Infinite resources, read and respond correctly to every input

Real world: Need heuristics/automation

Goal: Ensure accurate response to informative representation of input text

NLP system should contain

  • Method for creating informative representation
  • Method for utilizing that informative representation for application

13 of 62

Stops on our tour

  • Tokenization
  • Word frequencies
  • Weighted word frequencies (TF-IDF)
  • Topic models
  • Word embeddings
  • Recurrent Neural Models
  • Large Language Models (e.g. BERT)

14 of 62

The IMDB review dataset

  • Source: http://ai.stanford.edu/~amaas/data/sentiment/
  • 50k unique movie reviews, labelled for sentiment (positive vs negative)
  • Why this dataset?
    • Easily accessible, reasonable size (84 MB)
    • Simple, balanced, binary objective (positive/negative)
    • Short, clean passages (~1k characters on average)
  • What’s missing
    • Issues of size, cleanliness and clarity of target

15 of 62

What we’ll be using

  • Scikit-learn
    • Feature engineering modules for performant word vectorization
    • “Topic modelling” with Non-negative matrix factorization
    • Classification models
  • SpaCy (https://spacy.io/)
    • All-purpose NLP library
  • Transformers (https://huggingface.co/docs/transformers/index)
    • Transformer-based language models
  • PyTorch
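
As a hedged setup check, this stack can be imported in one go (assuming the packages are installed, e.g. pip install scikit-learn spacy transformers torch):

```python
import sklearn
import spacy
import torch
import transformers

# Print versions to confirm the environment is ready for the notebooks
print(sklearn.__version__, spacy.__version__, transformers.__version__, torch.__version__)
```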

16 of 62

Token: “Useful semantic unit”

  • Token - “useful semantic unit”
    • Breaking text into pieces
    • Can be “whitespace”-split, characters, etc
  • “N-gram” - N contiguous tokens
  • Tokenization strategy
    • Extremely important for system design
  • This presentation
    • Whitespace-split, unigrams

“I am learning Natural Language Processing (NLP)”

<split on whitespace>

Unigrams

I, am, learning, Natural, Language, Processing, (NLP)

Bigrams

I am, am learning, learning Natural...

7-grams

I am learning Natural Language Processing (NLP)
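
A plain-Python sketch of the whitespace-split unigrams and n-grams shown above (the ngrams helper is just for illustration):

```python
text = "I am learning Natural Language Processing (NLP)"

# Unigrams: split on whitespace
tokens = text.split()
print(tokens)

# N-grams: N contiguous tokens
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(tokens, 2))  # bigrams: 'I am', 'am learning', ...
print(ngrams(tokens, 7))  # the single 7-gram: the whole sentence
```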

17 of 62

“Bagging”

I am a Patriots fan

I am a Giants fan

Tokens ("I am a Patriots fan"): I, am, a, Patriots, fan
Tokens ("I am a Giants fan"): I, am, a, Giants, fan

Document-Term Matrix:

                          am   a   fan   I   Patriots   Giants
"I am a Patriots fan"      1   1    1    1       1         0
"I am a Giants fan"        1   1    1    1       0         1
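
A minimal sketch of building this document-term matrix with scikit-learn's CountVectorizer. The token_pattern and lowercase settings are chosen so single-character tokens ("I", "a") and capitalization survive and the output matches the table above; scikit-learn's defaults would lowercase and drop them.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am a Patriots fan", "I am a Giants fan"]

# Keep single-character tokens and case so the matrix matches the table above
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the columns (vocabulary)
print(dtm.toarray())                       # one row of counts per document
```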

18 of 62

The power of the document-term matrix (word count)

                          am   a   fan   I   Patriots   Giants
"I am a Patriots fan"      1   1    1    1       1         0
"I am a Giants fan"        1   1    1    1       0         1

[Figure: documents grouped by their word counts (Comedies vs Histories)]
19 of 62

To the notebook - word counts

20 of 62

Sentiment analysis - our progress so far

Approach        Representation                      F1 score
Deterministic   Positive words - negative words     0.57
Word count      Document-term matrix (DTM)          0.87
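
The "Word count" row above pairs the document-term matrix with a classifier. A hedged sketch of that setup (toy texts and labels standing in for the IMDB reviews, logistic regression as an assumed classifier choice):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the IMDB reviews and their 0/1 sentiment labels
texts = ["great movie, loved it", "terrible film, hated it",
         "wonderful acting and story", "boring and badly written"] * 10
labels = [1, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# Document-term matrix -> logistic regression classifier
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f1_score(y_test, model.predict(X_test)))
```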

21 of 62

Making word counts more informative

  • NLP: Informative representation of text
  • Raw word count = each word counted the same
    • “I am a Patriots fan” vs “I am a Giants fan”
  • Reduce “noise”
    • Turn words into common form
      • “I am” and “I will” -> “I be”
    • Stripping uninformative words
      • e.g. “the”, “and”
  • Weighting
    • Important words count more, unimportant words count less

        am   a   fan   I   Patriots   Giants
Doc1     1   1    1    1       1         0
Doc2     1   1    1    1       0         1
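
A hedged sketch of the "reduce noise" step, using spaCy to lemmatize and strip stop words before counting. It assumes the small English model has been downloaded (python -m spacy download en_core_web_sm); exact output depends on the model version.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("I am a Patriots fan")
# Lemmatize ("am" -> "be") and drop stop words and punctuation
cleaned = [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct]
print(cleaned)  # e.g. ['Patriots', 'fan']
```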

22 of 62

Term Frequency - Inverse Document Frequency (TF-IDF)

  • Term frequency: Count of term (T) within a document
  • Document frequency (DF)
    • Documents with T
  • Inverse document frequency (IDF)
    • 1 / DF
    • High DF (common term) = low IDF
    • Lower DF (uncommon term) = high IDF
  • TF*IDF, term count weighted by how “informative” that term is

Note: TFIDF usually has some additional “smoothing” transformations

        am   a   fan   I   Patriots   Giants
Doc1     1   1    1    1       1         0
Doc2     1   1    1    1       0         1

T          DF    IDF    Doc1 TF    Doc2 TF    Doc1 TF*IDF    Doc2 TF*IDF
Patriots    1     1        1          0            1              0
Giants      1     1        0          1            0              1
fan         2    0.5       1          1           0.5            0.5
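
A minimal sketch with scikit-learn's TfidfVectorizer. Note that scikit-learn applies a log-scaled, smoothed IDF and L2-normalizes each row, so the numbers will not exactly match the simplified 1/DF example above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am a Patriots fan", "I am a Giants fan"]

# Same tokenization choices as the word-count example, so the columns line up
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # shared words get lower weight than Patriots/Giants
```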

23 of 62

The difference between a Patriots fan and a Giants fan

Measuring similarity - “cosine similarity” measure comparing vectors

(higher = more similar)

Similarity (Doc1, Doc2) = 0.8

Similarity (TFIDF Doc1, TFIDF Doc2) = 0.5

              am    a    fan    I    Patriots   Giants
TFIDF Doc1   0.5   0.5   0.5   0.5       1         0
TFIDF Doc2   0.5   0.5   0.5   0.5       0         1
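
A quick sketch reproducing these two numbers with scikit-learn's cosine_similarity, using the vectors from the tables above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Columns: [am, a, fan, I, Patriots, Giants]
count_doc1 = np.array([[1, 1, 1, 1, 1, 0]])
count_doc2 = np.array([[1, 1, 1, 1, 0, 1]])
tfidf_doc1 = np.array([[0.5, 0.5, 0.5, 0.5, 1, 0]])
tfidf_doc2 = np.array([[0.5, 0.5, 0.5, 0.5, 0, 1]])

print(cosine_similarity(count_doc1, count_doc2))  # ~0.8
print(cosine_similarity(tfidf_doc1, tfidf_doc2))  # 0.5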

24 of 62

To the notebook - TF-IDF

25 of 62

Sentiment analysis - our progress so far

Approach        Representation                      F1 score
Deterministic   Positive words - negative words     0.57
Word count      Document-term matrix (DTM)          0.87
TF-IDF          Weighted DTM                        0.87 (0.88 recall)

26 of 62

Curse of dimensionality with word counts

Shakespeare’s plays

884k total words

28k unique words

https://www.opensourceshakespeare.org/statistics/

27 of 62

Topic models

  • “Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents” (Blei 2012)
  • NLP - Informative representation of text
  • Document = f(Topics), Topics = g(words)
    • Typically number of topics << size of vocabulary
    • Want to minimize the information lost by representing in this way
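
A minimal NMF topic-model sketch on a toy corpus (the documents, topic count, and top-word display here are illustrative, not the business-website pipeline from the later slides):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quarterback threw a touchdown in the football game",
    "the team won the football game with a late touchdown",
    "the chef cooked pasta and baked fresh bread",
    "the restaurant serves pasta, bread and dessert",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Document = f(topics), topic = g(words); 2 topics << size of vocabulary
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(X)      # document-level topic loadings
terms = tfidf.get_feature_names_out()

for i, topic in enumerate(nmf.components_):
    top = terms[topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
print(doc_topics.round(2))
```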

28 of 62

Extracting axes of variation in data

29 of 62

Extracting axes of variation in data

30 of 62

Categorizing small/mid-size businesses

  • Small/Mid-sized businesses that straddle multiple categories
  • Customer questions
    • Sales: “Which businesses are similar to this lead?”
    • Marketing: “How do we better personalize ad campaign messaging?”
  • Business websites rich source for services offered

“...offers classes 7 days a week. Our vegan cafe opened in July of 2013... We also have a retail store selling a limited selection of US-made yoga gear...peruse the retail, enjoy the cafe, or get a massage with one of the body workers in the Wellness Center…”

Yoga studio, cafe AND retail?!

31 of 62

Topic models for informative “business representation”

  • Topic modelling
    • Website text to TF-IDF vectors
    • Non-negative matrix factorization (NMF)
  • Output
    • Business-level representation in “topic space”
    • Calculate business-business similarity
    • Split into “similar” groups, based on parameters
    • Other predictive models

32 of 62

To the notebooks - topic models

33 of 62

Sentiment analysis - our progress so far

Approach            Representation                      F1 score
Deterministic       Positive words - negative words     0.57
Word count          Document-term matrix (DTM)          0.87
TF-IDF              Weighted DTM                        0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)      0.76

34 of 62

This works on your current dataset

35 of 62

But what about a new dataset?

36 of 62

Transfer learning

37 of 62

Source task: term co-occurrence statistics

What does this tell you about the relationship between pie and cherry vs pie and digital?

38 of 62

Word embeddings: Informative word-level representations

  • “You shall know a word by the company it keeps” J.R. Firth (English Linguist)
  • Learn a numerical vector for each word based on context
    • Word2Vec: Neural model
    • GloVe: Corpus-based statistical model
  • Distance between words has meaning
    • Similar words = similar vectors
    • Madrid:Spain as Rome:Italy
  • Dimensions themselves not (readily) interpretable
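
A hedged sketch of using pre-trained embeddings through spaCy. It assumes the en_core_web_md model (which ships with word vectors) has been downloaded; exact similarity values will vary by model.

```python
import spacy

# Assumes: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

cherry, digital, pie = nlp("cherry digital pie")
print(pie.similarity(cherry))   # higher: "cherry pie" contexts are common
print(pie.similarity(digital))  # lower

# Each word maps to a dense vector; similar words get similar vectors
print(pie.vector.shape)
```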

39 of 62

Embeddings for words in job descriptions

Just quick details here

40 of 62

Considerations when using embeddings

  • Pre-trained embeddings are widely available
    • Often trained on general internet
    • Can find domain-specific
  • Caution!
    • Bias in text = bias in embeddings
  • Gender bias in adjectives reflects changing mindsets

41 of 62

To the notebooks - word embeddings

42 of 62

Sentiment analysis - our progress so far

Approach            Representation                      F1 score
Deterministic       Positive words - negative words     0.57
Word count          Document-term matrix (DTM)          0.87
TF-IDF              Weighted DTM                        0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)      0.76
Word2vec            Average word vector (200-d)         0.84

43 of 62

Oddities of language

Why is this funny?

44 of 62

Oddities of language

Why is this funny?

  • “Homonym” - Same spelling or pronunciation, different meaning
  • Context matters!
  • Bagging - word counts independent from one another
  • GloVe/Word2Vec - one vector per word

45 of 62

One method to include context

Single token "A" → possible representations:

  • Word index: [0, 0, 1, ...]
  • Word features: {"word_idx": 2, "length": 1, …}
  • Embedding: [0.03, 0.01, -0.12, …]

Token pair "good movie" → contextualized representation of (w_i, w_i+1):

  • Word indices: {"w_i_idx": [0, 0, 1, …], "w_i_1_idx": [0, 1, 0, …]}
  • Word features: {"w_i_idx": [0, 0, 1, …], "w_i_length": 1, "w_i_1_idx": [0, 1, 0, …], "w_i_1_length": 4}

46 of 62

Recurrent Neural Networks

  • Information from previous states maintained in “hidden state”
  • Problem:
    • Longer sequences - less information from early stages
  • Various methods for “forgetting” and “remembering” specific information
    • LSTM - Long Short-Term Memory

First state contribution to last state
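
A minimal PyTorch sketch of an LSTM sentiment classifier; the vocabulary size and dimensions here are assumptions, roughly matching the 200-token, 512-unit setup in the results table.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embed tokens, run an LSTM, classify from the final hidden state."""
    def __init__(self, vocab_size=10_000, embed_dim=200, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(x)      # hidden state carried step to step
        return self.out(h_n[-1])                # one logit per review

model = LSTMClassifier()
fake_batch = torch.randint(0, 10_000, (8, 200))  # 8 reviews, 200 token ids each
print(model(fake_batch).shape)                   # torch.Size([8, 1])
```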

47 of 62

To the notebooks - LSTM

48 of 62

Sentiment analysis - our progress so far

Approach            Representation                            F1 score
Deterministic       Positive words - negative words           0.57
Word count          Document-term matrix (DTM)                0.87
TF-IDF              Weighted DTM                              0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)            0.76
Word2vec            Average word vector (200-d)               0.84
LSTM (5 epochs)     LSTM output for 200 tokens (200, 512)     0.82

49 of 62

Issues with recurrent neural networks

  • Long training time
    • Sequence models hard to parallelize, each step dependent on previous
  • Issues of “forgetting” with long passages
    • Language has long-term dependencies
    • LSTM, Bi-directional LSTM don’t necessarily solve this

50 of 62

Long-term dependencies

Ben watched a movie today, he did not like it.

To what does “he” refer?

To what does “it” refer?

51 of 62

Long-term dependencies - grammar

Ben watched a movie today, he did not like it.

52 of 62

Long-term dependencies - “attention”

Ben watched a movie today, he did not like it.

Visual of attention weight between tokens

53 of 62

Transformer models: Attention is all you need!

  • Encoder: Translates from input to “encoded” space
    • View over entire sequence
  • Decoder: Translates from encoded to output
    • Encoder output + previous decoder output
  • Attention incorporated throughout
  • Remove need for “recurrence”
    • Sequence position as a “positional encoding”

54 of 62

Source task: Predicting a word from context

I ___ the Patriots.

What should fill in the blank?

55 of 62

Source task: Predicting a word from context

I ___ the Patriots, I want them to win.

What should fill in the blank?

I ___ the Patriots, I want them to lose.

What about here?

56 of 62

Bi-directional Encoder Representations from Transformers (BERT)

  • Transformer Language Model
    • Encoder+Decoder
    • Trained to predict the next token
    • Output is a product of the encoder output and previous decoder outputs
  • BERT
    • Encoder-only
    • Trained to predict masked/replaced token
    • Each output is a product of the entire sequence
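
A short Hugging Face Transformers sketch of both ideas on this slide: BERT's masked-token pre-training task, and the encoder output giving one contextual vector per token. Model weights download on first run.

```python
from transformers import pipeline, AutoModel, AutoTokenizer

# Masked-token prediction, the task BERT is pre-trained on
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I [MASK] the Patriots, I want them to win."):
    print(pred["token_str"], round(pred["score"], 3))

# Encoder-only output: one contextual vector per token in the sequence
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tok("Ben watched a movie today, he did not like it.", return_tensors="pt")
hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, num_tokens, 768)
```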

57 of 62

To the notebooks - BERT

58 of 62

Sentiment analysis - our progress so far

Approach            Representation                            F1 score
Deterministic       Positive words - negative words           0.57
Word count          Document-term matrix (DTM)                0.87
TF-IDF              Weighted DTM                              0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)            0.76
Word2vec            Average word vector (200-d)               0.84
LSTM (5 epochs)     LSTM output for 200 tokens (200, 512)     0.82
BERT                BERT output for 512 tokens (512, 768)     0.84

59 of 62

My advice: Start simple, add complexity

  • Method for creating informative representation
    • Word counts, weighted word counts (TF-IDF)
      • Experiment with vocabulary and weights
    • Word embeddings
      • Experiment with sources, aggregations
    • Contextualized word embeddings
      • Try hand-curation (e.g. next-word embedding)
      • Bring in big guns (e.g. BERT)
  • Method for utilizing that informative representation for application
    • Corpus statistics (e.g. log-likelihood of words)
    • Similarity between words or documents (e.g. cosine similarity)
    • Classifier (e.g. regression)
    • Sequence tagging (e.g. named-entity recognition)
    • Language generation (predict next word)

60 of 62

Thank you for coming!

Some additional materials

Get in touch!

https://benbatorsky.com/

Twitter: @bpben2

Github: bpben

https://ai.northeastern.edu/jobs/

If you’d like to work with the Institute:

https://ai.northeastern.edu/contact-us/

61 of 62

Appendix

62 of 62

The internals of self attention
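
Assuming this appendix slide walks through scaled dot-product self-attention, here is a minimal single-head sketch in PyTorch (no masking, no multi-head split):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # project inputs to queries/keys/values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = scores.softmax(dim=-1)                   # attention weight between every token pair
    return weights @ v, weights

seq_len, d_model = 10, 16
x = torch.randn(seq_len, d_model)                      # one embedding per token
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)  # torch.Size([10, 16]) torch.Size([10, 10])
```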