1 of 62

Bagging to BERT

A tour of Natural Language Processing

Prepared for PyData NYC

Benjamin Batorsky, PhD

https://muppet.fandom.com

SETUP

2 of 62

Who am I?

  • PhD, Policy Analysis
  • City of Boston Analytics Team
  • ThriveHive, Marketing Data Science
  • MIT, Food Supply Chain
  • Harvard, NLP instructor
  • Ciox Health, Clinical NLP
  • Northeastern EAI, Data Science solutions
  • Building AI solutions for partners across industries
  • Bridging academia and industry
  • Tackling research questions around AI applications and ethics

3 of 62

Explosion of data...unstructured data, that is

4 of 62

What is Natural Language?

5 of 62

What is Natural Language?

“A language that has developed naturally in use (as contrasted with an artificial language or computer code).” (Oxford Dictionary definition)

6 of 62

History, in short

7 of 62

Now we can do things like this

Write with Transformer from HuggingFace

Write With Transformer distil-gpt2

8 of 62

And this:

9 of 62

Though also, this:

DALL-E Prompt: Flight attendant

DALL-E prompt: Lawyer

10 of 62

Text vs structured data

Structured data: Conforms to a particular data model, consistent order, quantifiable

What are some examples?

What is the difference between those and text?

11 of 62

Text vs structured data

  • Height/weight - Numeric values, 6’0” > 5’0”, 6’0” = 3’0” * 2, 1 lb = 16 oz

Text doesn’t inherently have comparative values!

  • Stock ticker - State information available, day 2 follows day 1, prices are numeric

Sentences have long term dependencies, order changes

12 of 62

What is the point of NLP?

Goal: Ensure accurate response to input text

Ideal world: Infinite resources, read and respond correctly to every input

Real world: Need heuristics/automation

Goal: Ensure accurate response to informative representation of input text

NLP system should contain

  • Method for creating informative representation
  • Method for utilizing that informative representation for application

13 of 62

Stops on our tour

  • Tokenization
  • Word frequencies
  • Weighted word frequencies (TF-IDF)
  • Topic models
  • Word embeddings
  • Recurrent Neural Models
  • Large Language Models (e.g. BERT)

14 of 62

The IMDB review dataset

  • Source: http://ai.stanford.edu/~amaas/data/sentiment/
  • 50k unique movie reviews, labelled for sentiment (positive vs negative)
  • Why this dataset?
    • Easily accessible, reasonable size (84 MB)
    • Simple, balanced, binary objective (positive/negative)
    • Short, clean passages (~1k characters on average)
  • What’s missing
    • Issues of size, cleanliness and clarity of target

15 of 62

What we’ll be using

  • Scikit-learn
    • Feature engineering modules for performant word vectorization
    • “Topic modelling” with Non-negative matrix factorization
    • Classification models
  • SpaCy (https://spacy.io/)
    • All-purpose NLP library
  • Transformers (https://huggingface.co/docs/transformers/index)
    • Transformer-based language models
  • PyTorch
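
As a hedged setup check, this stack can be imported in one go (assuming the packages are installed, e.g. pip install scikit-learn spacy transformers torch):

```python
import sklearn
import spacy
import torch
import transformers

# Print versions to confirm the environment is ready for the notebooks
print(sklearn.__version__, spacy.__version__, transformers.__version__, torch.__version__)
```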

16 of 62

Token: “Useful semantic unit”

  • Token - “useful semantic unit”
    • Breaking text into pieces
    • Can be “whitespace”-split, characters, etc
  • “N-gram” - N contiguous tokens
  • Tokenization strategy
    • Extremely important for system design
  • This presentation
    • Whitespace-split, unigrams

“I am learning Natural Language Processing (NLP)”

<split on whitespace>

Unigrams

I, am, learning, Natural, Language, Processing, (NLP)

Bigrams

I am, am learning, learning Natural...

7-grams

I am learning Natural Language Processing (NLP)
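
A plain-Python sketch of the whitespace-split unigrams and n-grams shown above (the ngrams helper is just for illustration):

```python
text = "I am learning Natural Language Processing (NLP)"

# Unigrams: split on whitespace
tokens = text.split()
print(tokens)

# N-grams: N contiguous tokens
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(tokens, 2))  # bigrams: 'I am', 'am learning', ...
print(ngrams(tokens, 7))  # the single 7-gram: the whole sentence
```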

17 of 62

“Bagging”

I am a Patriots fan

I am a Giants fan

Tokens ("I am a Patriots fan"): I, am, a, Patriots, fan
Tokens ("I am a Giants fan"): I, am, a, Giants, fan

Document-Term Matrix:

                          am   a   fan   I   Patriots   Giants
"I am a Patriots fan"      1   1    1    1       1         0
"I am a Giants fan"        1   1    1    1       0         1
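
A minimal sketch of building this document-term matrix with scikit-learn's CountVectorizer. The token_pattern and lowercase settings are chosen so single-character tokens ("I", "a") and capitalization survive and the output matches the table above; scikit-learn's defaults would lowercase and drop them.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am a Patriots fan", "I am a Giants fan"]

# Keep single-character tokens and case so the matrix matches the table above
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the columns (vocabulary)
print(dtm.toarray())                       # one row of counts per document
```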

18 of 62

The power of the document-term matrix (word count)

                          am   a   fan   I   Patriots   Giants
"I am a Patriots fan"      1   1    1    1       1         0
"I am a Giants fan"        1   1    1    1       0         1

[Figure: documents grouped by their word counts (Comedies vs Histories)]
19 of 62

To the notebook - word counts

20 of 62

Sentiment analysis - our progress so far

Approach        Representation                      F1 score
Deterministic   Positive words - negative words     0.57
Word count      Document-term matrix (DTM)          0.87
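
The "Word count" row above pairs the document-term matrix with a classifier. A hedged sketch of that setup (toy texts and labels standing in for the IMDB reviews, logistic regression as an assumed classifier choice):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the IMDB reviews and their 0/1 sentiment labels
texts = ["great movie, loved it", "terrible film, hated it",
         "wonderful acting and story", "boring and badly written"] * 10
labels = [1, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# Document-term matrix -> logistic regression classifier
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f1_score(y_test, model.predict(X_test)))
```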

21 of 62

Making word counts more informative

  • NLP: Informative representation of text
  • Raw word count = each word counted the same
    • “I am a Patriots fan” vs “I am a Giants fan”
  • Reduce “noise”
    • Turn words into common form
      • “I am” and “I will” -> “I be”
    • Stripping uninformative words
      • e.g. “the”, “and”
  • Weighting
    • Important words count more, unimportant words count less

        am   a   fan   I   Patriots   Giants
Doc1     1   1    1    1       1         0
Doc2     1   1    1    1       0         1
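
A hedged sketch of the "reduce noise" step, using spaCy to lemmatize and strip stop words before counting. It assumes the small English model has been downloaded (python -m spacy download en_core_web_sm); exact output depends on the model version.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("I am a Patriots fan")
# Lemmatize ("am" -> "be") and drop stop words and punctuation
cleaned = [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct]
print(cleaned)  # e.g. ['Patriots', 'fan']
```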

22 of 62

Term Frequency - Inverse Document Frequency (TF-IDF)

  • Term frequency: Count of term (T) within a document
  • Document frequency (DF)
    • Documents with T
  • Inverse document frequency (IDF)
    • 1 / DF
    • High DF (common term) = low IDF
    • Lower DF (uncommon term) = high IDF
  • TF*IDF, term count weighted by how “informative” that term is

Note: TFIDF usually has some additional “smoothing” transformations

        am   a   fan   I   Patriots   Giants
Doc1     1   1    1    1       1         0
Doc2     1   1    1    1       0         1

T          DF    IDF    Doc1 TF    Doc2 TF    Doc1 TF*IDF    Doc2 TF*IDF
Patriots    1     1        1          0            1              0
Giants      1     1        0          1            0              1
fan         2    0.5       1          1           0.5            0.5
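
A minimal sketch with scikit-learn's TfidfVectorizer. Note that scikit-learn applies a log-scaled, smoothed IDF and L2-normalizes each row, so the numbers will not exactly match the simplified 1/DF example above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am a Patriots fan", "I am a Giants fan"]

# Same tokenization choices as the word-count example, so the columns line up
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # shared words get lower weight than Patriots/Giants
```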

23 of 62

The difference between a Patriots fan and a Giants fan

Measuring similarity - “cosine similarity” measure comparing vectors

(higher = more similar)

Similarity (Doc1, Doc2) = 0.8

Similarity (TFIDF Doc1, TFIDF Doc2) = 0.5

              am    a    fan    I    Patriots   Giants
TFIDF Doc1   0.5   0.5   0.5   0.5       1         0
TFIDF Doc2   0.5   0.5   0.5   0.5       0         1
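
A quick sketch reproducing these two numbers with scikit-learn's cosine_similarity, using the vectors from the tables above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Columns: [am, a, fan, I, Patriots, Giants]
count_doc1 = np.array([[1, 1, 1, 1, 1, 0]])
count_doc2 = np.array([[1, 1, 1, 1, 0, 1]])
tfidf_doc1 = np.array([[0.5, 0.5, 0.5, 0.5, 1, 0]])
tfidf_doc2 = np.array([[0.5, 0.5, 0.5, 0.5, 0, 1]])

print(cosine_similarity(count_doc1, count_doc2))  # ~0.8
print(cosine_similarity(tfidf_doc1, tfidf_doc2))  # 0.5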

24 of 62

To the notebook - TF-IDF

25 of 62

Sentiment analysis - our progress so far

Approach        Representation                      F1 score
Deterministic   Positive words - negative words     0.57
Word count      Document-term matrix (DTM)          0.87
TF-IDF          Weighted DTM                        0.87 (0.88 recall)

26 of 62

Curse of dimensionality with word counts

Shakespeare’s plays

884k total words

28k unique words

https://www.opensourceshakespeare.org/statistics/

27 of 62

Topic models

  • “Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents” (Blei 2012)
  • NLP - Informative representation of text
  • Document = f(Topics), Topics = g(words)
    • Typically number of topics << size of vocabulary
    • Want to minimize the information lost by representing in this way
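
A minimal NMF topic-model sketch on a toy corpus (the documents, topic count, and top-word display here are illustrative, not the business-website pipeline from the later slides):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quarterback threw a touchdown in the football game",
    "the team won the football game with a late touchdown",
    "the chef cooked pasta and baked fresh bread",
    "the restaurant serves pasta, bread and dessert",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Document = f(topics), topic = g(words); 2 topics << size of vocabulary
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(X)      # document-level topic loadings
terms = tfidf.get_feature_names_out()

for i, topic in enumerate(nmf.components_):
    top = terms[topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
print(doc_topics.round(2))
```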

28 of 62

Extracting axes of variation in data

29 of 62

Extracting axes of variation in data

30 of 62

Categorizing small/mid-size businesses

  • Small/Mid-sized businesses that straddle multiple categories
  • Customer questions
    • Sales: “Which businesses are similar to this lead?”
    • Marketing: “How do we better personalize ad campaign messaging?”
  • Business websites rich source for services offered

“...offers classes 7 days a week. Our vegan cafe opened in July of 2013... We also have a retail store selling a limited selection of US-made yoga gear...peruse the retail, enjoy the cafe, or get a massage with one of the body workers in the Wellness Center…”

Yoga studio, cafe AND retail?!

31 of 62

Topic models for informative “business representation”

  • Topic modelling
    • Website text to TF-IDF vectors
    • Non-negative matrix factorization (NMF)
  • Output
    • Business-level representation in “topic space”
    • Calculate business-business similarity
    • Split into “similar” groups, based on parameters
    • Other predictive models

32 of 62

To the notebooks - topic models

33 of 62

Sentiment analysis - our progress so far

Approach            Representation                      F1 score
Deterministic       Positive words - negative words     0.57
Word count          Document-term matrix (DTM)          0.87
TF-IDF              Weighted DTM                        0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)      0.76

34 of 62

This works on your current dataset

35 of 62

But what about a new dataset?

36 of 62

Transfer learning

37 of 62

Source task: term co-occurrence statistics

What does this tell you about the relationship between pie and cherry vs pie and digital?

38 of 62

Word embeddings: Informative word-level representations

  • “You shall know a word by the company it keeps” J.R. Firth (English Linguist)
  • Learn a numerical vector for each word based on context
    • Word2Vec: Neural model
    • GloVe: Corpus-based statistical model
  • Distance between words has meaning
    • Similar words = similar vectors
    • Madrid:Spain as Rome:Italy
  • Dimensions themselves not (readily) interpretable
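
A hedged sketch of using pre-trained embeddings through spaCy. It assumes the en_core_web_md model (which ships with word vectors) has been downloaded; exact similarity values will vary by model.

```python
import spacy

# Assumes: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

cherry, digital, pie = nlp("cherry digital pie")
print(pie.similarity(cherry))   # higher: "cherry pie" contexts are common
print(pie.similarity(digital))  # lower

# Each word maps to a dense vector; similar words get similar vectors
print(pie.vector.shape)
```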

39 of 62

Embeddings for words in job descriptions

Just quick details here

40 of 62

Considerations when using embeddings

  • Pre-trained embeddings are widely available
    • Often trained on general internet
    • Can find domain-specific
  • Caution!
    • Bias in text = bias in embeddings
  • Gender bias in adjectives reflects changing mindsets

41 of 62

To the notebooks - word embeddings

42 of 62

Sentiment analysis - our progress so far

Approach            Representation                      F1 score
Deterministic       Positive words - negative words     0.57
Word count          Document-term matrix (DTM)          0.87
TF-IDF              Weighted DTM                        0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)      0.76
Word2vec            Average word vector (200-d)         0.84

43 of 62

Oddities of language

Why is this funny?

44 of 62

Oddities of language

Why is this funny?

  • “Homonym” - Same spelling or pronunciation, different meaning
  • Context matters!
  • Bagging - word counts independent from one another
  • GloVe/Word2Vec - one vector per word

45 of 62

One method to include context

Single token "A" → possible representations:

  • Word index: [0, 0, 1, ...]
  • Word features: {"word_idx": 2, "length": 1, …}
  • Embedding: [0.03, 0.01, -0.12, …]

Token pair "good movie" → contextualized representation of (w_i, w_i+1):

  • Word indices: {"w_i_idx": [0, 0, 1, …], "w_i_1_idx": [0, 1, 0, …]}
  • Word features: {"w_i_idx": [0, 0, 1, …], "w_i_length": 1, "w_i_1_idx": [0, 1, 0, …], "w_i_1_length": 4}

46 of 62

Recurrent Neural Networks

  • Information from previous states maintained in “hidden state”
  • Problem:
    • Longer sequences - less information from early stages
  • Various methods for “forgetting” and “remembering” specific information
    • LSTM - Long Short-Term Memory

First state contribution to last state
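
A minimal PyTorch sketch of an LSTM sentiment classifier; the vocabulary size and dimensions here are assumptions, roughly matching the 200-token, 512-unit setup in the results table.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embed tokens, run an LSTM, classify from the final hidden state."""
    def __init__(self, vocab_size=10_000, embed_dim=200, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(x)      # hidden state carried step to step
        return self.out(h_n[-1])                # one logit per review

model = LSTMClassifier()
fake_batch = torch.randint(0, 10_000, (8, 200))  # 8 reviews, 200 token ids each
print(model(fake_batch).shape)                   # torch.Size([8, 1])
```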

47 of 62

To the notebooks - LSTM

48 of 62

Sentiment analysis - our progress so far

Approach            Representation                            F1 score
Deterministic       Positive words - negative words           0.57
Word count          Document-term matrix (DTM)                0.87
TF-IDF              Weighted DTM                              0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)            0.76
Word2vec            Average word vector (200-d)               0.84
LSTM (5 epochs)     LSTM output for 200 tokens (200, 512)     0.82

49 of 62

Issues with recurrent neural networks

  • Long training time
    • Sequence models hard to parallelize, each step dependent on previous
  • Issues of “forgetting” with long passages
    • Language has long-term dependencies
    • LSTM, Bi-directional LSTM don’t necessarily solve this

50 of 62

Long-term dependencies

Ben watched a movie today, he did not like it.

To what does “he” refer?

To what does “it” refer?

51 of 62

Long-term dependencies - grammar

Ben watched a movie today, he did not like it.

52 of 62

Long-term dependencies - “attention”

Ben watched a movie today, he did not like it.

Visual of attention weight between tokens

53 of 62

Transformer models: Attention is all you need!

  • Encoder: Translates from input to “encoded” space
    • View over entire sequence
  • Decoder: Translates from encoded to output
    • Encoder output + previous decoder output
  • Attention incorporated throughout
  • Remove need for “recurrence”
    • Sequence position as a “positional encoding”

54 of 62

Source task: Predicting a word from context

I ___ the Patriots.

What should fill in the blank?

55 of 62

Source task: Predicting a word from context

I ___ the Patriots, I want them to win.

What should fill in the blank?

I ___ the Patriots, I want them to lose.

What about here?

56 of 62

Bi-directional Encoder Representations from Transformers (BERT)

  • Transformer Language Model
    • Encoder+Decoder
    • Trained to predict the next token
    • Output is a product of the encoder output and previous decoder outputs
  • BERT
    • Encoder-only
    • Trained to predict masked/replaced token
    • Each output is a product of the entire sequence
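
A short Hugging Face Transformers sketch of both ideas on this slide: BERT's masked-token pre-training task, and the encoder output giving one contextual vector per token. Model weights download on first run.

```python
from transformers import pipeline, AutoModel, AutoTokenizer

# Masked-token prediction, the task BERT is pre-trained on
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I [MASK] the Patriots, I want them to win."):
    print(pred["token_str"], round(pred["score"], 3))

# Encoder-only output: one contextual vector per token in the sequence
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tok("Ben watched a movie today, he did not like it.", return_tensors="pt")
hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, num_tokens, 768)
```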

57 of 62

To the notebooks - BERT

58 of 62

Sentiment analysis - our progress so far

Approach            Representation                            F1 score
Deterministic       Positive words - negative words           0.57
Word count          Document-term matrix (DTM)                0.87
TF-IDF              Weighted DTM                              0.87 (0.88 recall)
Topic model (NMF)   Topic loadings (10 dimensions)            0.76
Word2vec            Average word vector (200-d)               0.84
LSTM (5 epochs)     LSTM output for 200 tokens (200, 512)     0.82
BERT                BERT output for 512 tokens (512, 768)     0.84

59 of 62

My advice: Start simple, add complexity

  • Method for creating informative representation
    • Word counts, weighted word counts (TF-IDF)
      • Experiment with vocabulary and weights
    • Word embeddings
      • Experiment with sources, aggregations
    • Contextualized word embeddings
      • Try hand-curation (e.g. next-word embedding)
      • Bring in big guns (e.g. BERT)
  • Method for utilizing that informative representation for application
    • Corpus statistics (e.g. log-likelihood of words)
    • Similarity between words or documents (e.g. cosine similarity)
    • Classifier (e.g. regression)
    • Sequence tagging (e.g. named-entity recognition)
    • Language generation (predict next word)

60 of 62

Thank you for coming!

Some additional materials

Get in touch!

https://benbatorsky.com/

Twitter: @bpben2

Github: bpben

https://ai.northeastern.edu/jobs/

If you’d like to work with the Institute:

https://ai.northeastern.edu/contact-us/

61 of 62

Appendix

62 of 62

The internals of self attention
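
Assuming this appendix slide walks through scaled dot-product self-attention, here is a minimal single-head sketch in PyTorch (no masking, no multi-head split):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # project inputs to queries/keys/values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = scores.softmax(dim=-1)                   # attention weight between every token pair
    return weights @ v, weights

seq_len, d_model = 10, 16
x = torch.randn(seq_len, d_model)                      # one embedding per token
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)  # torch.Size([10, 16]) torch.Size([10, 10])
```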