CS 6120
Lecture 5: Introduction to Modeling Language
Administrative
Homework 2
Did You Implement a Trie in Homework 1?
Operation | Time Complexity | Auxiliary Space
Insertion | O(n) | O(n*m)
Searching | O(n) | O(1)
(n = length of the key; m = size of the alphabet, i.e., child pointers per node)
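As a refresher, here is a minimal Python sketch of a trie with the insert and search operations the table refers to (class and method names are my own, not from the homework starter code):

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # maps a character to a child TrieNode
        self.is_word = False    # True if a key ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        # O(n) time in the key length; new nodes account for the extra space
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, key):
        # O(n) time, O(1) auxiliary space
        node = self.root
        for ch in key:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word
```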
Some examples of language models
Language Models
Autocomplete: an LM predicts the most likely next word
Automatic speech recognition (ASR): an LM plays a major role in ASR
Machine translation: an LM scores candidate translations for fluency in the target language
Information retrieval: an LM is built for each document, and the document whose model assigns the highest probability to the query is ranked highest
Text generation: an LM generates original text, e.g., emails / blogs
Question answering: an LM answers questions posed in natural language
Today’s Lecture
The N-Gram Model: The Simplest Language Model
Applications of N-Gram Modeling (Today’s Lecture)
Support for people with physical disabilities
P(“eyes awe of an”) < P(“I saw a van”)
P(“he entered the ship to buy it”) < P(“he entered the shop to buy it”)
Learning Objectives
Learning Objectives
Sentence auto-complete
Today and in Homework:
We’ll write a program that will write text on its own!
Modeling Language
Section 0: Administrivia
Section 1: Introduction to N-Grams
Section 2: Special Characters
Section 3: Smoothing
Section 4: Analysis and Evaluation
What is a language model?
Language Model
“hello world”
“You are falling asleep…”
Represented by the joint probability of seeing the words in the order in which they appear: P(w_1, w_2, …, w_n)
How likely is a sequence of words to appear?
Section 2: Introduction to the Markov Chain
Some notation
Notation for word ordering: a sequence of n words is written $w_1^n = w_1 w_2 \cdots w_n$, so $w_1^{k-1}$ denotes the words up to (but not including) position k, and $P(w_1^n)$ denotes the joint probability $P(w_1, w_2, \ldots, w_n)$.
Applying the Chain Rule
“The cat sat on the mat after going to the litter box and sniffing around.”
Recall Bayes Rule
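For reference, a minimal LaTeX statement of the two identities these slides build on (my phrasing, not copied from the slides): Bayes' rule for conditional probability, and the chain rule that factors a sentence probability into next-word conditionals.

\[
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\qquad\text{and}\qquad
P(w_1^n) = \prod_{k=1}^{n} P\!\left(w_k \mid w_1^{k-1}\right)
\]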
n-gram Modeling
Uni-Grams / Bi-Grams / Tri-Grams
Section 2: Introduction to the Markov Chain
Unigram Estimates of Single Word Probabilities
Bigram Estimates of Word Pair Probabilities
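As a sketch of the maximum-likelihood estimates these two slides refer to (standard counting formulas; $C(\cdot)$ is a corpus count and $N$ the total number of tokens):

\[
\hat{P}(w) = \frac{C(w)}{N},
\qquad
\hat{P}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}
\]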
Conditional Probability Distributions of Word Models
The Combinatorics of the N-Gram Model
word 1 in {W}, word 2 in {W}, …, word n in {W} — with vocabulary W, there are |W|^n possible n-word sequences whose probabilities would have to be estimated.
Estimating the Probabilities
Section 2: Introduction to the Markov Chain
The N-Gram Model
The Markov Assumption
What does this mean (computationally)?
The probability of the next word given the last n words is the same as its probability given the entire preceding history; it is conditionally independent of any word seen earlier than the last n words.
nth-Order Markov Assumption
Only need to estimate based on n prior words
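Written out (the standard form of the assumption, using the notation from the chain-rule slide above): under an nth-order Markov assumption the next-word conditional depends only on the previous n words.

\[
P\!\left(w_k \mid w_1^{k-1}\right) \approx P\!\left(w_k \mid w_{k-n}^{\,k-1}\right)
\]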
Sentence Probability w/ML Estimates under Bigram Markov Assumption
Summary
Building the Joint Probability (Example)
“I know it is wet and it is not sunny, but it is lots of good fun that I know is funny.”
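A minimal Python sketch (my own illustration, not the homework code) of estimating bigram probabilities from this one-sentence corpus and multiplying them into a joint probability:

```python
from collections import Counter

corpus = ("I know it is wet and it is not sunny, but it is lots of "
          "good fun that I know is funny.")
tokens = corpus.lower().replace(",", "").replace(".", "").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # maximum-likelihood estimate P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

# joint probability of a short phrase, multiplying the bigram conditionals
phrase = ["it", "is", "not", "sunny"]
p = 1.0
for prev, word in zip(phrase, phrase[1:]):
    p *= bigram_prob(prev, word)
print(p)   # 1.0 * 0.25 * 1.0 = 0.25 on this tiny corpus
```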
Section 2: Introduction to the Markov Chain
Sentence Probability w/ML Estimates under Bigram Markov Assumption
Or, Complete Sentences (e.g., with 4-Grams)
Given
Generate word 4 using P(w_4 | w_1, w_2, w_3), generate word 5 using P(w_5 | w_2, w_3, w_4), generate word 6 using P(w_6 | w_3, w_4, w_5), and so on.
Generation Process
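A hedged sketch of the generation process in Python, using a bigram model rather than 4-grams for brevity (the corpus, sampling loop, and stop condition are my own choices for illustration):

```python
import random
from collections import Counter

tokens = ("I know it is wet and it is not sunny, but it is lots of "
          "good fun that I know is funny").lower().replace(",", "").split()
bigrams = Counter(zip(tokens, tokens[1:]))

def generate(first_word, max_len=15):
    # repeatedly sample the next word from the ML bigram distribution P(. | previous word)
    out = [first_word]
    for _ in range(max_len - 1):
        prev = out[-1]
        candidates = [w for (p, w) in bigrams if p == prev]
        if not candidates:
            break
        weights = [bigrams[(prev, w)] for w in candidates]
        out.append(random.choices(candidates, weights=weights)[0])
    return " ".join(out)

print(generate("it"))
```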
Some Generated Text
Limitations
Main limitation: n-gram models cannot capture long-range dependencies. Example: “Rachel/Phillip wanted to drive to the stadium, but the police officer told her/him to head over to the house.”
This is because n is small, i.e., the context the model can condition on is limited (for example, n < 4).
Modern LMs have much longer contexts
An Example of Modern LM: GPT-2
N-Grams Pre-Processing
Modeling Language
Section 0: Administrivia
Section 1: Introduction to N-Grams
Section 2: Special Characters
Section 3: Smoothing
Section 4: Analysis and Evaluation
Defining Your Vocabulary
Starting / Ending of Sentences
Criteria for Creating Vocabulary
N-Grams: What’s Out of Vocabulary?
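A minimal sketch of the pre-processing these slides describe, under my own assumptions about the markers (`<s>`, `</s>`, and `<UNK>` are the conventional choices; a real pipeline would also handle punctuation and casing more carefully):

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    # keep only words seen at least min_count times; everything else becomes <UNK>
    counts = Counter(w for sent in sentences for w in sent.lower().split())
    return {w for w, c in counts.items() if c >= min_count}

def preprocess(sentence, vocab):
    # add start/end-of-sentence markers and map out-of-vocabulary words to <UNK>
    words = [w if w in vocab else "<UNK>" for w in sentence.lower().split()]
    return ["<s>"] + words + ["</s>"]

sentences = ["I am happy", "I am learning", "you are learning"]
vocab = build_vocab(sentences)
print(preprocess("I am sleepy", vocab))   # ['<s>', 'i', 'am', '<UNK>', '</s>']
```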
Modeling Language
Section 0: Administrivia
Section 1: Introduction to N-Grams
Section 2: Special Characters
Section 3: Smoothing
Section 4: Analysis and Evaluation
Language Model Probability Distributions
Language Model Probability Distributions
Generalization of N-Grams
Smoothing
Smoothing: Intuition
Section 3: Smoothing
Laplacian Smoothing
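The standard add-one (Laplace) and add-k smoothed bigram estimates, written out for reference ($V$ is the vocabulary size; this is the textbook form, not copied from the slides):

\[
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V},
\qquad
P_{\text{add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}
\]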
Raw Counts: Bigrams
Smoothed Counts: Bigrams
Smoothed Probabilities: Bigrams
Section 3: Smoothing
Linear Interpolation
Linear Interpolation: Choosing Lambdas
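Written out (the standard trigram form; the $\lambda$'s sum to 1 and are typically tuned on held-out data, which is what “choosing lambdas” refers to):

\[
\hat{P}(w_i \mid w_{i-2}, w_{i-1}) =
\lambda_1\, P(w_i \mid w_{i-2}, w_{i-1}) +
\lambda_2\, P(w_i \mid w_{i-1}) +
\lambda_3\, P(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
\]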
Section 3: Smoothing
Discounting
Absolute Discounting
Absolute Discounting
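One standard (interpolated) form of absolute discounting for bigrams, included for reference: a fixed discount d (often around 0.75) is subtracted from each observed count, and the freed-up probability mass is redistributed through a unigram term weighted by $\lambda(w_{i-1})$.

\[
P_{\text{AbsDisc}}(w_i \mid w_{i-1}) =
\frac{\max\bigl(C(w_{i-1} w_i) - d,\, 0\bigr)}{C(w_{i-1})}
+ \lambda(w_{i-1})\, P(w_i)
\]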
Section 3: Smoothing
Quiz
Corpus: “I am happy I am learning”
In the context of our corpus, what is the estimated probability of the word “can” following the word “I”, using the bigram model with add-k smoothing where k = 3?
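A worked version of the calculation, under the assumption that the vocabulary is the four distinct words of the corpus, V = {I, am, happy, learning}; whether sentence markers or “can” itself count toward V would change the denominator.

\[
P_{\text{add-}k}(\text{can} \mid \text{I}) =
\frac{C(\text{I can}) + k}{C(\text{I}) + k\,|V|} =
\frac{0 + 3}{2 + 3 \cdot 4} = \frac{3}{14} \approx 0.214
\]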
Laboratory 5.3 - Building the Model
Modeling Language
Section 0: Administrivia
Section 1: Introduction to N-Grams
Section 2: Special Characters
Section 3: Smoothing
Section 4: Analysis and Evaluation
Splitting the Data
Perplexity (ppl)
$\mathrm{PP}(W) = P(w_1 w_2 \cdots w_N)^{-1/N} = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \cdots w_N)}}$
Perplexity: Intuition
Perplexity: Intuition
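One way to make the intuition concrete (a standard illustration, not taken from the slides): if the model assigns every word a uniform probability 1/V, the perplexity equals V, so perplexity can be read as an effective branching factor, i.e., the average number of words the model is “choosing between” at each step.

\[
P(w_i \mid w_1^{i-1}) = \tfrac{1}{V} \;\;\forall i
\quad\Longrightarrow\quad
\mathrm{PP}(W) = \left(\left(\tfrac{1}{V}\right)^{N}\right)^{-1/N} = V
\]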
The Perplexity of Models
Wall Street Journal corpus — Training: 38M words; Evaluation: 1.5M words
Graveyard
Modeling Language
Section 0: Administrivia
Section 1: What does it mean to model language?
Section 2: Introduction to modeling sequences
Section 3: Analysis and evaluation
Example of modeling language sequence
PRP: Personal Pronoun
VBZ: Verb, singular, present tense, 3rd person
NNS: Plural noun
IN: Preposition / subordinating conjunction
DT: Determiner
NN: Singular noun
Using sequential models in parts of speech tagging
Named entity recognition: uses part-of-speech tagging to extract dates, proper nouns / entities, and qualifying descriptors, and to associate entities with each other
Common applications in graph analytics
Co-occurrence of named entities within documents
Applications:
* Law enforcement
* E-commerce analytics
* Search algorithms
Agnosticism to Order
Multi-functional words
Section 2: Introduction to the Markov Chain
What is sequential modeling?
Modern sequential modeling
(For later … the majority of the class)
Today…as an introduction
(actually more difficult than DNNs)
We’ve just discussed
Markov Chains
[Diagram: Markov chain with states s1–s5]
Markov Chains
[Diagram: Markov chain with states s1–s5]
Markov Chains
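To make the diagrams above concrete, a small Python sketch of a Markov chain; the states match the labels s1–s5, but the transition probabilities here are made up purely for illustration.

```python
import random

# hypothetical transition probabilities P(next state | current state)
transitions = {
    "s1": {"s2": 0.6, "s3": 0.4},
    "s2": {"s1": 0.1, "s4": 0.9},
    "s3": {"s5": 1.0},
    "s4": {"s5": 0.5, "s1": 0.5},
    "s5": {"s5": 1.0},
}

def walk(start, steps=6):
    # the next state depends only on the current state (the Markov property)
    path = [start]
    for _ in range(steps):
        options = transitions[path[-1]]
        path.append(random.choices(list(options), weights=list(options.values()))[0])
    return path

print(walk("s1"))
```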
Section 2: Introduction to the Hidden Markov Model
Do we use HMMs for language modeling?
Typically no.
Modeling Language
Section 0: Administrivia
Section 1: What does it mean to model language?
Section 2: Introduction to Sequential Modeling
Section 3: Analysis and Evaluation