1 of 87

CS 6120

Lecture 5: Introduction to Modeling Language

2 of 87

Administrative

  • Today: Homework Volunteer and Voluntelling Students
  • Reading: 52 students have signed up. Next week marks the first week.
  • Homework 3 is due today! Homework 4 is assigned.

3 of 87

Homework 2

4 of 87

Did You Implement a Trie in Homework 1?

  • We did it at a character level
  • Based on the frequency of words
  • At the word level: we’d do the same (see the sketch after the table below)
    • Probabilistically predict the next word by ranking the most frequently occurring next words
    • Use efficient data structures to store patterns

Operation   | Time Complexity | Auxiliary Space
Insertion   | O(n)            | O(n*m)
Searching   | O(n)            | O(1)
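A minimal sketch of the word-level idea, assuming a dict-based trie that keeps a count at every node (names like WordTrie and predict_next are illustrative, not the Homework 1 interface):

    from collections import defaultdict

    class WordTrie:
        """Stores word sequences; each node counts how often it was reached."""
        def __init__(self):
            self.children = defaultdict(WordTrie)
            self.count = 0

        def insert(self, words):
            node = self
            for w in words:
                node = node.children[w]
                node.count += 1

        def predict_next(self, prefix, k=3):
            """Return up to k most frequent continuations of the given word prefix."""
            node = self
            for w in prefix:
                if w not in node.children:
                    return []
                node = node.children[w]
            ranked = sorted(node.children.items(), key=lambda kv: kv[1].count, reverse=True)
            return [w for w, _ in ranked[:k]]

    trie = WordTrie()
    for sentence in ["the cat sat", "the cat ran", "the dog sat"]:
        trie.insert(sentence.split())
    print(trie.predict_next(["the", "cat"]))  # ['sat', 'ran'] (stable sort keeps insertion order on ties)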

5 of 87

Some examples of language models

6 of 87

Language Models

  • Language Model - a machine learning model that uses statistics and probabilities to predict the likelihood of words or sequences of words appearing in a sentence or phrase.
  • Precursor to “Large” Language Models

Autocomplete: the LM predicts the most likely next word

Automatic speech recognition (ASR): the LM plays a major role in ASR

Machine translation: the LM helps produce accurate translations

Information Retrieval: an LM is built for each document, and the document with the highest probability for a query is ranked highest

Text generation: the LM generates original text, e.g., emails / blogs

Question answering: the LM answers questions

7 of 87

Today’s Lecture

The N-Gram Model: The Simplest Language Model

8 of 87

Applications of N-Gram Modeling (Today’s Lecture)

  • Speech Recognition
  • Spelling Correction
  • Augmentative communication (supporting physically disabled persons)

P(“eyes awe of an”) < P(“I saw a van”)

P(“he entered the ship to buy it”) < P(“he entered the shop to buy it”)

9 of 87

Learning Objectives

  • Create a basic language model (LM) from a text corpus to
    • Estimate the probability of word sequences
    • Estimate the probability of a word following a sequence of words
  • Apply the concept to autocomplete a sentence with the most likely suggestions

10 of 87

Learning Objectives

  • Process a Text Corpus into an N-Gram Language Model
  • Out of Vocabulary (OOV) Words
  • Smoothing for unseen N-Grams
  • Language Model Evaluation (Perplexity)

Sentence auto-complete

Today and in Homework:

We’ll write a program that will write text on its own!

11 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

12 of 87

What is a language model?

  • A language model is a probabilistic representation of a sequence of words

Language Model

“hello world”

“You are falling asleep…”

Represented by the joint probability of seeing the words in the order in which they appear:

How likely is a sequence of words to appear?
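As a formula (the slide's original equation was an image; this is the standard notation):

    P(w_1, w_2, \dots, w_n), \qquad \text{e.g. } P(\text{hello}, \text{world})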

13 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

14 of 87

Some notation

Notation for word ordering
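A common convention, stated here as an assumption since the slide's own equations were images:

    w_1^n = w_1 w_2 \dots w_n, \qquad w_1^{i-1} = w_1 \dots w_{i-1}

so P(w_1^n) is the probability of the whole sequence and P(w_i \mid w_1^{i-1}) is the probability of word i given everything before it.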

15 of 87

Applying the Chain Rule

“The cat sat on the mat after going to the litter box and sniffing around.”

Recall Bayes Rule
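Written out, the chain rule factors the joint probability of the sentence into conditional probabilities (the slide's equation was an image; this is the standard form):

    P(w_1, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_n \mid w_1, \dots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

For the example, the first factors are P(\text{The})\,P(\text{cat} \mid \text{The})\,P(\text{sat} \mid \text{The cat})\cdots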

16 of 87

n-gram Modeling

17 of 87

Uni-Grams / Bi-Grams / Tri-Grams
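A minimal sketch of extracting unigrams, bigrams, and trigrams from a token sequence (illustrative only, not the homework's required interface):

    def ngrams(tokens, n):
        """Return the list of n-grams (as tuples) in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the cat sat on the mat".split()
    print(ngrams(tokens, 1))  # unigrams: [('the',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
    print(ngrams(tokens, 2))  # bigrams:  [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
    print(ngrams(tokens, 3))  # trigrams: [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]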

18 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

19 of 87

Unigram Estimates of Single Word Probabilities

  • Corpus: Natural language processing is an amazing class. Processing is fun right?
  • Current size of the corpus is m = 11 words
  • What’s the unigram definition?
  • How would you calculate the unigram probability?
  • What is it for the word “amazing”?
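A worked answer (maximum likelihood estimate over the m = 11 tokens):

    \hat{P}(w) = \frac{\mathrm{count}(w)}{m}, \qquad \hat{P}(\text{amazing}) = \frac{1}{11}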

20 of 87

Bigram Estimates of Word Pair Probabilities

  • Corpus: Natural language processing is an amazing class. Processing is fun right?
  • Current size of the corpus is m = 11 words
  • What’s the bigram definition?
  • How would you calculate the bigram probability?
  • What is the bigram probability for “processing is”?
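A worked answer for the conditional bigram MLE, assuming we lowercase so that both occurrences of “processing” match:

    \hat{P}(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}, \qquad \hat{P}(\text{is} \mid \text{processing}) = \frac{\mathrm{count}(\text{processing is})}{\mathrm{count}(\text{processing})} = \frac{2}{2} = 1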

21 of 87

Conditional Probability Distributions of Word Models

22 of 87

The Combinatorics of the N-Gram Model

word 1 ∈ {W},  word 2 ∈ {W},  …,  word n ∈ {W}

With a vocabulary {W}, there are |W|^n possible n-grams whose probabilities would need to be estimated.

23 of 87

Estimating the Probabilities

24 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

25 of 87

The N-Gram Model

26 of 87

The Markov Assumption

27 of 87

What does this mean (computationally)?

The probability of seeing any word given the last n words is the same as if you had seen all of the preceding words (it is conditionally independent of any word seen earlier than the last n words).

nth-Order Markov Assumption

Only need to estimate based on the n prior words
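Written out (standard form):

    P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n}, \dots, w_{i-1})

i.e., an nth-order Markov model conditions each word on only the n words that precede it.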

28 of 87

Sentence Probability w/ML Estimates under Bigram Markov Assumption

  • Without any seeds, how would we predict words with a bigram?
  • How would we start?

  • Prob of the first word:
  • Prob of the second word:
  • Prob of the third word:
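Under the bigram assumption, and assuming a start-of-sentence token <s> (see the “Starting / Ending of Sentences” slide later in the deck), these three quantities are:

    P(w_1 \mid \langle s \rangle), \qquad P(w_2 \mid w_1), \qquad P(w_3 \mid w_2)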

29 of 87

Summary

  • “The teacher drinks tea”
  • Using the Markov assumption:
  • Modeling sentences
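With a bigram model, the example works out as (standard decomposition; the slide's own equation was an image):

    P(\text{the teacher drinks tea}) \approx P(\text{the} \mid \langle s\rangle)\,P(\text{teacher} \mid \text{the})\,P(\text{drinks} \mid \text{teacher})\,P(\text{tea} \mid \text{drinks})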

30 of 87

Building the Joint Probability (Example)

“I know it is wet and it is not sunny, but it is lots of good fun that I know is funny.”
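A minimal sketch of turning this sentence into bigram counts and MLE probabilities (illustrative; the lowercasing and punctuation stripping are assumptions about the intended preprocessing):

    from collections import Counter

    text = ("I know it is wet and it is not sunny, but it is lots of good fun "
            "that I know is funny.")
    tokens = text.lower().replace(",", "").replace(".", "").split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    # MLE: P(w2 | w1) = count(w1, w2) / count(w1)
    def bigram_prob(w1, w2):
        return bigram_counts[(w1, w2)] / unigram_counts[w1]

    print(bigram_prob("it", "is"))   # 1.0 -- "it" is always followed by "is" here (3/3)
    print(bigram_prob("i", "know"))  # 1.0 -- "i" is always followed by "know" here (2/2)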

31 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

32 of 87

Sentence Probability w/ML Estimates under Bigram Markov Assumption

  • Without any seeds, how would we predict words with a bigram?
  • How would we start?

  • Generate the first word:
  • Generate the second word:
  • Generate the third word:

33 of 87

Or, Complete Sentences (e.g., with 4-Grams)

Given a seed of the first three words w1 w2 w3:

Generate w4 using P(w4 | w1 w2 w3), then w5 using P(w5 | w2 w3 w4), then w6 using P(w6 | w3 w4 w5), and so on.

34 of 87

Generation Process

  • Greedy Algorithm: Just pick the highest-probability word every time
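A minimal sketch of greedy generation from bigram counts (illustrative; the toy corpus and the <s> / </s> markers are assumptions):

    from collections import Counter, defaultdict

    corpus = "<s> the cat sat on the mat </s> <s> the cat ran </s>".split()
    bigram_counts = defaultdict(Counter)
    for w1, w2 in zip(corpus, corpus[1:]):
        bigram_counts[w1][w2] += 1

    def generate_greedy(start="<s>", max_len=10):
        """Repeatedly append the single most probable next word."""
        words = [start]
        while len(words) < max_len and words[-1] != "</s>":
            next_counts = bigram_counts[words[-1]]
            if not next_counts:
                break
            words.append(next_counts.most_common(1)[0][0])
        return " ".join(words)

    print(generate_greedy())
    # "<s> the cat sat on the cat sat on the" -- note how greedy decoding can get stuck in a loop

Sampling from the conditional distribution instead of always taking the argmax gives more varied (and often more natural) output.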

35 of 87

Some Generated Text

36 of 87

Limitations

Main limitation: n-gram models cannot capture long-range dependencies.

Rachel/Phillip wanted to drive to the stadium, but the police officer told her/him to head over to the house.

This is because n is low, i.e., the context that the model can process is limited (for example, n < 4).

Modern LMs have much longer contexts.

37 of 87

An Example of a Modern LM: GPT-2

38 of 87

N-Grams Pre-Processing

39 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

40 of 87

Definitions for Your Vocabulary

  • Closed Vocabulary Inference - in some tasks, like speech recognition or question answering, you will encounter and generate words only from a fixed set of words; hence, a closed vocabulary.

  • Open Vocabulary Inference - you might encounter words outside the vocabulary of the training set, like the name of a new city.

41 of 87

Starting / Ending of Sentences

42 of 87

Criteria for Creating Vocabulary

  • Minimum word frequency f (see the sketch below)
  • Maximum |V|: include words by frequency
  • Use <unk> sparingly
  • Perplexity: only compare LMs that use the same V
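A minimal sketch of applying these criteria with a frequency threshold and <unk> replacement (function names are illustrative):

    from collections import Counter

    def build_vocab(tokens, min_freq=2, max_size=None):
        """Keep words occurring at least min_freq times, optionally capped at max_size."""
        counts = Counter(tokens)
        words = [w for w, c in counts.most_common(max_size) if c >= min_freq]
        return set(words) | {"<unk>"}

    def replace_oov(tokens, vocab):
        """Map out-of-vocabulary tokens to <unk>."""
        return [w if w in vocab else "<unk>" for w in tokens]

    train = "the cat sat on the mat the cat ran".split()
    vocab = build_vocab(train, min_freq=2)            # {'the', 'cat', '<unk>'}
    print(replace_oov("the dog sat".split(), vocab))  # ['the', '<unk>', '<unk>']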

43 of 87

N-Grams: What’s Out of Vocabulary?

44 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

45 of 87

Language Model Probability Distributions

  • Word frequencies generally follow a Zipfian distribution (in most corpora)
  • Long tail of infrequent words – when estimating, some will have zero probability!
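Zipf's law, for reference: the frequency of the r-th most frequent word falls off roughly as the inverse of its rank,

    f(r) \propto \frac{1}{r}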

46 of 87

Language Model Probability Distributions

  • Word frequencies generally follow a Zipfian distribution (in most corpora)
  • Long tail of infrequent words – when estimating, some will have zero probability!

47 of 87

Generalization of N-Grams

48 of 87

Smoothing

49 of 87

Smoothing: Intuition

  • How to remedy zero probability estimates

50 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

51 of 87

Laplacian Smoothing
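The standard add-one (Laplace) estimate for bigrams, shown here for reference since the slide's equation was an image:

    P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i) + 1}{\mathrm{count}(w_{i-1}) + |V|}

Add-α (additive) smoothing generalizes this by replacing the 1 with α and the |V| with α|V|.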

52 of 87

Raw Counts: Bigrams

53 of 87

Smoothed Counts: Bigrams

54 of 87

Smoothed Probabilities: Bigrams

55 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

56 of 87

Linear Interpolation
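The standard trigram form, for reference: mix the trigram, bigram, and unigram MLEs with weights that sum to one,

    \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1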

57 of 87

Linear Interpolation: Choosing Lambdas

58 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

59 of 87

Discounting

60 of 87

Absolute Discounting
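The standard bigram form, for reference: subtract a fixed discount d from every observed count and give the freed-up mass to a lower-order (e.g., unigram) distribution,

    P_{\text{AbsDisc}}(w_i \mid w_{i-1}) = \frac{\max\!\big(\mathrm{count}(w_{i-1}, w_i) - d,\ 0\big)}{\mathrm{count}(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)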

61 of 87

Absolute Discounting

62 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

63 of 87

Quiz

Corpus: “I am happy I am learning”

In the context of our corpus, what is the estimated probability of the word “can” following the word “I”, using the bigram model and add-α smoothing with α = 3?

  1. P(can|I) = 0
  2. P(can|I) = 1
  3. P(can|I) = 3/(2 + 3*4)
  4. P(can|I) = 3/(3*4)
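A worked check (assuming V = {I, am, happy, learning}, so |V| = 4, with count(I) = 2 and count(I can) = 0):

    P(\text{can} \mid \text{I}) = \frac{\mathrm{count}(\text{I can}) + \alpha}{\mathrm{count}(\text{I}) + \alpha|V|} = \frac{0 + 3}{2 + 3 \cdot 4} = \frac{3}{14}

which corresponds to option 3.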

64 of 87

Laboratory 5.3 - Building the Model

65 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

66 of 87

Splitting the Data

67 of 87

Perplexity (ppl)

  • A measure of how well a language model predicts the next word
  • For a test corpus with words w1, w2, …, wn (see the formulas below):

  • Unigram model

  • Minimizing perplexity ~ maximizing the probability of the corpus
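For reference, the standard definition (the slide's own formulas were images):

    \text{PPL}(W) = P(w_1, w_2, \dots, w_n)^{-1/n} = \sqrt[n]{\frac{1}{P(w_1, w_2, \dots, w_n)}}

which for a unigram model becomes

    \text{PPL}(W) = \left(\prod_{i=1}^{n} P(w_i)\right)^{-1/n}

Lower perplexity means the model assigns higher probability to the test corpus.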

68 of 87

Perplexity: Intuition

69 of 87

Perplexity: Intuition

70 of 87

The Perplexity of Models

Wall Street Journal corpus: 38M words for training, 1.5M words for evaluation

71 of 87

72 of 87

Graveyard

73 of 87

Modeling Language

Section 0: Administrivia

Section 1: What does it mean to model language?

Section 2: Introduction to modeling sequences

Section 3: Analysis and evaluation

74 of 87

Example of modeling language sequence

PRP: Personal Pronoun

VBZ: Verb, singular, present tense, 3rd person

NNS: Plural noun

IN: Preposition / subordinating conjunction

DT: Determiner

NN: Singular noun

75 of 87

Using sequential models in part-of-speech tagging

Named entity recognition - uses part-of-speech tagging to extract dates, proper nouns / entities, and qualifying descriptors, and to associate entities with each other

76 of 87

Common applications in graph analytics

Co-occurrence of named entities within documents

Applications:

  • Law enforcement
  • E-commerce analytics
  • Search algorithms

77 of 87

Agnosticism to Order

78 of 87

Multi-functional words

79 of 87

Section 2: Introduction to the Markov Chain

  • Why model sequences?
  • Introduction to the Markov Chain

  • Introduction to the Hidden Markov Model
  • Decoding with the Hidden Markov Model

80 of 87

What is sequential modeling?

  • We’ve just described sequential modeling
  • Sequential modeling: modeling what comes next!
    • Given: an observation of text
    • Predict: what text comes next
  • All language models are sequential models
    • Can predict a single word
    • Can predict a sequence of words

81 of 87

Modern sequential modeling

  • Simple probabilistic models (we’ve just discussed these)

  • The Hidden Markov Model (today, as an introduction; actually more difficult than DNNs)

  • Deep Neural Networks (for later … the majority of the class)
    • Convolutional Neural Networks
    • Recurrent Neural Networks
    • Transformer Neural Networks
    • Large Language Models (LLMs)
    • LLMs trained with RLHF

82 of 87

Markov Chains

  • Initial probability: the probability of the starting state

  • Transition probabilities between states
  • Markov Assumption: the next state depends only on the current state

[State diagram: states s1 through s5 with transitions]

83 of 87

Markov Chains

  • Models the probability of sequences.

  • Each state can take one of K values (assuming, for simplicity, a discrete state space s ∈ {1, …, K})
  • A Markov Chain is entirely specified by the initial state distribution and the transition probabilities between states

[State diagram: states s1 through s5 with transitions]

84 of 87

Markov Chains

  • A Markov chain can model many things, but it assumes the emitted variable (the state) is entirely visible
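A minimal sketch of a Markov chain as an initial distribution plus a transition table, with sampling (the states and probabilities are made up for illustration):

    import random

    initial = {"sunny": 0.5, "rainy": 0.5}  # initial state distribution
    transitions = {                          # P(next state | current state); rows sum to 1
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def sample(dist):
        """Draw one outcome from a {value: probability} dict."""
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def run_chain(steps=5):
        state = sample(initial)
        states = [state]
        for _ in range(steps - 1):
            state = sample(transitions[state])  # next state depends only on the current state
            states.append(state)
        return states

    print(run_chain())  # e.g. ['sunny', 'sunny', 'rainy', 'rainy', 'sunny']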

85 of 87

Section 2: Introduction to the Hidden Markov Model

  • Why model sequences?
  • Introduction to the Markov Chain

  • Introduction to the Hidden Markov Model
  • Decoding with the Hidden Markov Model

86 of 87

Do we use HMMs for language modeling?

Typically no.

  • HMMs usually model smaller tasks like part-of-speech tagging or named entity recognition, not necessarily the language itself.
  • An HMM typically does not model the words themselves.
  • While HMMs can be used, they are simplistic compared to modern techniques like:
    • Recurrent Neural Networks (RNNs),
    • Long Short-Term Memory (LSTM) networks,
    • Transformer models

87 of 87

Modeling Language

Section 0: Administrivia

Section 1: What does it mean to model language?

Section 2: Introduction to Sequential Modeling

Section 3: Analysis and Evaluation