1 of 87

CS 6120

Lecture 5: Introduction to Modeling Language

2 of 87

Administrative

  • Today: Homework Volunteer and Voluntelling Students
  • Reading: 52 students have signed up. Next week marks the first week.
  • Homework 3 is due today! Homework 4 is assigned.

3 of 87

Homework 2

4 of 87

Did You Implement a Trie in Homework 1?

  • We did it at a character level
  • Based on the frequency of words
  • At the word level: we’d do the same (see the sketch after the table below)
    • Probabilistically predict the next word by ranking the most frequently occurring next words
    • Use efficient data structures to store patterns

Operation   | Time Complexity | Auxiliary Space
Insertion   | O(n)            | O(n*m)
Searching   | O(n)            | O(1)
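A minimal sketch of the word-level idea, assuming a dict-based trie that keeps a count at every node (names like WordTrie and predict_next are illustrative, not the Homework 1 interface):

    from collections import defaultdict

    class WordTrie:
        """Stores word sequences; each node counts how often it was reached."""
        def __init__(self):
            self.children = defaultdict(WordTrie)
            self.count = 0

        def insert(self, words):
            node = self
            for w in words:
                node = node.children[w]
                node.count += 1

        def predict_next(self, prefix, k=3):
            """Return up to k most frequent continuations of the given word prefix."""
            node = self
            for w in prefix:
                if w not in node.children:
                    return []
                node = node.children[w]
            ranked = sorted(node.children.items(), key=lambda kv: kv[1].count, reverse=True)
            return [w for w, _ in ranked[:k]]

    trie = WordTrie()
    for sentence in ["the cat sat", "the cat ran", "the dog sat"]:
        trie.insert(sentence.split())
    print(trie.predict_next(["the", "cat"]))  # ['sat', 'ran'] (stable sort keeps insertion order on ties)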

5 of 87

Some examples of language models

6 of 87

Language Models

  • Language Model - a machine learning model that uses statistics and probabilities to predict the likelihood of words or sequences of words appearing in a sentence or phrase.
  • Precursor to “Large” Language Models

Autocomplete: the LM predicts the most likely next word

Automatic speech recognition (ASR): the LM plays a major role in ASR

Machine translation: the LM helps produce accurate translations

Information Retrieval: an LM is built for each document, and the document with the highest probability for a query is ranked highest

Text generation: the LM generates original text, e.g., emails / blogs

Question answering: the LM answers questions

7 of 87

Today’s Lecture

The N-Gram Model: The Simplest Language Model

8 of 87

Applications of N-Gram Modeling (Today’s Lecture)

  • Speech Recognition
  • Spelling Correction
  • Augmentative communication (supporting physically disabled persons)

P(“eyes awe of an”) < P(“I saw a van”)

P(“he entered the ship to buy it”) < P(“he entered the shop to buy it”)

9 of 87

Learning Objectives

  • Create a basic language model (LM) from a text corpus to
    • Estimate the probability of word sequences
    • Estimate the probability of a word following a sequence of words
  • Apply the concept to autocomplete a sentence with the most likely suggestions

10 of 87

Learning Objectives

  • Process a Text Corpus into an N-Gram Language Model
  • Out of Vocabulary (OOV) Words
  • Smoothing for unseen N-Grams
  • Language Model Evaluation (Perplexity)

Sentence auto-complete

Today and in Homework:

We’ll write a program that will write text on its own!

11 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

12 of 87

What is a language model?

  • A language model is a probabilistic representation of a sequence of words

Language Model

“hello world”

“You are falling asleep…”

Represented by the joint probability of seeing the words in the order in which they appear:

How likely is a sequence of words to appear?
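As a formula (the slide's original equation was an image; this is the standard notation):

    P(w_1, w_2, \dots, w_n), \qquad \text{e.g. } P(\text{hello}, \text{world})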

13 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

14 of 87

Some notation

Notation for word ordering
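A common convention, stated here as an assumption since the slide's own equations were images:

    w_1^n = w_1 w_2 \dots w_n, \qquad w_1^{i-1} = w_1 \dots w_{i-1}

so P(w_1^n) is the probability of the whole sequence and P(w_i \mid w_1^{i-1}) is the probability of word i given everything before it.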

15 of 87

Applying the Chain Rule

“The cat sat on the mat after going to the litter box and sniffing around.”

Recall Bayes Rule
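Written out, the chain rule factors the joint probability of the sentence into conditional probabilities (the slide's equation was an image; this is the standard form):

    P(w_1, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_n \mid w_1, \dots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

For the example, the first factors are P(\text{The})\,P(\text{cat} \mid \text{The})\,P(\text{sat} \mid \text{The cat})\cdots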

16 of 87

n-gram Modeling

17 of 87

Uni-Grams / Bi-Grams / Tri-Grams
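A minimal sketch of extracting unigrams, bigrams, and trigrams from a token sequence (illustrative only, not the homework's required interface):

    def ngrams(tokens, n):
        """Return the list of n-grams (as tuples) in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the cat sat on the mat".split()
    print(ngrams(tokens, 1))  # unigrams: [('the',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
    print(ngrams(tokens, 2))  # bigrams:  [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
    print(ngrams(tokens, 3))  # trigrams: [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat')]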

18 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

19 of 87

Unigram Estimates of Single Word Probabilities

  • Corpus: Natural language processing is an amazing class. Processing is fun right?
  • Current size of the corpus is m = 11 words
  • What’s the unigram definition?
  • How would you calculate the unigram probability?
  • What is it for the word “amazing”?
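A worked answer (maximum likelihood estimate over the m = 11 tokens):

    \hat{P}(w) = \frac{\mathrm{count}(w)}{m}, \qquad \hat{P}(\text{amazing}) = \frac{1}{11}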

20 of 87

Bigram Estimates of Word Pair Probabilities

  • Corpus: Natural language processing is an amazing class. Processing is fun right?
  • Current size of the corpus is m = 11 words
  • What’s the bigram definition?
  • How would you calculate the bigram probability?
  • What is the bigram probability for “processing is”?
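A worked answer for the conditional bigram MLE, assuming we lowercase so that both occurrences of “processing” match:

    \hat{P}(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}, \qquad \hat{P}(\text{is} \mid \text{processing}) = \frac{\mathrm{count}(\text{processing is})}{\mathrm{count}(\text{processing})} = \frac{2}{2} = 1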

21 of 87

Conditional Probability Distributions of Word Models

22 of 87

The Combinatorics of the N-Gram Model

word 1 ∈ {W},  word 2 ∈ {W},  …,  word n ∈ {W}

With a vocabulary {W}, there are |W|^n possible n-grams whose probabilities would need to be estimated.

23 of 87

Estimating the Probabilities

24 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

25 of 87

The N-Gram Model

26 of 87

The Markov Assumption

27 of 87

What does this mean (computationally)?

The probability of seeing any word given the last n words is the same as if you had seen all of the preceding words (it is conditionally independent of any word seen earlier than the last n words).

nth-Order Markov Assumption

Only need to estimate based on the n prior words
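Written out (standard form):

    P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n}, \dots, w_{i-1})

i.e., an nth-order Markov model conditions each word on only the n words that precede it.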

28 of 87

Sentence Probability w/ML Estimates under Bigram Markov Assumption

  • Without any seeds, how would we predict words with a bigram?
  • How would we start?

  • Prob of the first word:
  • Prob of the second word:
  • Prob of the third word:
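Under the bigram assumption, and assuming a start-of-sentence token <s> (see the “Starting / Ending of Sentences” slide later in the deck), these three quantities are:

    P(w_1 \mid \langle s \rangle), \qquad P(w_2 \mid w_1), \qquad P(w_3 \mid w_2)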

29 of 87

Summary

  • “The teacher drinks tea”
  • Using the Markov assumption:
  • Modeling sentences
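With a bigram model, the example works out as (standard decomposition; the slide's own equation was an image):

    P(\text{the teacher drinks tea}) \approx P(\text{the} \mid \langle s\rangle)\,P(\text{teacher} \mid \text{the})\,P(\text{drinks} \mid \text{teacher})\,P(\text{tea} \mid \text{drinks})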

30 of 87

Building the Joint Probability (Example)

“I know it is wet and it is not sunny, but it is lots of good fun that I know is funny.”
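A minimal sketch of turning this sentence into bigram counts and MLE probabilities (illustrative; the lowercasing and punctuation stripping are assumptions about the intended preprocessing):

    from collections import Counter

    text = ("I know it is wet and it is not sunny, but it is lots of good fun "
            "that I know is funny.")
    tokens = text.lower().replace(",", "").replace(".", "").split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    # MLE: P(w2 | w1) = count(w1, w2) / count(w1)
    def bigram_prob(w1, w2):
        return bigram_counts[(w1, w2)] / unigram_counts[w1]

    print(bigram_prob("it", "is"))   # 1.0 -- "it" is always followed by "is" here (3/3)
    print(bigram_prob("i", "know"))  # 1.0 -- "i" is always followed by "know" here (2/2)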

31 of 87

Section 1: Introduction to N-Grams

  • Some Terminology
  • Maximum Likelihood Estimates of Isolated N-Grams
  • Probabilities of Sentences Using N-Gram MLEs

  • Generating Text with N-Gram Models

32 of 87

Sentence Probability w/ML Estimates under Bigram Markov Assumption

  • Without any seeds, how would we predict words with a bigram?
  • How would we start?

  • Generate the first word:
  • Generate the second word:
  • Generate the third word:

33 of 87

Or, Complete Sentences (e.g., with 4-Grams)

Given a seed of the first three words w1 w2 w3:

Generate w4 using P(w4 | w1 w2 w3), then w5 using P(w5 | w2 w3 w4), then w6 using P(w6 | w3 w4 w5), and so on.

34 of 87

Generation Process

  • Greedy Algorithm: Just pick the highest-probability word every time
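A minimal sketch of greedy generation from bigram counts (illustrative; the toy corpus and the <s> / </s> markers are assumptions):

    from collections import Counter, defaultdict

    corpus = "<s> the cat sat on the mat </s> <s> the cat ran </s>".split()
    bigram_counts = defaultdict(Counter)
    for w1, w2 in zip(corpus, corpus[1:]):
        bigram_counts[w1][w2] += 1

    def generate_greedy(start="<s>", max_len=10):
        """Repeatedly append the single most probable next word."""
        words = [start]
        while len(words) < max_len and words[-1] != "</s>":
            next_counts = bigram_counts[words[-1]]
            if not next_counts:
                break
            words.append(next_counts.most_common(1)[0][0])
        return " ".join(words)

    print(generate_greedy())
    # "<s> the cat sat on the cat sat on the" -- note how greedy decoding can get stuck in a loop

Sampling from the conditional distribution instead of always taking the argmax gives more varied (and often more natural) output.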

35 of 87

Some Generated Text

36 of 87

Limitations

Main limitation: n-gram models cannot capture long-range dependencies.

Rachel/Phillip wanted to drive to the stadium, but the police officer told her/him to head over to the house.

This is because n is low, i.e., the context that the model can process is limited (for example, n < 4).

Modern LMs have much longer contexts.

37 of 87

An Example of a Modern LM: GPT-2

38 of 87

N-Grams Pre-Processing

39 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

40 of 87

Definitions for Your Vocabulary

  • Closed Vocabulary Inference - in some tasks, like speech recognition or question answering, you will encounter and generate words only from a fixed set of words; hence, a closed vocabulary.

  • Open Vocabulary Inference - you might encounter words outside the vocabulary of the training set, like the name of a new city.

41 of 87

Starting / Ending of Sentences

42 of 87

Criteria for Creating Vocabulary

  • Minimum word frequency f (see the sketch below)
  • Maximum |V|: include words by frequency
  • Use <unk> sparingly
  • Perplexity: only compare LMs that use the same V
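A minimal sketch of applying these criteria with a frequency threshold and <unk> replacement (function names are illustrative):

    from collections import Counter

    def build_vocab(tokens, min_freq=2, max_size=None):
        """Keep words occurring at least min_freq times, optionally capped at max_size."""
        counts = Counter(tokens)
        words = [w for w, c in counts.most_common(max_size) if c >= min_freq]
        return set(words) | {"<unk>"}

    def replace_oov(tokens, vocab):
        """Map out-of-vocabulary tokens to <unk>."""
        return [w if w in vocab else "<unk>" for w in tokens]

    train = "the cat sat on the mat the cat ran".split()
    vocab = build_vocab(train, min_freq=2)            # {'the', 'cat', '<unk>'}
    print(replace_oov("the dog sat".split(), vocab))  # ['the', '<unk>', '<unk>']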

43 of 87

N-Grams: What’s Out of Vocabulary?

44 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

45 of 87

Language Model Probability Distributions

  • Word frequencies generally follow a Zipfian distribution (in most corpora)
  • Long tail of infrequent words – when estimating, some will have zero probability!
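Zipf's law, for reference: the frequency of the r-th most frequent word falls off roughly as the inverse of its rank,

    f(r) \propto \frac{1}{r}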

46 of 87

Language Model Probability Distributions

  • Word frequencies generally follow a Zipfian distribution (in most corpora)
  • Long tail of infrequent words – when estimating, some will have zero probability!

47 of 87

Generalization of N-Grams

48 of 87

Smoothing

49 of 87

Smoothing: Intuition

  • How to remedy zero probability estimates

50 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

51 of 87

Laplacian Smoothing
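The standard add-one (Laplace) estimate for bigrams, shown here for reference since the slide's equation was an image:

    P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i) + 1}{\mathrm{count}(w_{i-1}) + |V|}

Add-α (additive) smoothing generalizes this by replacing the 1 with α and the |V| with α|V|.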

52 of 87

Raw Counts: Bigrams

53 of 87

Smoothed Counts: Bigrams

54 of 87

Smoothed Probabilities: Bigrams

55 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

56 of 87

Linear Interpolation
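The standard trigram form, for reference: mix the trigram, bigram, and unigram MLEs with weights that sum to one,

    \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1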

57 of 87

Linear Interpolation: Choosing Lambdas

58 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

59 of 87

Discounting

60 of 87

Absolute Discounting
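The standard bigram form, for reference: subtract a fixed discount d from every observed count and give the freed-up mass to a lower-order (e.g., unigram) distribution,

    P_{\text{AbsDisc}}(w_i \mid w_{i-1}) = \frac{\max\!\big(\mathrm{count}(w_{i-1}, w_i) - d,\ 0\big)}{\mathrm{count}(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)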

61 of 87

Absolute Discounting

62 of 87

Section 3: Smoothing

  • Additive Smoothing (Laplace / Alpha and Add One)
  • Interpolation
  • Discounting & Backoff

63 of 87

Quiz

Corpus: “I am happy I am learning”

In the context of our corpus, what is the estimated probability of the word “can” following the word “I”, using the bigram model and add-α smoothing with α = 3?

  1. P(can|I) = 0
  2. P(can|I) = 1
  3. P(can|I) = 3/(2 + 3*4)
  4. P(can|I) = 3/(3*4)
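A worked check (assuming V = {I, am, happy, learning}, so |V| = 4, with count(I) = 2 and count(I can) = 0):

    P(\text{can} \mid \text{I}) = \frac{\mathrm{count}(\text{I can}) + \alpha}{\mathrm{count}(\text{I}) + \alpha|V|} = \frac{0 + 3}{2 + 3 \cdot 4} = \frac{3}{14}

which corresponds to option 3.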

64 of 87

Laboratory 5.3 - Building the Model

65 of 87

Modeling Language

Section 0: Administrivia

Section 1: Introduction to N-Grams

Section 2: Special Characters

Section 3: Smoothing

Section 4: Analysis and Evaluation

66 of 87

Splitting the Data

67 of 87

Perplexity (ppl)

  • A measure of how well a language model predicts the next word
  • For a test corpus with words w1, w2, …, wn (see the formulas below):

  • Unigram model

  • Minimizing perplexity ~ maximizing the probability of the corpus
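For reference, the standard definition (the slide's own formulas were images):

    \text{PPL}(W) = P(w_1, w_2, \dots, w_n)^{-1/n} = \sqrt[n]{\frac{1}{P(w_1, w_2, \dots, w_n)}}

which for a unigram model becomes

    \text{PPL}(W) = \left(\prod_{i=1}^{n} P(w_i)\right)^{-1/n}

Lower perplexity means the model assigns higher probability to the test corpus.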

68 of 87

Perplexity: Intuition

69 of 87

Perplexity: Intuition

70 of 87

The Perplexity of Models

Wall Street Journal corpus: 38M words for training, 1.5M words for evaluation

71 of 87

72 of 87

Graveyard

73 of 87

Modeling Language

Section 0: Administrivia

Section 1: What does it mean to model language?

Section 2: Introduction to modeling sequences

Section 3: Analysis and evaluation

74 of 87

Example of modeling language sequence

PRP: Personal Pronoun

VBZ: Verb, singular, present tense, 3rd person

NNS: Plural noun

IN: Preposition / subordinating conjunction

DT: Determiner

NN: Singular noun

75 of 87

Using sequential models in part-of-speech tagging

Named entity recognition - uses part-of-speech tagging to extract dates, proper nouns / entities, and qualifying descriptors, and to associate entities with each other

76 of 87

Common applications in graph analytics

Co-occurrence of named entities within documents

Applications:

  • Law enforcement
  • E-commerce analytics
  • Search algorithms

77 of 87

Agnosticism to Order

78 of 87

Multi-functional words

79 of 87

Section 2: Introduction to the Markov Chain

  • Why model sequences?
  • Introduction to the Markov Chain

  • Introduction to the Hidden Markov Model
  • Decoding with the Hidden Markov Model

80 of 87

What is sequential modeling?

  • We’ve just described sequential modeling
  • Sequential modeling: modeling what comes next!
    • Given: an observation of text
    • Predict: what text comes next
  • All language models are sequential models
    • Can predict a single word
    • Can predict a sequence of words

81 of 87

Modern sequential modeling

  • Simple probabilistic models (we’ve just discussed these)

  • The Hidden Markov Model (today, as an introduction; actually more difficult than DNNs)

  • Deep Neural Networks (for later … the majority of the class)
    • Convolutional Neural Networks
    • Recurrent Neural Networks
    • Transformer Neural Networks
    • Large Language Models (LLMs)
    • LLMs trained with RLHF

82 of 87

Markov Chains

  • Initial probability: the probability of the starting state

  • Transition probabilities between states
  • Markov Assumption: the next state depends only on the current state

[State diagram: states s1 through s5 with transitions]

83 of 87

Markov Chains

  • Models the probability of sequences.

  • Each state can take one of K values (assuming, for simplicity, a discrete state space s ∈ {1, …, K})
  • A Markov Chain is entirely specified by the initial state distribution and the transition probabilities between states

[State diagram: states s1 through s5 with transitions]

84 of 87

Markov Chains

  • A Markov chain can model many things, but it assumes the emitted variable (the state) is entirely visible
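A minimal sketch of a Markov chain as an initial distribution plus a transition table, with sampling (the states and probabilities are made up for illustration):

    import random

    initial = {"sunny": 0.5, "rainy": 0.5}  # initial state distribution
    transitions = {                          # P(next state | current state); rows sum to 1
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def sample(dist):
        """Draw one outcome from a {value: probability} dict."""
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def run_chain(steps=5):
        state = sample(initial)
        states = [state]
        for _ in range(steps - 1):
            state = sample(transitions[state])  # next state depends only on the current state
            states.append(state)
        return states

    print(run_chain())  # e.g. ['sunny', 'sunny', 'rainy', 'rainy', 'sunny']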

85 of 87

Section 2: Introduction to the Hidden Markov Model

  • Why model sequences?
  • Introduction to the Markov Chain

  • Introduction to the Hidden Markov Model
  • Decoding with the Hidden Markov Model

86 of 87

Do we use HMMs for language modeling?

Typically no.

  • HMMs usually model smaller tasks like part-of-speech tagging or named entity recognition, not necessarily the language itself.
  • An HMM typically does not model the words themselves.
  • While HMMs can be used, they are simplistic compared to modern techniques like:
    • Recurrent Neural Networks (RNNs),
    • Long Short-Term Memory (LSTM) networks,
    • Transformer models

87 of 87

Modeling Language

Section 0: Administrivia

Section 1: What does it mean to model language?

Section 2: Introduction to Sequential Modeling

Section 3: Analysis and Evaluation