1 of 8

CS458 Natural Language Processing

Self-study 4

Language Modelling

Krishnendu Ghosh

Department of Computer Science & Engineering

Indian Institute of Information Technology Dharwad

2 of 8

Language Modelling

1. Tokenization

Before creating a language model, tokenize the text into sentences or words.

import nltk

from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

text = "NLP is exciting. It helps in language understanding!"

sentences = sent_tokenize(text)

tokens = word_tokenize(text)

print("Sentences:", sentences)

print("Tokens:", tokens)

3 of 8

Language Modelling

2. Creating an N-gram Model

An N-gram model predicts the next word based on the previous N−1 words. NLTK provides tools for generating and analyzing N-grams.
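Formally, the model approximates P(w_n | w_1, …, w_{n−1}) ≈ P(w_n | w_{n−N+1}, …, w_{n−1}); for a bigram model (N = 2) this reduces to P(w_n | w_{n−1}).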

Generate N-grams:

from nltk.util import ngrams

from collections import Counter

tokens = word_tokenize("I love NLP and NLP loves me.")

bigrams = list(ngrams(tokens, 2))

print("Bigrams:", bigrams)

# Count frequencies

bigram_freq = Counter(bigrams)

print("Bigram Frequencies:", bigram_freq)

4 of 8

Language Modelling

3. Train a Simple Language Model

A simple N-gram model can use probabilities estimated from corpus frequencies.
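Concretely, the maximum-likelihood estimate for a bigram is P(w_n | w_{n−1}) = count(w_{n−1} w_n) / count(w_{n−1}), which the code below computes directly from the corpus counts.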

Example:

# Sample corpus

corpus = "I love NLP. NLP loves me. NLP is fun."

tokens = word_tokenize(corpus)

bigrams = list(ngrams(tokens, 2))

# Frequency distribution

bigram_freq = Counter(bigrams)

unigram_freq = Counter(tokens)

# Calculate probabilities

bigram_prob = {bigram: freq / unigram_freq[bigram[0]] for bigram, freq in bigram_freq.items()}

print("Bigram Probabilities:", bigram_prob)

5 of 8

Language Modelling

4. Generate Text Using the N-gram Model

Use the trained N-gram model to generate text probabilistically.

import random

def generate_text(start_word, bigram_prob, num_words=5):
    current_word = start_word
    text = [current_word]
    for _ in range(num_words - 1):
        # Find possible next words
        candidates = {k[1]: v for k, v in bigram_prob.items() if k[0] == current_word}
        if not candidates:
            break
        # Choose next word based on probability
        current_word = random.choices(list(candidates.keys()), weights=candidates.values())[0]
        text.append(current_word)
    return ' '.join(text)

# Generate text

start_word = "NLP"

generated_text = generate_text(start_word, bigram_prob)

print("Generated Text:", generated_text)

6 of 8

Language Modelling

5. Smoothing

Under raw frequency estimates, unseen N-grams receive zero probability; Laplace (add-one) smoothing handles this.
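With add-one smoothing, the bigram estimate becomes P(w_n | w_{n−1}) = (count(w_{n−1} w_n) + 1) / (count(w_{n−1}) + V), where V is the vocabulary size: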

vocab_size = len(set(tokens))

bigram_prob_smoothed = {bigram: (freq + 1) / (unigram_freq[bigram[0]] + vocab_size)
                        for bigram, freq in bigram_freq.items()}

print("Smoothed Bigram Probabilities:", bigram_prob_smoothed)

7 of 8

Language Modelling

6. Evaluate the Language Model

Evaluate using Perplexity, which measures how well a language model predicts a sample.
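For a bigram model over N test bigrams, PP(W) = (∏ 1 / P(w_i | w_{i−1}))^(1/N); lower perplexity indicates a better fit.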

Perplexity:

import math

def calculate_perplexity(test_corpus, bigram_prob):
    tokens = word_tokenize(test_corpus)
    bigrams = list(ngrams(tokens, 2))
    perplexity = 1
    N = len(bigrams)
    for bigram in bigrams:
        prob = bigram_prob.get(bigram, 1e-6)  # Small value for unseen bigrams
        perplexity *= 1 / prob
    return math.pow(perplexity, 1 / N)

test_corpus = "NLP is fun and exciting."

perplexity = calculate_perplexity(test_corpus, bigram_prob_smoothed)

print("Perplexity:", perplexity)

8 of 8

Thank You