CS458 Natural Language Processing
Self-study 4
Language Modelling
Krishnendu Ghosh
Department of Computer Science & Engineering
Indian Institute of Information Technology Dharwad
1. Tokenization
Before creating a language model, tokenize the text into sentences or words.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
text = "NLP is exciting. It helps in language understanding!"
sentences = sent_tokenize(text)
tokens = word_tokenize(text)
print("Sentences:", sentences)
print("Tokens:", tokens)
2. Creating an N-gram Model
An N-gram model predicts the next word based on the previous N−1 words. NLTK provides tools for generating and analyzing N-grams.
Generate N-grams:
from nltk.util import ngrams
from collections import Counter
tokens = word_tokenize("I love NLP and NLP loves me.")
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)
# Count frequencies
bigram_freq = Counter(bigrams)
print("Bigram Frequencies:", bigram_freq)
3. Train a Simple Language Model
A simple N-gram model estimates probabilities from corpus frequencies; for bigrams, P(w2 | w1) = count(w1, w2) / count(w1).
Example:
# Sample corpus
corpus = "I love NLP. NLP loves me. NLP is fun."
tokens = word_tokenize(corpus)
bigrams = list(ngrams(tokens, 2))
# Frequency distribution
bigram_freq = Counter(bigrams)
unigram_freq = Counter(tokens)
# Calculate probabilities
bigram_prob = {bigram: freq / unigram_freq[bigram[0]] for bigram, freq in bigram_freq.items()}
print("Bigram Probabilities:", bigram_prob)
4. Generate Text Using the N-gram Model
Use the trained N-gram model to generate text probabilistically.
import random
def generate_text(start_word, bigram_prob, num_words=5):
    current_word = start_word
    text = [current_word]
    for _ in range(num_words - 1):
        # Find possible next words
        candidates = {k[1]: v for k, v in bigram_prob.items() if k[0] == current_word}
        if not candidates:
            break
        # Choose next word based on probability
        current_word = random.choices(list(candidates.keys()), weights=list(candidates.values()))[0]
        text.append(current_word)
    return ' '.join(text)
# Generate text
start_word = "NLP"
generated_text = generate_text(start_word, bigram_prob)
print("Generated Text:", generated_text)
5. Smoothing
Handle unseen N-grams using Laplace (add-one) smoothing:
vocab_size = len(set(tokens))
bigram_prob_smoothed = {bigram: (freq + 1) / (unigram_freq[bigram[0]] + vocab_size)
for bigram, freq in bigram_freq.items()}
print("Smoothed Bigram Probabilities:", bigram_prob_smoothed)
6. Evaluate the Language Model
Evaluate the model using perplexity, which measures how well a language model predicts a sample.
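For a bigram model evaluated on N test bigrams, the code below implements the standard definition: perplexity = (∏ 1 / P(w_i | w_{i−1}))^(1/N), where the product runs over the N bigrams of the test text; lower perplexity means the model predicts the sample better.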
Perplexity:
import math
def calculate_perplexity(test_corpus, bigram_prob):
    tokens = word_tokenize(test_corpus)
    bigrams = list(ngrams(tokens, 2))
    perplexity = 1
    N = len(bigrams)
    for bigram in bigrams:
        prob = bigram_prob.get(bigram, 1e-6)  # Small value for unseen bigrams
        perplexity *= 1 / prob
    return math.pow(perplexity, 1 / N)
test_corpus = "NLP is fun and exciting."
perplexity = calculate_perplexity(test_corpus, bigram_prob_smoothed)
print("Perplexity:", perplexity)
Thank You