Introduction to Speech & Natural Language Processing

Lecture 5

Lexical Processing Applications

Krishnendu Ghosh

Authorship Attribution / Stylometry

  • Uses lexical features like word frequency, function-word usage, average sentence length, vocabulary richness, etc.

  • Helps identify or verify the author of a text (e.g., literature, fake news detection).

Example: Distinguishing Shakespeare from Marlowe using lexical statistics.
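The comparison above can be sketched as a tiny profile-and-compare routine. This is a minimal sketch only: the short function-word list, the whitespace tokenization, and the cosine comparison are illustrative choices, not the method any particular stylometry system uses.

```python
from collections import Counter
import math

# Illustrative (hypothetical) set of English function words; real studies use hundreds.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "it", "is", "was", "for"]

def function_word_profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    total = len(tokens) or 1
    counts = Counter(tokens)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attribute(disputed, candidates):
    """Attribute a disputed text to the candidate author with the closest profile."""
    d = function_word_profile(disputed)
    return max(candidates,
               key=lambda name: cosine_similarity(d, function_word_profile(candidates[name])))
```

In practice the profiles would be built from large samples of each author's known writing, with proper tokenization and many more features (sentence length, vocabulary richness, etc.).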

Authorship Attribution / Stylometry: Applications

  • Literary and historical analysis

  • Forensic analysis

  • Plagiarism detection

  • Combating misinformation

Language Identification

  • Uses token patterns, character n-grams, and word frequency distributions to determine the language of a text snippet.

Example: Distinguishing Kannada vs. Hindi vs. English tweets.
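Character n-gram language identification can be sketched as below. The single-sentence "profiles" and the simple overlap score are toy assumptions (real systems train on large corpora and work on each script directly); Hindi is shown transliterated purely to keep the example ASCII.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, with padding spaces at the edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(snippet, profiles):
    """Pick the language whose profile shares the most n-gram mass with the snippet."""
    snip = char_ngrams(snippet)
    return max(profiles,
               key=lambda lang: sum(min(snip[g], profiles[lang][g]) for g in snip))

# Toy one-sentence profiles (hypothetical; real profiles come from large corpora).
profiles = {
    "english": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "hindi": char_ngrams("kya aap mujhe bata sakte hain ki yeh kahan hai"),
}
```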

Language Identification: Applications

  • Automatic Translation

  • Content Filtering

  • Information Retrieval

  • Speech Recognition

Spell Checking and Correction

  • Relies on tokenization and edit distance algorithms to suggest possible corrections.

Example: Correcting “recieve” → “receive” using minimum edit distance.
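The "recieve" → "receive" case can be reproduced with the standard dynamic-programming (Levenshtein) edit distance; the three-word vocabulary below is purely illustrative.

```python
def min_edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

def suggest(word, vocabulary):
    """Suggest the vocabulary word closest to the misspelling."""
    return min(vocabulary, key=lambda w: min_edit_distance(word, w))
```

Note that plain Levenshtein scores the "ie"/"ei" swap as two substitutions; a Damerau variant that allows transpositions would score it as one edit.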

Spell Checking and Correction: Applications

  • Word Processors

  • Content Creation

  • Search Engines

  • Learning Aid

Text Normalization for Search Engines

  • Converts user queries and documents into comparable lexical forms (e.g., stemming, lemmatization).

Example: A search for “running shoes” also matches documents containing “run” or “runners”, because stemming reduces all three forms to a shared stem.
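A toy suffix stripper shows the idea; the suffix list below is deliberately cherry-picked so that "running" and "runners" both reduce to "run", and it is far cruder than a real stemmer such as Porter's.

```python
def toy_stem(word):
    """Toy suffix stripping (illustrative only; not the Porter algorithm)."""
    for suffix in ("ning", "ners", "ing", "ers", "ed", "s"):
        # Keep at least three characters of stem to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Case-fold and stem every token so queries and documents compare equal."""
    return [toy_stem(token) for token in text.lower().split()]
```

Lemmatization would instead map each token to its dictionary form ("better" → "good"), which requires a vocabulary rather than suffix rules.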

Keyword Extraction and Indexing

  • Identifies most frequent or most informative tokens (TF-IDF, RAKE, etc.).

Example: Auto-generating tags for research papers or news articles.
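A minimal TF-IDF ranking over a tiny corpus can be sketched as follows; the tokenization and the particular IDF form (log of corpus size over document frequency) are simple illustrative choices among several common variants.

```python
import math
from collections import Counter

def tf_idf_keywords(docs, doc_index, top_k=3):
    """Rank tokens of docs[doc_index] by TF-IDF against the small corpus `docs`."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter()                      # document frequency of each token
    for toks in tokenized:
        df.update(set(toks))
    tf = Counter(tokenized[doc_index])  # term frequency in the target document
    total = len(tokenized[doc_index])
    scores = {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Common words like "the" occur in many documents, so their IDF (and hence their score) stays low even when their term frequency is high.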

Keyword Extraction and Indexing: Applications

  • Content Summarization

  • Document Classification and Categorization

  • Customer Feedback Analysis

  • Business Intelligence (BI)

  • Academic Research

  • Fraud Detection and Security

Sentiment Analysis (Lexicon-Based)

  • Uses predefined lexical resources (e.g., positive/negative word lists) to infer sentiment polarity.

Example: Counting words like “happy”, “great”, “terrible” to score movie reviews.
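The counting scheme above can be sketched in a few lines. The two word sets are tiny illustrative stand-ins for a real sentiment lexicon, and the whitespace tokenization ignores punctuation and negation ("not great") entirely.

```python
# Hypothetical miniature lexicon; real resources list thousands of scored words.
POSITIVE = {"happy", "great", "wonderful", "love"}
NEGATIVE = {"terrible", "awful", "boring", "hate"}

def lexicon_sentiment(review):
    """Score a review as positive/negative/neutral by counting lexicon hits."""
    tokens = review.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```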

Sentiment Analysis (Lexicon-Based): Applications

  • Customer Feedback Analysis

  • Market Research

  • Crisis Management

  • Behavioral Research

  • Public Health Monitoring

  • Content Filtering/Moderation

  • Accessibility in Low-Resource Languages

Information Retrieval (IR) and Search Optimization

  • Lexical preprocessing (stopword removal, case folding, stemming) improves precision and recall in document retrieval.

Example: Google search uses normalized tokens for query expansion.
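The preprocessing pipeline can be sketched end to end; the stopword list and the three-suffix stripper here are toy assumptions chosen only to make the matching step visible.

```python
# Hypothetical miniature stopword list (real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "of", "for", "in", "on", "and", "at"}

def preprocess(text):
    """Case folding -> tokenization -> stopword removal -> toy suffix stripping."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    stripped = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stripped.append(t)
    return stripped

def matches(query, document):
    """True if query and document share at least one normalized token."""
    return bool(set(preprocess(query)) & set(preprocess(document)))
```

Without the normalization steps, "running dogs" and "a dog barks" would share no surface token at all; with them, both contain the token "dog".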

Plagiarism Detection

  • Measures lexical overlap, n-gram similarity, and paraphrasing patterns across documents.

Example: Detecting copied or rephrased content using token similarity.
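Word n-gram overlap is commonly measured with Jaccard similarity, sketched below; the choice of word trigrams (n=3) is illustrative, and real detectors add hashing and paraphrase handling on top.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams (as tuples) in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard similarity of the two texts' word n-gram sets, in [0, 1]."""
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```

Identical documents score 1.0, documents with no shared trigram score 0.0, and partially copied passages fall in between.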

Plagiarism Detection: Applications

  • Grading Student Work

  • Formative Feedback

  • AI Writing Detection

  • Protecting Copyright

  • Research Ethics

  • Code Plagiarism

Speech Transcription Post-Processing

  • Lexical normalization is applied after speech-to-text to correct case, punctuation, and numerals.

Example: Converting “three point five percent” → “3.5%”.
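The "three point five percent" example can be handled by a small rule-based rewriter. This is a toy sketch only: it covers single spoken digits, "point", and "percent", whereas production inverse text normalization handles full number grammars, dates, currencies, and more.

```python
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize_numbers(transcript):
    """Toy rewriter for spoken digits, 'point', and 'percent' (illustrative only)."""
    result, prev_glue = "", False
    for tok in transcript.lower().split():
        if tok in DIGITS:
            piece, glue = DIGITS[tok], True   # digits concatenate with neighbors
        elif tok == "point":
            piece, glue = ".", True
        elif tok == "percent":
            piece, glue = "%", True
        else:
            piece, glue = tok, False          # ordinary words keep their spaces
        # Insert a space unless both this piece and the previous one "glue" together.
        if result and not (glue and prev_glue):
            result += " "
        result += piece
        prev_glue = glue
    return result
```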

Text Simplification

  • Uses lexical richness, average word length, and rare-word frequency to measure text difficulty.

Example: Creating simplified text for children or second-language learners.
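The three difficulty signals named above can be computed directly; the `common_words` set stands in for a real frequency list (e.g. the few thousand most frequent words of a corpus), which is an assumption of this sketch.

```python
def difficulty_features(text, common_words):
    """Crude lexical difficulty measures: average word length, rare-word ratio,
    and type-token ratio (a simple proxy for lexical richness)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return {
        "avg_word_len": sum(len(t) for t in tokens) / len(tokens),
        "rare_ratio": sum(t not in common_words for t in tokens) / len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }
```

A simplification system can use such scores both to decide which sentences need rewriting and to check that the rewritten output is actually easier.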

Text Simplification: Applications

  • Simplifying Medical Information

  • Text Summarization and Question Answering

  • Government and Legal Documents

  • Enhancing Accessibility and Inclusivity
