Introduction to Speech & Natural Language Processing

Lecture 5

Lexical Processing Applications

Krishnendu Ghosh

Authorship Attribution / Stylometry

  • Uses lexical features like word frequency, function-word usage, average sentence length, vocabulary richness, etc.

  • Helps identify or verify the author of a text (e.g., literature, fake news detection).

Example: Distinguishing Shakespeare from Marlowe using lexical statistics.
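The comparison above can be sketched as a tiny profile-and-compare routine. This is a minimal sketch only: the short function-word list, the whitespace tokenization, and the cosine comparison are illustrative choices, not the method any particular stylometry system uses.

```python
from collections import Counter
import math

# Illustrative (hypothetical) set of English function words; real studies use hundreds.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "it", "is", "was", "for"]

def function_word_profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    total = len(tokens) or 1
    counts = Counter(tokens)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attribute(disputed, candidates):
    """Attribute a disputed text to the candidate author with the closest profile."""
    d = function_word_profile(disputed)
    return max(candidates,
               key=lambda name: cosine_similarity(d, function_word_profile(candidates[name])))
```

In practice the profiles would be built from large samples of each author's known writing, with proper tokenization and many more features (sentence length, vocabulary richness, etc.).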

Authorship Attribution / Stylometry: Applications

  • Literary and historical analysis

  • Forensic analysis

  • Plagiarism detection

  • Combating misinformation

Language Identification

  • Uses token patterns, character n-grams, and word frequency distributions to determine the language of a text snippet.

Example: Distinguishing Kannada vs. Hindi vs. English tweets.
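Character n-gram language identification can be sketched as below. The single-sentence "profiles" and the simple overlap score are toy assumptions (real systems train on large corpora and work on each script directly); Hindi is shown transliterated purely to keep the example ASCII.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts, with padding spaces at the edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(snippet, profiles):
    """Pick the language whose profile shares the most n-gram mass with the snippet."""
    snip = char_ngrams(snippet)
    return max(profiles,
               key=lambda lang: sum(min(snip[g], profiles[lang][g]) for g in snip))

# Toy one-sentence profiles (hypothetical; real profiles come from large corpora).
profiles = {
    "english": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "hindi": char_ngrams("kya aap mujhe bata sakte hain ki yeh kahan hai"),
}
```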

Language Identification: Applications

  • Automatic Translation

  • Content Filtering

  • Information Retrieval

  • Speech Recognition

Spell Checking and Correction

  • Relies on tokenization and edit distance algorithms to suggest possible corrections.

Example: Correcting “recieve” → “receive” using minimum edit distance.
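The "recieve" → "receive" case can be reproduced with the standard dynamic-programming (Levenshtein) edit distance; the three-word vocabulary below is purely illustrative.

```python
def min_edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

def suggest(word, vocabulary):
    """Suggest the vocabulary word closest to the misspelling."""
    return min(vocabulary, key=lambda w: min_edit_distance(word, w))
```

Note that plain Levenshtein scores the "ie"/"ei" swap as two substitutions; a Damerau variant that allows transpositions would score it as one edit.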

Spell Checking and Correction: Applications

  • Word Processors

  • Content Creation

  • Search Engines

  • Learning Aid

Text Normalization for Search Engines

  • Converts user queries and documents into comparable lexical forms (e.g., stemming, lemmatization).

Example: A search for “running shoes” also matches documents containing “run” or “runners”, because stemming reduces all three forms to a shared stem.
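A toy suffix stripper shows the idea; the suffix list below is deliberately cherry-picked so that "running" and "runners" both reduce to "run", and it is far cruder than a real stemmer such as Porter's.

```python
def toy_stem(word):
    """Toy suffix stripping (illustrative only; not the Porter algorithm)."""
    for suffix in ("ning", "ners", "ing", "ers", "ed", "s"):
        # Keep at least three characters of stem to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Case-fold and stem every token so queries and documents compare equal."""
    return [toy_stem(token) for token in text.lower().split()]
```

Lemmatization would instead map each token to its dictionary form ("better" → "good"), which requires a vocabulary rather than suffix rules.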

Keyword Extraction and Indexing

  • Identifies most frequent or most informative tokens (TF-IDF, RAKE, etc.).

Example: Auto-generating tags for research papers or news articles.
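A minimal TF-IDF ranking over a tiny corpus can be sketched as follows; the tokenization and the particular IDF form (log of corpus size over document frequency) are simple illustrative choices among several common variants.

```python
import math
from collections import Counter

def tf_idf_keywords(docs, doc_index, top_k=3):
    """Rank tokens of docs[doc_index] by TF-IDF against the small corpus `docs`."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter()                      # document frequency of each token
    for toks in tokenized:
        df.update(set(toks))
    tf = Counter(tokenized[doc_index])  # term frequency in the target document
    total = len(tokenized[doc_index])
    scores = {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Common words like "the" occur in many documents, so their IDF (and hence their score) stays low even when their term frequency is high.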

Keyword Extraction and Indexing: Applications

  • Content Summarization

  • Document Classification and Categorization

  • Customer Feedback Analysis

  • Business Intelligence (BI)

  • Academic Research

  • Fraud Detection and Security

Sentiment Analysis (Lexicon-Based)

  • Uses predefined lexical resources (e.g., positive/negative word lists) to infer sentiment polarity.

Example: Counting words like “happy”, “great”, “terrible” to score movie reviews.
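The counting scheme above can be sketched in a few lines. The two word sets are tiny illustrative stand-ins for a real sentiment lexicon, and the whitespace tokenization ignores punctuation and negation ("not great") entirely.

```python
# Hypothetical miniature lexicon; real resources list thousands of scored words.
POSITIVE = {"happy", "great", "wonderful", "love"}
NEGATIVE = {"terrible", "awful", "boring", "hate"}

def lexicon_sentiment(review):
    """Score a review as positive/negative/neutral by counting lexicon hits."""
    tokens = review.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```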

Sentiment Analysis (Lexicon-Based): Applications

  • Customer Feedback Analysis

  • Market Research

  • Crisis Management

  • Behavioral Research

  • Public Health Monitoring

  • Content Filtering/Moderation

  • Accessibility in Low-Resource Languages

Information Retrieval (IR) and Search Optimization

  • Lexical preprocessing (stopword removal, case folding, stemming) improves precision and recall in document retrieval.

Example: Google search uses normalized tokens for query expansion.
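The preprocessing pipeline can be sketched end to end; the stopword list and the three-suffix stripper here are toy assumptions chosen only to make the matching step visible.

```python
# Hypothetical miniature stopword list (real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "of", "for", "in", "on", "and", "at"}

def preprocess(text):
    """Case folding -> tokenization -> stopword removal -> toy suffix stripping."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    stripped = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stripped.append(t)
    return stripped

def matches(query, document):
    """True if query and document share at least one normalized token."""
    return bool(set(preprocess(query)) & set(preprocess(document)))
```

Without the normalization steps, "running dogs" and "a dog barks" would share no surface token at all; with them, both contain the token "dog".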

Plagiarism Detection

  • Measures lexical overlap, n-gram similarity, and paraphrasing patterns across documents.

Example: Detecting copied or rephrased content using token similarity.
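Word n-gram overlap is commonly measured with Jaccard similarity, sketched below; the choice of word trigrams (n=3) is illustrative, and real detectors add hashing and paraphrase handling on top.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams (as tuples) in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard similarity of the two texts' word n-gram sets, in [0, 1]."""
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```

Identical documents score 1.0, documents with no shared trigram score 0.0, and partially copied passages fall in between.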

Plagiarism Detection: Applications

  • Grading Student Work

  • Formative Feedback

  • AI Writing Detection

  • Protecting Copyright

  • Research Ethics

  • Code Plagiarism

Speech Transcription Post-Processing

  • Lexical normalization is applied after speech-to-text to correct case, punctuation, and numerals.

Example: Converting “three point five percent” → “3.5%”.
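The "three point five percent" example can be handled by a small rule-based rewriter. This is a toy sketch only: it covers single spoken digits, "point", and "percent", whereas production inverse text normalization handles full number grammars, dates, currencies, and more.

```python
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize_numbers(transcript):
    """Toy rewriter for spoken digits, 'point', and 'percent' (illustrative only)."""
    result, prev_glue = "", False
    for tok in transcript.lower().split():
        if tok in DIGITS:
            piece, glue = DIGITS[tok], True   # digits concatenate with neighbors
        elif tok == "point":
            piece, glue = ".", True
        elif tok == "percent":
            piece, glue = "%", True
        else:
            piece, glue = tok, False          # ordinary words keep their spaces
        # Insert a space unless both this piece and the previous one "glue" together.
        if result and not (glue and prev_glue):
            result += " "
        result += piece
        prev_glue = glue
    return result
```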

Text Simplification

  • Uses lexical richness, average word length, and rare-word frequency to measure text difficulty.

Example: Creating simplified text for children or second-language learners.
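The three difficulty signals named above can be computed directly; the `common_words` set stands in for a real frequency list (e.g. the few thousand most frequent words of a corpus), which is an assumption of this sketch.

```python
def difficulty_features(text, common_words):
    """Crude lexical difficulty measures: average word length, rare-word ratio,
    and type-token ratio (a simple proxy for lexical richness)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return {
        "avg_word_len": sum(len(t) for t in tokens) / len(tokens),
        "rare_ratio": sum(t not in common_words for t in tokens) / len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }
```

A simplification system can use such scores both to decide which sentences need rewriting and to check that the rewritten output is actually easier.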

Text Simplification: Applications

  • Simplifying Medical Information

  • Text Summarization and Question Answering

  • Government and Legal Documents

  • Enhancing Accessibility and Inclusivity
