1 of 38

Tokenizing: Parsing Documents

Marc Ratkovic

Chair of Social Data Science

Professor of Political Science and Data Science

University of Mannheim

2 of 38

Agenda

  • Introduction
  • Overview of Regular Expressions
  • Basic Regular Expression Patterns
  • Disjunction, Grouping, and Precedence
  • A Simple Regular Expression Example
  • Advanced Regular Expression Operators
  • Complex Regular Expression Example
  • Substitution and Capture Groups
  • Lookahead Assertions
  • Understanding Words in NLP
  • Introduction to Corpora
  • Unix Tools for Word Tokenization
  • Word Tokenization Techniques
  • Word Normalization Techniques
  • Sentence Segmentation
  • Minimum Edit Distance
  • Minimum Edit Distance Algorithm
  • Applications of Edit Distance
  • Conclusion

3 of 38

Introduction

This chapter explores essential tools in natural language processing (NLP): regular expressions, text normalization, and edit distance. Regular expressions are powerful patterns used to match strings and play a crucial role in searching and manipulating text efficiently. Text normalization is the process of converting text into a standard format, making it easier to analyze. Edit distance quantifies how similar two strings are by counting the minimum number of operations (insertions, deletions, substitutions) required to transform one string into another. The historical roots of these ideas are illustrated by ELIZA, an early NLP system that used simple pattern matching to engage users in conversation, laying the groundwork for modern conversational agents.

4 of 38

ELIZA

5 of 38

Overview of Regular Expressions

Definition and Importance

Basic Patterns

Examples of Matching

Regular expressions (regex) are sequences of characters that define search patterns. They are essential in natural language processing (NLP) for text manipulation, data extraction, and pattern recognition.

Basic regex patterns include concatenation, where characters are placed in sequence, such as /hello/. This matches the exact string 'hello'. Regular expressions are case-sensitive, differentiating between 'A' and 'a'.

To match digits, the character class /[0-9]/ is used, which matches any single digit character. Similarly, the character class /[A-Za-z]/ matches any single upper- or lowercase letter.
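
As a small illustration, these basic patterns can be tried with Python's re module (the sample strings below are invented for this sketch):

    import re

    # /hello/ matches the exact, case-sensitive string 'hello'.
    print(re.search(r"hello", "say hello world"))   # match at position 4
    print(re.search(r"hello", "say Hello world"))   # None: regex is case-sensitive

    # /[0-9]/ matches any single digit; /[A-Za-z]/ matches any single letter.
    print(re.findall(r"[0-9]", "Room 42"))          # ['4', '2']
    print(re.findall(r"[A-Za-z]", "R2-D2"))         # ['R', 'D']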

6 of 38

Basic Regular Expression Patterns

Concatenation

Kleene Star (*)

Kleene Plus (+)

Anchors

Concatenation is the simplest form of a regular expression, where characters are placed in sequence. For example, the regex /abc/ matches the string 'abc'.

The Kleene star allows for matching zero or more occurrences of the preceding character or pattern. For instance, /a*/ matches '', 'a', 'aa', etc.

The Kleene plus indicates one or more occurrences of the previous character. For example, /a+/ matches 'a', 'aa', but not ''.

Anchors are special characters that specify positions in a string. The caret (^) matches the start of a line, while the dollar sign ($) matches the end of a line.
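
A brief Python sketch of the Kleene operators and anchors (the examples are illustrative):

    import re

    # Kleene star: zero or more of the preceding character.
    print(re.fullmatch(r"a*", "") is not None)        # True
    print(re.fullmatch(r"a*", "aaa") is not None)     # True

    # Kleene plus: one or more of the preceding character.
    print(re.fullmatch(r"a+", "") is not None)        # False
    print(re.fullmatch(r"a+", "aa") is not None)      # True

    # Anchors: ^ matches the start of a line, $ matches the end.
    print(re.search(r"^The", "The cat") is not None)  # True
    print(re.search(r"cat$", "The cat") is not None)  # True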

7 of 38

Disjunction, Grouping, and Precedence

Disjunction allows matching of alternatives using the pipe symbol '|', e.g., /cat|dog/ matches either 'cat' or 'dog'.

Grouping is done with parentheses '()' to apply operators to entire expressions, e.g., /(abc|def)g/ matches 'abcg' or 'defg'.

Operator precedence determines how an expression is parsed: parentheses bind most tightly, followed by the counters '*' and '+', then sequences, and finally disjunction '|'. For example, /a*|b/ is read as /(a*)|b/, because '*' binds more tightly than '|'.
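
A short Python sketch of disjunction, grouping, and precedence (the strings are illustrative):

    import re

    # Disjunction: /cat|dog/ matches either alternative.
    print(re.findall(r"cat|dog", "cat and dog"))           # ['cat', 'dog']

    # Grouping: the disjunction applies only inside the parentheses.
    print(re.findall(r"(?:abc|def)g", "abcg defg xyzg"))   # ['abcg', 'defg']

    # Precedence: /a*|b/ is read as (a*)|b, so the whole string 'ab' does not match.
    print(re.fullmatch(r"a*|b", "aaa") is not None)        # True
    print(re.fullmatch(r"a*|b", "b") is not None)          # True
    print(re.fullmatch(r"a*|b", "ab") is not None)         # False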

8 of 38

A Simple Regular Expression Example

Basic Regex for 'the'

Case Insensitive Matching

Word Boundaries

The regular expression /the/ matches the lowercase word 'the' in any string. For example, it will find 'the cat sat on the mat'.

Using the regex /[tT]he/, we can match both 'the' and 'The'. This is important for capturing instances at the beginning of sentences.

The expression /\b[tT]he\b/ ensures that 'the' is matched as a whole word, preventing false positives like 'there' or 'other'.
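
A minimal Python check of the three patterns on a made-up sentence:

    import re

    text = "The other day the weather was there."

    # /the/ matches lowercase 'the' anywhere, including inside 'other', 'weather', 'there'.
    print(len(re.findall(r"the", text)))        # 4

    # /[tT]he/ also catches the capitalized 'The' at the start of the sentence.
    print(len(re.findall(r"[tT]he", text)))     # 5

    # /\b[tT]he\b/ restricts matches to the standalone word.
    print(re.findall(r"\b[tT]he\b", text))      # ['The', 'the']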

9 of 38

Getting “the” with regex

10 of 38

Introduction to Corpora

Understanding Corpora in NLP

Examples of Notable Corpora

  • Corpora are large collections of written or spoken texts used for linguistic analysis.
  • They are essential for training and evaluating natural language processing models.
  • Corpora help in identifying language patterns, frequencies, and structures.
  • Different corpora can focus on specific genres or dialects, enhancing linguistic research.
  • They provide a basis for statistical analysis and machine learning in NLP.
  • The Brown Corpus: A million-word collection of diverse texts from various genres, created in the 1960s.
  • Google N-grams: A vast dataset offering insights into word usage and trends over time, with billions of word sequences.
  • The Switchboard Corpus: A collection of telephone conversations, useful for studying spoken language patterns.
  • The Penn Treebank: A corpus with syntactic annotations, valuable for parsing and grammar research.

11 of 38

12 of 38

Introduction to Text Normalization

Text normalization is the process of transforming text into a consistent format that is more suitable for analysis. In Natural Language Processing (NLP), it is crucial because it helps in improving the accuracy of text-based algorithms by reducing variability in the data. This includes handling variations in spelling, punctuation, and grammatical structures, which can otherwise lead to inconsistencies and misinterpretations in data processing. By normalizing text, NLP systems can better understand and process language, ultimately enhancing machine learning models and improving the performance of applications such as sentiment analysis, information retrieval, and language translation.

13 of 38

Unix Tools for Word Tokenization

The 'tr' command translates or deletes characters. Replacing every run of non-alphabetic characters with a newline effectively tokenizes the text, placing each word on its own line.

The 'sort' command then orders the tokens alphabetically, so that identical words end up on adjacent lines, ready for subsequent processing.

Finally, the 'uniq -c' command counts the occurrences of each unique word, producing a frequency distribution of the words in the text.
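
The same word-frequency pipeline can be mirrored in Python as a rough sketch (the filename corpus.txt is a placeholder):

    import re
    from collections import Counter

    # Read the text to be tokenized (placeholder filename).
    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read().lower()

    # Like tr: split on runs of non-alphabetic characters, one word per item.
    words = [w for w in re.split(r"[^a-z]+", text) if w]

    # Like sort | uniq -c: count occurrences of each unique word.
    for word, count in Counter(words).most_common(10):
        print(count, word)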

14 of 38

Types of Text Normalization

Tokenization

Lemmatization

Stemming

Edit Distance

Tokenization breaks text into smaller units, such as words or phrases. It prepares the text for further processing, identifying emoticons and hashtags as tokens.

Lemmatization reduces words to their base or root form. For example, 'ran' becomes 'run', ensuring that different forms of a word are treated as the same item.

Stemming removes prefixes and suffixes from words to obtain the root form. For instance, 'unhappy' is reduced to 'happy', though it may lose some grammatical context.

Edit distance quantifies the difference between two strings by counting the minimum number of operations (insertions, deletions, substitutions) required to transform one string into another.

15 of 38

Tokenization

What is Tokenization?

  • Tokenization is the process of breaking down text into smaller units, called tokens, which can be words, phrases, or symbols.
  • In Natural Language Processing (NLP), tokenization is crucial as it transforms raw text into a format that can be analyzed and processed more easily.
  • Tokenization enables the identification and separation of meaningful elements such as words, punctuation, emoticons, and hashtags, facilitating further analysis.
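
As a rough sketch, a simple regex-based tokenizer might look like the following; the pattern is an illustrative assumption, not a full tokenizer:

    import re

    def tokenize(text):
        # Keep hashtags and simple emoticons as single tokens, then words
        # (with internal apostrophes), then any other non-space symbol.
        pattern = r"#\w+|[:;][-']?[)(DP]|\w+(?:'\w+)?|[^\w\s]"
        return re.findall(pattern, text)

    print(tokenize("Loving this lecture :) #NLP isn't hard!"))
    # ['Loving', 'this', 'lecture', ':)', '#NLP', "isn't", 'hard', '!']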

16 of 38

Lemmatization

Understanding Lemmatization

  • Lemmatization is the process of converting a word to its base or root form, known as a lemma. This process ensures that different forms of a word are treated as a single item in text analysis.
  • Unlike stemming, which may produce non-existent words, lemmatization results in actual dictionary entries, ensuring grammatical correctness.
  • For example, the verb 'ran' is lemmatized to 'run', capturing its base form. Similarly, 'better' is lemmatized to 'good'.
  • Lemmatization improves the accuracy of natural language processing tasks by reducing inflected words to their common base form, aiding in semantic understanding.
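
A hedged sketch using NLTK's WordNetLemmatizer (this assumes NLTK is installed and the WordNet data has been downloaded; the part of speech must be supplied for verbs and adjectives):

    from nltk.stem import WordNetLemmatizer   # requires nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()

    # The default part of speech is noun; pass pos='v' for verbs, pos='a' for adjectives.
    print(lemmatizer.lemmatize("ran", pos="v"))      # run
    print(lemmatizer.lemmatize("better", pos="a"))   # good
    print(lemmatizer.lemmatize("cars"))              # car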

17 of 38

Stemming

Understanding Stemming

  • Stemming is a text normalization process that reduces words to their base or root form, stripping suffixes and prefixes to achieve a simplified version of the word.
  • Unlike lemmatization, which considers the context and converts a word to its dictionary form, stemming may not produce real words. For example, 'running' becomes 'run' and 'unhappy' becomes 'happy'.
  • Stemming is often faster than lemmatization as it employs simple algorithms like the Porter Stemmer, which removes common suffixes from words.
  • While stemming might lead to the loss of grammatical correctness, it helps in grouping similar words for tasks like information retrieval and text mining.

18 of 38

Edit Distance

Definition and Purpose

Examples of Edit Distance

  • Edit distance quantifies how dissimilar two strings are.
  • It measures the minimum number of operations needed to transform one string into another.
  • Operations include deletion, insertion, and substitution of characters.
  • Transforming 'kitten' into 'sitting' requires 3 operations: substitute 'k' with 's', substitute 'e' with 'i', and insert 'g'.
  • Changing 'flaw' into 'lawn' requires 2 operations: delete the initial 'f' and insert a final 'n'.
  • Converting 'book' into 'back' requires 2 operations: substitute the first 'o' with 'a' and the second 'o' with 'c'.

19 of 38

20 of 38

Word Tokenization Techniques

Top-Down Tokenization

Byte-Pair Encoding (BPE)

Advantages of BPE

Challenges in Tokenization

Top-down tokenization relies on predefined rules to segment text into words. It effectively captures punctuation and special characters, but can struggle with edge cases.

BPE is a bottom-up approach that merges the most frequent pairs of characters or subwords iteratively. It reduces vocabulary size and handles unknown words well.

BPE handles rare words efficiently by breaking them into subword units. This allows for better generalization across different text domains.

Both techniques can face challenges: top-down methods may miss context-specific nuances, while BPE requires a large corpus for effective training and may result in longer sequences.
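
The merge loop at the heart of BPE training can be sketched in a few lines of Python; the toy vocabulary, the '_' end-of-word marker, and the number of merges below are illustrative assumptions rather than a full training setup:

    import re
    from collections import Counter

    def pair_counts(vocab):
        # Count adjacent symbol pairs over a vocabulary of space-separated symbols.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Replace every occurrence of the pair with a single merged symbol.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus: each word is spelled as characters plus an end-of-word marker.
    vocab = {"l o w _": 5, "l o w e s t _": 2, "n e w e r _": 6, "w i d e r _": 3}

    for step in range(5):
        pairs = pair_counts(vocab)
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        print("merge", step + 1, ":", best)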

21 of 38

22 of 38

23 of 38

24 of 38

25 of 38


26 of 38

Regular Expressions

Definition and Origin

Applications and Examples

  • Regular expressions (regex) are sequences of characters that define search patterns.
  • Originated from formal language theory in the 1950s and 1960s.
  • Initially developed for pattern matching in text processing within UNIX systems.
  • Used extensively in text search, data validation, and text manipulation tasks.
  • Commonly applied in tokenization to effectively segment text.
  • Example of searching for the word 'the':
    - Simple search: /the/ (matches 'the' anywhere in the text).
    - Case-insensitive search: /[tT]he/ (matches 'the' or 'The').

27 of 38

Words in NLP

Corpus

Utterance

Lemma and Tokens

A corpus (plural: corpora) is a computer-readable collection of texts or speech samples, used for linguistic analysis and natural language processing tasks.

An utterance is the spoken correlate of a sentence; it may include disfluencies such as fillers ('umm') and repetitions, which are relevant in speech processing.

A lemma is the shared base form of a set of word forms, while word tokens are the individual occurrences of words in a text; comparing the number of tokens with the number of distinct word types indicates frequency and diversity in vocabulary.
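
A tiny sketch of the token/type distinction on a toy sentence:

    from collections import Counter

    words = "the cat sat on the mat".split()

    print(len(words))        # 6 word tokens: every occurrence counts
    print(len(set(words)))   # 5 word types: 'the' appears twice
    print(Counter(words))    # frequency of each type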

28 of 38

Steps in Text Normalization

Tokenization

Normalizing Word Fragments

Segmenting Sentences

Final Review and Quality Check

Tokenization segments text into individual words or phrases, transforming a continuous string into manageable units. For example, simple Unix tools or regular expressions can extract meaningful tokens from sentences.

This step standardizes word forms to a common format, including lemmatization and stemming. For instance, 'running' is normalized to 'run'.

Segmenting sentences breaks text into individual sentences, crucial for understanding structure. For example, this aids in analyzing Shakespeare's poetic works.

In the final step, a review of the normalized text ensures accuracy and consistency, checking for errors in tokenization or normalization.

[Pipeline diagram: tokens list, token map, token count → normalized words list, lemmas, stemmed words → segmented sentences, sentence map, sentence count → quality report, error log → final normalized text.]
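
The sentence-segmentation step can be roughed out with a regular expression; the rule below (split after '.', '!' or '?' followed by whitespace) is a simplifying assumption and, as the example shows, mis-handles abbreviations:

    import re

    text = "Mr. Smith arrived. He sat down! Did anyone notice? Probably not."

    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        print(sentence)
    # Note the error: the abbreviation "Mr." is wrongly treated as a boundary.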

29 of 38

Difficulties in Text Normalization

Handling Punctuation

Special Characters and URLs

Contractions and Multi-word Expressions

Punctuation within words (e.g., Ph.D., AT&T, Cap'n) complicates tokenization, requiring special rules for accurate segmentation.

Special formats such as dates (e.g., 6/1/23), URLs, and hashtags pose challenges for normalization, necessitating specific parsing methods.

Expanding contractions (e.g., don't to do not) and recognizing multi-word expressions (e.g., New York City) are crucial for accurate normalization.
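
These difficulties motivate more elaborate tokenization rules. The verbose regex below is a hedged sketch of such rules; the specific alternatives are illustrative, not a complete solution:

    import re

    pattern = r"""(?x)            # verbose mode: whitespace and comments ignored
          (?:[A-Za-z]\.)+         # letter-period abbreviations, e.g. U.S.A.
        | \d+/\d+/\d+             # dates such as 6/1/23
        | https?://\S+            # URLs
        | \#\w+                   # hashtags
        | \w+(?:[-'&]\w+)*        # words with internal hyphens/apostrophes/ampersands
        | [^\w\s]                 # any other single non-space character
    """

    text = "The Cap'n of AT&T visited the U.S.A. on 6/1/23; see https://example.org #NLP"
    print(re.findall(pattern, text))
    # Keeps Cap'n, AT&T, U.S.A., the date, the URL, and the hashtag as single tokens.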

30 of 38

Porter Stemmer

Understanding the Porter Stemmer

  • The Porter Stemmer is an algorithm used for stemming words by removing common morphological and inflectional endings in English.
  • It processes words to derive their base or root form, which helps in reducing inflected words to a common base form.
  • For example, 'running' is reduced to 'run' and 'happiness' to 'happi'; irregular forms such as 'ran' are left unchanged, since the algorithm only strips affixes.
  • While stemming can enhance efficiency in text processing, it may also lead to the loss of grammatical correctness, making sentences harder to interpret.
  • The algorithm is efficient but can sometimes produce non-words as outputs, which may affect the accuracy of further linguistic analyses.
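
A quick illustration with NLTK's implementation of the Porter stemmer (assumes NLTK is installed):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    for word in ["running", "happiness", "caresses", "ran"]:
        print(word, "->", stemmer.stem(word))
    # running -> run, happiness -> happi, caresses -> caress,
    # while the irregular form 'ran' is left unchanged.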

31 of 38

32 of 38

String Similarity

Understanding String Similarity

  • String similarity refers to the measure of likeness between two strings, taking into account their characters, order, and structure.
  • Edit distance is a key concept in string similarity, quantifying the minimum number of operations (insertions, deletions, substitutions) needed to transform one string into another.
  • For example, the edit distance between 'graffe' and 'giraffe' is 1: inserting an 'i' after the initial 'g' transforms one string into the other.
  • String similarity is crucial in applications such as spell checking, plagiarism detection, and natural language processing where variations in text need to be understood.

33 of 38

Minimum Edit Distance

Concept of Minimum Edit Distance

Importance in NLP

Mathematical Representation

Minimum edit distance quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into another. Common operations include insertion, deletion, and substitution.

Minimum edit distance is crucial for various NLP tasks, notably in spelling correction. It helps identify the closest correct word by measuring the cost to change a misspelled word into a valid one.

The minimum edit distance, D(i, j), can be calculated as follows:

D(i, j) = min{ D(i-1, j) + 1                                    (deletion),
               D(i, j-1) + 1                                    (insertion),
               D(i-1, j-1) + cost(source[i], target[j]) }       (substitution)

where cost(source[i], target[j]) = 0 if source[i] = target[j], and 1 otherwise.

34 of 38

35 of 38

Minimum Edit Distance Algorithm

Overview of the Minimum Edit Distance Algorithm

Mathematical Representation

  • The minimum edit distance quantifies the similarity between two strings.
  • It measures how many operations are needed to transform one string into another.
  • Operations include insertion, deletion, and substitution of characters.
  • Let X and Y be the source and target strings respectively.
  • Define D[i, j] as the edit distance between the first i characters of X and the first j characters of Y.
  • The recurrence relation is: D[i, j] = min{ D[i-1, j] + 1, D[i, j-1] + 1, D[i-1, j-1] + cost(X[i], Y[j]) }
  • Here cost(X[i], Y[j]) = 0 if X[i] = Y[j] and 2 otherwise: in this variant a substitution is charged as a deletion plus an insertion (using a substitution cost of 1 instead gives the classical Levenshtein distance). A minimal implementation sketch follows this list.
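
A minimal implementation sketch of this recurrence; the sub_cost parameter is added here so that both the substitution-cost-2 variant above and unit-cost Levenshtein distance can be computed:

    def min_edit_distance(source, target, sub_cost=2):
        # D[i][j] = edit distance between source[:i] and target[:j].
        n, m = len(source), len(target)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0] = i                         # delete all of source[:i]
        for j in range(1, m + 1):
            D[0][j] = j                         # insert all of target[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if source[i - 1] == target[j - 1] else sub_cost
                D[i][j] = min(D[i - 1][j] + 1,          # deletion
                              D[i][j - 1] + 1,          # insertion
                              D[i - 1][j - 1] + cost)   # substitution (or match)
        return D[n][m]

    print(min_edit_distance("intention", "execution"))           # 8 with substitution cost 2
    print(min_edit_distance("kitten", "sitting", sub_cost=1))    # 3 (Levenshtein distance)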

36 of 38

37 of 38

Applications of Edit Distance

Spelling Correction

Search Query Matching

Coreference Resolution

Text Similarity Measurement

Minimum edit distance is widely used in spelling correction applications. By calculating the distance between a misspelled word and potential corrections, systems can suggest the closest match.

Search engines utilize edit distance algorithms to improve search results. It helps in matching queries against indexed terms by providing relevant suggestions based on similar spellings.

In natural language processing, edit distance assists in coreference resolution. It helps determine if two phrases refer to the same entity by measuring the similarity between them.

Edit distance is employed in text similarity tasks, enabling systems to quantify the similarity between documents. This is useful for plagiarism detection and duplicate content identification.

38 of 38

Conclusion

This chapter highlighted the critical role of regular expressions, text normalization, and edit distance in natural language processing (NLP). Regular expressions serve as powerful tools for pattern matching, enabling efficient text search and manipulation. Text normalization processes, including tokenization, lemmatization, and stemming, are essential for standardizing text data, which enhances the accuracy of NLP models. The minimum edit distance algorithm is crucial for tasks such as spell checking and string comparison, allowing for the quantification of differences between strings. Together, these concepts form the backbone of many NLP applications, facilitating better understanding and processing of human language.