Tokenizing: Parsing Documents
Marc Ratkovic
Chair of Social Data Science
Professor of Political Science and Data Science
University of Mannheim
Agenda
Slides based on Speech and Language Processing by Martin and Jurafsky
Introduction
This chapter explores essential tools in natural language processing (NLP): regular expressions, text normalization, and edit distance. Regular expressions are patterns used to match strings and play a crucial role in searching and manipulating text efficiently. Text normalization is the process of converting text into a standard format, making it easier to analyze. Edit distance quantifies how similar two strings are by counting the minimum number of operations (insertions, deletions, substitutions) required to transform one string into another. These ideas have a long history: ELIZA, an early NLP system, used simple pattern matching over user input to carry on a conversation, laying the groundwork for modern conversational agents.
ELIZA
Overview of Regular Expressions
Definition and Importance
Basic Patterns
Examples of Matching
Regular expressions (regex) are sequences of characters that define search patterns. They are essential in natural language processing (NLP) for text manipulation, data extraction, and pattern recognition.
Basic regex patterns include concatenation, where characters are placed in sequence, such as /hello/. This matches the exact string 'hello'. Regular expressions are case-sensitive, differentiating between 'A' and 'a'.
To match digits, the character class /[0-9]/ is used, which matches any single digit. Similarly, the class /[A-Za-z]/ matches any single upper- or lowercase letter.
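As a quick illustration, here is a minimal sketch using Python's built-in re module (the example strings are invented for demonstration):

    import re

    text = "hello World, A1 and a1"

    # Concatenation: /hello/ matches the exact, case-sensitive string 'hello'.
    print(re.findall(r"hello", text))        # ['hello']
    print(re.findall(r"Hello", text))        # []  (regex is case-sensitive)

    # Character classes: /[0-9]/ matches any single digit,
    # /[A-Za-z]/ matches any single upper- or lowercase letter.
    print(re.findall(r"[0-9]", text))        # ['1', '1']
    print(re.findall(r"[A-Za-z]", "A1 a?"))  # ['A', 'a']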
Basic Regular Expression Patterns
Concatenation
Kleene Star (*)
Kleene Plus (+)
Anchors
Concatenation is the simplest form of a regular expression, where characters are placed in sequence. For example, the regex /abc/ matches the string 'abc'.
The Kleene star allows for matching zero or more occurrences of the preceding character or pattern. For instance, /a*/ matches '', 'a', 'aa', etc.
The Kleene plus indicates one or more occurrences of the previous character. For example, /a+/ matches 'a', 'aa', but not ''.
Anchors are special characters that specify positions in a string. The caret (^) matches the start of a line, while the dollar sign ($) matches the end of a line.
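The following minimal Python sketch illustrates these operators (the strings are hypothetical examples):

    import re

    # Kleene star: zero or more occurrences of 'a'.
    print(re.fullmatch(r"a*", "") is not None)      # True
    print(re.fullmatch(r"a*", "aaa") is not None)   # True

    # Kleene plus: one or more occurrences of 'a'.
    print(re.fullmatch(r"a+", "") is not None)      # False
    print(re.fullmatch(r"a+", "aa") is not None)    # True

    # Anchors: ^ matches the start and $ the end of a string
    # (add re.MULTILINE to anchor at each line instead).
    print(re.findall(r"^The", "The end. The start."))  # ['The']
    print(re.findall(r"end\.$", "It is the end."))     # ['end.']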
Disjunction, Grouping, and Precedence
Disjunction allows matching of alternatives using the pipe symbol '|', e.g., /cat|dog/ matches either 'cat' or 'dog'.
Grouping is done with parentheses '()' to apply operators to an entire expression, e.g., /(abc|def)g/ matches 'abcg' or 'defg'.
Operator precedence determines the order of evaluation: parentheses bind most tightly, followed by the counters '*' and '+', then sequences, and finally disjunction '|'; e.g., /a*|b/ is read as /(a*)|b/, with the star applying only to 'a'.
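A short Python sketch of disjunction, grouping, and precedence (illustrative strings only):

    import re

    # Disjunction: /cat|dog/ matches either alternative.
    print(re.findall(r"cat|dog", "the cat chased the dog"))  # ['cat', 'dog']

    # Grouping: the trailing 'g' applies to both alternatives.
    # (?: ... ) is a non-capturing group so findall returns whole matches.
    print(re.findall(r"(?:abc|def)g", "abcg defg xyzg"))     # ['abcg', 'defg']

    # Precedence: the star binds more tightly than the alternation,
    # so /a*|b/ means (a*)|b and cannot match the whole string 'aab'.
    print(re.fullmatch(r"a*|b", "aab") is None)              # True
    print(re.fullmatch(r"(?:a|b)*", "aab") is not None)      # True (grouping changes the meaning)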
A Simple Regular Expression Example
Basic Regex for 'the'
Case Insensitive Matching
Word Boundaries
The regular expression /the/ matches the lowercase word 'the' in any string. For example, it will find 'the cat sat on the mat'.
Using the regex /[tT]he/, we can match both 'the' and 'The'. This is important for capturing instances at the beginning of sentences.
The expression /\b[tT]he\b/ ensures that 'the' is matched as a whole word, preventing false positives like 'there' or 'other'.
Getting “the” with regex
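A minimal Python sketch of this progressive refinement (the sentence is an invented example):

    import re

    sentence = "The other day the cat sat there on the mat."

    # /the/ misses the capitalized 'The' and also matches inside 'other' and 'there'.
    print(re.findall(r"the", sentence))         # ['the', 'the', 'the', 'the']

    # /[tT]he/ also catches the sentence-initial 'The', but still over-matches.
    print(re.findall(r"[tT]he", sentence))      # ['The', 'the', 'the', 'the', 'the']

    # /\b[tT]he\b/ adds word boundaries, so 'other' and 'there' no longer match.
    print(re.findall(r"\b[tT]he\b", sentence))  # ['The', 'the', 'the']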
Introduction to Corpora
Understanding Corpora in NLP
Examples of Notable Corpora
Introduction to Text Normalization
Text normalization is the process of transforming text into a consistent format that is more suitable for analysis. In Natural Language Processing (NLP), it is crucial because it helps in improving the accuracy of text-based algorithms by reducing variability in the data. This includes handling variations in spelling, punctuation, and grammatical structures, which can otherwise lead to inconsistencies and misinterpretations in data processing. By normalizing text, NLP systems can better understand and process language, ultimately enhancing machine learning models and improving the performance of applications such as sentiment analysis, information retrieval, and language translation.
Unix Tools for Word Tokenization
The 'tr' command translates or deletes characters; replacing every non-alphabetic character with a newline puts each word on its own line, effectively tokenizing the text.
The 'sort' command then orders these one-word lines alphabetically, so that identical words become adjacent and are ready for counting.
Finally, 'uniq -c' collapses adjacent duplicate lines and prefixes each with its count, yielding a frequency distribution of the words in the text.
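For readers who prefer scripting, the sketch below is a rough Python equivalent of that pipeline (the input file name is a placeholder):

    import re
    from collections import Counter

    # Rough equivalent of: tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c
    with open("input.txt", encoding="utf-8") as f:
        text = f.read()

    # Keep only runs of alphabetic characters, one word per list element.
    words = re.findall(r"[A-Za-z]+", text)

    # Counter plays the role of sort | uniq -c: a word frequency distribution.
    counts = Counter(words)
    for word, freq in counts.most_common(10):
        print(freq, word)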
Types of Text Normalization
Tokenization
Lemmatization
Stemming
Edit Distance
Tokenization breaks text into smaller units, such as words or phrases. It prepares the text for further processing, identifying emoticons and hashtags as tokens.
Lemmatization reduces words to their base or root form. For example, 'ran' becomes 'run', ensuring that different forms of a word are treated as the same item.
Stemming strips affixes from words using simple rules to approximate a root form. For instance, the Porter stemmer reduces 'happiness' to 'happi'; the result need not be a real word and may lose some grammatical context (see the sketch after this list).
Edit distance quantifies the difference between two strings by counting the minimum number of operations (insertions, deletions, substitutions) required to transform one string into another.
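As a small illustration of the difference between stemming and lemmatization, here is a sketch using NLTK's PorterStemmer and WordNetLemmatizer (assuming NLTK and its WordNet data are installed; the outputs shown in comments are the expected results):

    # pip install nltk; the WordNet data must be downloaded once, e.g.:
    # import nltk; nltk.download("wordnet")  (some NLTK versions also need "omw-1.4")
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Lemmatization maps an inflected form to a dictionary word.
    print(lemmatizer.lemmatize("ran", pos="v"))   # 'run'

    # Stemming applies suffix-stripping rules; the result need not be a word.
    print(stemmer.stem("happiness"))              # 'happi'
    print(stemmer.stem("running"))                # 'run'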
Tokenization
What is Tokenization?
Lemmatization
Understanding Lemmatization
Stemming
Understanding Stemming
Edit Distance
Definition and Purpose
Examples of Edit Distance
Word Tokenization Techniques
Top-Down Tokenization
Byte-Pair Encoding (BPE)
Advantages of BPE
Challenges in Tokenization
Top-down tokenization relies on predefined rules to segment text into words. It effectively captures punctuation and special characters, but can struggle with edge cases.
BPE is a bottom-up approach that merges the most frequent pairs of characters or subwords iteratively. It reduces vocabulary size and handles unknown words well.
BPE handles rare words efficiently by breaking them into subword units, which allows better generalization across different text domains (a minimal training sketch follows this list).
Both techniques can face challenges: top-down methods may miss context-specific nuances, while BPE requires a large corpus for effective training and may result in longer sequences.
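The following is a minimal sketch of the BPE training loop described above, on a toy corpus chosen only for illustration (real systems train on much larger corpora and store the learned merges for later tokenization):

    from collections import Counter

    def get_pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the given pair with one merged symbol."""
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        return merged

    # Toy corpus: each word is a tuple of characters plus an end-of-word marker '_'.
    vocab = {
        ("l", "o", "w", "_"): 5,
        ("l", "o", "w", "e", "r", "_"): 2,
        ("n", "e", "w", "e", "s", "t", "_"): 6,
        ("w", "i", "d", "e", "s", "t", "_"): 3,
    }

    for step in range(5):
        best_pair = get_pair_counts(vocab).most_common(1)[0][0]
        vocab = merge_pair(best_pair, vocab)
        print("merge", step + 1, ":", best_pair)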
Regular Expressions
Definition and Origin
Applications and Examples
Words in NLP
Corpus
Utterance
Lemma and Tokens
A corpus (plural: corpora) is a computer-readable collection of texts or speech samples, used for linguistic analysis and natural language processing tasks.
An utterance refers to a spoken sentence or any verbal expression, including disturbances like 'umm' or repetitions, which are relevant in speech processing.
A lemma is the shared base form of a set of inflected words (e.g., 'run' for 'runs', 'ran', 'running'). Word tokens are the running words of a text (the total count), while word types are the distinct words; together they indicate the length of a text and the diversity of its vocabulary.
Steps in Text Normalization
Tokenization
Normalizing Word Fragments
Segmenting Sentences
Final Review and Quality Check
Tokenization segments text into individual words or phrases, transforming a continuous string into manageable units. For example, simple Unix tools such as tr can extract word tokens from raw text (a compact sketch of these steps follows this list).
This step standardizes word forms to a common format, including lemmatization and stemming. For instance, 'running' is normalized to 'run'.
Segmenting sentences breaks text into individual sentences, crucial for understanding structure. For example, this aids in analyzing Shakespeare's poetic works.
In the final step, a review of the normalized text ensures accuracy and consistency, checking for errors in tokenization or normalization.
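A compact sketch of the first three steps on a toy string (regex-based tokenization, simple lowercasing as the normalization step, and a naive sentence splitter; real systems use more careful rules, as discussed below):

    import re

    text = "The cats were running. They ran fast!"

    # Step 1: tokenization -- split the string into word and punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)

    # Step 2: normalizing word forms -- here simply lowercasing;
    # lemmatization or stemming could be applied instead.
    normalized = [t.lower() for t in tokens]

    # Step 3: sentence segmentation -- a naive split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)

    print(tokens)      # ['The', 'cats', 'were', 'running', '.', 'They', 'ran', 'fast', '!']
    print(normalized)
    print(sentences)   # ['The cats were running.', 'They ran fast!']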
Deliverables of each step: a tokens list, token map, and token count; the normalized words, lemmas, and stemmed forms; the segmented sentences with sentence map and sentence count; and a quality report, error log, and the final normalized text.
Difficulties in Text Normalization
Handling Punctuations
Special Characters and URLs
Contractions and Multi-word Expressions
Punctuations within words (e.g., PhD, AT&T, Cap'n) complicate tokenization, requiring special rules for accurate segmentation.
Special formats such as dates (e.g., 6/1/23), URLs, and hashtags pose challenges for normalization, necessitating specific parsing rules (see the tokenizer sketch after this list).
Expanding contractions (e.g., don't to do not) and recognizing multi-word expressions (e.g., New York City) are crucial for accurate normalization.
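Below is a hedged sketch of a rule-based (top-down) tokenizer pattern that keeps some of these cases intact; the pattern and the example sentence are illustrative only and far from complete:

    import re

    # Alternatives are tried left to right, so more specific rules come first:
    # URLs, hashtags, words with internal punctuation (AT&T, Cap'n, don't, Ph.D.),
    # dates, ordinary words, and finally any leftover punctuation mark.
    TOKEN_PATTERN = r"""
        https?://[^\s,]+            # URLs
      | \#\w+                       # hashtags
      | \w+(?:[.&']\w+)+\.?         # AT&T, Cap'n, don't, Ph.D.
      | \d{1,2}/\d{1,2}/\d{2,4}     # dates such as 6/1/23
      | \w+                         # ordinary words
      | [^\w\s]                     # any remaining punctuation mark
    """

    text = "Cap'n Smith (AT&T) tweeted #NLP on 6/1/23: see https://example.org, don't miss it."
    print(re.findall(TOKEN_PATTERN, text, re.VERBOSE))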
Porter Stemmer
Understanding the Porter Stemmer
String Similarity
Understanding String Similarity
Minimum Edit Distance
Concept of Minimum Edit Distance
Importance in NLP
Mathematical Representation
Minimum edit distance quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into another. Common operations include insertion, deletion, and substitution.
Minimum edit distance is crucial for various NLP tasks, notably in spelling correction. It helps identify the closest correct word by measuring the cost to change a misspelled word into a valid one.
The minimum edit distance D(i, j) between the first i characters of the source and the first j characters of the target can be calculated recursively. The base cases are D(i, 0) = i and D(0, j) = j. For i, j > 0:

D(i, j) = min( D(i-1, j) + 1                              (deletion),
               D(i, j-1) + 1                              (insertion),
               D(i-1, j-1) + cost(source[i], target[j])   (substitution) )

where cost(source[i], target[j]) is 0 if source[i] = target[j], and 1 otherwise.
Minimum Edit Distance Algorithm
Overview of the Minimum Edit Distance Algorithm
Mathematical Representation
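A direct Python implementation of the recurrence above (a sketch using the unit costs defined earlier: deletion and insertion cost 1, substitution costs 0 for identical characters and 1 otherwise):

    def min_edit_distance(source: str, target: str) -> int:
        """Dynamic-programming minimum edit distance with unit costs."""
        n, m = len(source), len(target)
        # D[i][j] = distance between source[:i] and target[:j].
        D = [[0] * (m + 1) for _ in range(n + 1)]

        # Base cases: transforming to or from the empty string.
        for i in range(1, n + 1):
            D[i][0] = i
        for j in range(1, m + 1):
            D[0][j] = j

        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub_cost = 0 if source[i - 1] == target[j - 1] else 1
                D[i][j] = min(
                    D[i - 1][j] + 1,              # deletion
                    D[i][j - 1] + 1,              # insertion
                    D[i - 1][j - 1] + sub_cost,   # substitution (or copy)
                )
        return D[n][m]

    print(min_edit_distance("intention", "execution"))  # 5 under these unit costs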
Applications of Edit Distance
Spelling Correction
Search Query Matching
Coreference Resolution
Text Similarity Measurement
Minimum edit distance is widely used in spelling correction applications. By calculating the distance between a misspelled word and candidate corrections, a system can suggest the closest match (see the usage sketch after this list).
Search engines utilize edit distance algorithms to improve search results, matching queries against indexed terms and suggesting alternatives with similar spellings.
In natural language processing, edit distance assists in coreference resolution. It helps determine if two phrases refer to the same entity by measuring the similarity between them.
Edit distance is employed in text similarity tasks, enabling systems to quantify the similarity between documents. This is useful for plagiarism detection and duplicate content identification.
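As a small usage example, the min_edit_distance function sketched earlier can rank candidate corrections for a misspelled word (the misspelling and candidate list are invented for illustration):

    # Reuses min_edit_distance from the earlier sketch.
    misspelled = "graffe"
    candidates = ["giraffe", "graft", "grail", "graf"]

    for word in candidates:
        print(word, min_edit_distance(misspelled, word))  # 'giraffe' has the smallest distance (1)

    best = min(candidates, key=lambda w: min_edit_distance(misspelled, w))
    print("suggestion:", best)   # 'giraffe'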
Conclusion
This chapter highlighted the critical role of regular expressions, text normalization, and edit distance in natural language processing (NLP). Regular expressions serve as powerful tools for pattern matching, enabling efficient text search and manipulation. Text normalization processes, including tokenization, lemmatization, and stemming, are essential for standardizing text data, which enhances the accuracy of NLP models. The minimum edit distance algorithm is crucial for tasks such as spell checking and string comparison, allowing for the quantification of differences between strings. Together, these concepts form the backbone of many NLP applications, facilitating better understanding and processing of human language.