Introduction to Speech & Natural Language Processing
Lecture 2
Lexical Processing in NLP Part 1
Krishnendu Ghosh
Regular Expression (RE)
A formal language for specifying text strings
How can we search for mentions of these cute animals in text?
woodchuck
woodchucks
Woodchuck
Woodchucks
Groundhog
groundhogs
Disjunctions
Letters inside square brackets []
Ranges using the dash [A-Z]
Pattern | Matches |
[wW]oodchuck | Woodchuck, woodchuck |
[1234567890] | Any one digit |
Pattern | Matches | Examples |
[A-Z] | An upper case letter | Drenched Blossoms |
[a-z] | A lower case letter | my beans were impatient |
[0-9] | A single digit | Chapter 1: Down the Rabbit Hole |
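The bracket patterns above can be tried out directly. The following is a minimal sketch using Python's standard re module; the choice of Python and the helper name first_match are assumptions made for illustration and are not part of the lecture. re.search scans left to right and returns the first match.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

# [wW] accepts either capitalization of the first letter.
mentions = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")
print(mentions)  # ['Woodchuck', 'woodchuck']

# Ranges: the first character in each example that falls in the range.
print(first_match(r"[A-Z]", "Drenched Blossoms"))                # D
print(first_match(r"[a-z]", "my beans were impatient"))          # m
print(first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))  # 1
```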
Negation in Disjunction
Caret as first character in [] negates the list
Note: Caret means negation only when it's first in []
Special characters (., *, +, ?) lose their special meaning inside []
Pattern | Matches | Examples |
[^A-Z] | Not an uppercase letter | Oyfn pripetchik |
[^Ss] | Neither ‘S’ nor ‘s’ | I have no exquisite reason |
[^.] | Not a period | Our resident Djinn |
[e^] | Either e or ^ | Look up ^ now |
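To check the negation rows above, here is a small sketch in Python's re module (an assumed tool, not one named in the lecture). It reports the first character each pattern matches in the corresponding example string.

```python
import re

# (pattern, example text) pairs taken from the table above.
examples = [
    (r"[^A-Z]", "Oyfn pripetchik"),           # first non-uppercase character
    (r"[^Ss]", "I have no exquisite reason"), # first character that is not S or s
    (r"[^.]", "Our resident Djinn"),          # first character that is not a period
    (r"[e^]", "Look up ^ now"),               # caret is literal when not first
]

results = [re.search(pattern, text).group() for pattern, text in examples]
print(results)  # ['y', 'I', 'O', '^']
```

Note that in the last pattern the caret is not the first character inside the brackets, so it is matched literally.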
Convenient aliases
Pattern | Expansion | Matches | Examples |
\d | [0-9] | Any digit | Fahrenheit 451 |
\D | [^0-9] | Any non-digit | Blue Moon |
\w | [a-zA-Z0-9_] | Any alphanumeric or _ | Daiyu |
\W | [^\w] | Not alphanumeric or _ | Look! |
\s | [ \r\t\n\f] | Whitespace (space, tab) | Look␣up |
\S | [^\s] | Not whitespace | Look up |
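The aliases above can be sketched in Python as follows. One caveat, stated as an assumption about the tool rather than about the lecture: in Python 3 the aliases \d, \w and \s are Unicode-aware by default, so the re.ASCII flag is used here to get behavior close to the expansions in the table (Python's ASCII \s also includes the vertical tab). The helper name first is hypothetical.

```python
import re

def first(pattern, text):
    """First match of pattern in text, with ASCII-only alias semantics."""
    return re.search(pattern, text, flags=re.ASCII).group()

print(first(r"\d", "Fahrenheit 451"))  # 4
print(first(r"\D", "Blue Moon"))       # B
print(first(r"\w", "Daiyu"))           # D
print(first(r"\W", "Look!"))           # !
print(repr(first(r"\s", "Look up")))   # ' '
print(first(r"\S", "Look up"))         # L
```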
More Disjunction
Groundhog is another name for woodchuck!
The pipe symbol | for disjunction
Pattern | Matches |
groundhog|woodchuck | woodchuck |
yours|mine | yours |
a|b|c | = [abc] |
[gG]roundhog|[Ww]oodchuck | Woodchuck |
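As a sketch of the pipe operator in practice, the following Python snippet searches for either animal name. The sample sentence is made up for illustration and does not come from the lecture.

```python
import re

# Hypothetical sample text containing both names in different forms.
text = "The Woodchuck is also called a groundhog; woodchucks dig burrows."

pattern = r"[gG]roundhog|[Ww]oodchuck"
found = re.findall(pattern, text)
print(found)  # ['Woodchuck', 'groundhog', 'woodchuck']

# For single characters, a|b|c matches the same things as [abc].
with_pipe = re.findall(r"a|b|c", "cabbage")
with_class = re.findall(r"[abc]", "cabbage")
print(with_pipe == with_class)  # True
```

The plural "woodchucks" is found only up to "woodchuck", because the pattern says nothing about a trailing s.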
Wildcards, optionality, repetition
Pattern | Matches | Examples |
beg.n | Any char | begin begun beg3n beg n |
woodchucks? | Optional s | woodchuck woodchucks |
to* | 0 or more of previous char | t to too tooo |
to+ | 1 or more of previous char | to too tooo toooo |
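The operators in the table above can be checked with a short Python sketch. The helper name matching is hypothetical; it uses re.fullmatch so that a candidate counts only if the whole string fits the pattern. A few candidates not listed in the table (begn, woodchuckss) are added as negative cases.

```python
import re

def matching(pattern, candidates):
    """Keep the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# . matches any single character (but exactly one).
print(matching(r"beg.n", ["begin", "begun", "beg3n", "beg n", "begn"]))
# ['begin', 'begun', 'beg3n', 'beg n']

# ? makes the preceding character optional.
print(matching(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"]))
# ['woodchuck', 'woodchucks']

# * allows zero or more of the preceding character; + requires at least one.
print(matching(r"to*", ["t", "to", "too", "tooo"]))  # ['t', 'to', 'too', 'tooo']
print(matching(r"to+", ["t", "to", "too", "tooo"]))  # ['to', 'too', 'tooo']
```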
REs play a surprisingly large role
Widely used in both academia and industry
Part of most text processing tasks, including text formatting and pre-processing, even in pipelines for large neural language models
Very useful for data analysis of any text data
Simple Application: ELIZA
Early NLP system that imitated a Rogerian psychotherapist (Joseph Weizenbaum, 1966)
Uses pattern matching to match inputs such as “I need X”
and translates them into responses such as “What would it mean to you if you got X?”
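The match-and-translate idea can be sketched with regular expression capture groups. This is a toy illustration only, not Weizenbaum's actual program: the rule table RULES, the function respond, and the fallback reply are all hypothetical, and only the "I need X" rule is taken directly from the slide.

```python
import re

# Hypothetical rule table: (compiled pattern, response template).
# \1 in a template is replaced by the text captured by the first group.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE),
     r"What would it mean to you if you got \1?"),
    (re.compile(r"\bI'?m (depressed|sad)\b", re.IGNORECASE),
     r"I am sorry to hear you are \1"),
]

def respond(utterance):
    """Return the response for the first rule whose pattern is found."""
    text = utterance.rstrip(".!?")  # keep trailing punctuation out of X
    for pattern, template in RULES:
        m = pattern.search(text)
        if m:
            return m.expand(template)
    return "Please go on."  # assumed fallback when no rule applies

print(respond("I need a vacation."))
# What would it mean to you if you got a vacation?
print(respond("He says I'm depressed much of the time."))
# I am sorry to hear you are depressed
```

Rules are tried in order, so in a sketch like this more specific patterns would be listed before more general ones.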
Simple Application: ELIZA
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
How many words in a sentence?
"I do uh main- mainly business data processing"
Fragments, filled pauses
"Seuss’s cat in the hat is different from other cats!"
Lemma: same stem, part of speech, rough word sense
cat and cats = same lemma
Wordform: the full inflected surface form
cat and cats = different wordforms
How many words in a sentence?
they lay back on the San Francisco grass & looked at the stars and their …
How many?
15 tokens (or 14)
13 types (or 12) (or 11?)
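The token and type counts above can be reproduced with a few lines of Python. This sketch makes two assumptions: the ampersand is written out as "and" (which is what makes "and" occur twice and gives 13 types), and tokens are obtained by simple whitespace splitting.

```python
# The example sentence, with "&" normalized to "and" (an assumption).
sentence = ("they lay back on the San Francisco grass and looked "
            "at the stars and their")

tokens = sentence.split()   # every running word counts
types = set(tokens)         # distinct words only
print(len(tokens), len(types))  # 15 13

# Counting "San Francisco" as a single word gives the alternative figures.
merged = sentence.replace("San Francisco", "San_Francisco").split()
print(len(merged), len(set(merged)))  # 14 12
```

The remaining figure of 11 types would presumably come from additionally treating "they" and "their" as the same lemma, which this sketch does not attempt.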
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types, |V| is size of vocabulary
Heaps' Law = Herdan's Law: |V| = kN^β, where k is a positive constant and often .67 < β < .75
i.e., vocabulary size grows faster than the square root of the number of word tokens
Corpus | Tokens = N | Types = |V| |
Switchboard phone conversations | 2.4 million | 20 thousand |
Shakespeare | 884,000 | 31 thousand |
COCA | 440 million | 2 million |
Google N-grams | 1 trillion | 13+ million |
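Heaps' Law, |V| = kN^β, can be made concrete with a small worked example. The constants K and BETA below are hypothetical values chosen only for illustration (β is picked inside the .67 to .75 range quoted above); they are not fitted to any of the corpora in the table.

```python
def heaps_vocab(n_tokens, k, beta):
    """Vocabulary size predicted by Heaps' law: |V| = k * N**beta."""
    return k * n_tokens ** beta

K, BETA = 30.0, 0.70  # hypothetical constants, for illustration only

small = heaps_vocab(1_000_000, K, BETA)    # 1 million tokens
large = heaps_vocab(100_000_000, K, BETA)  # 100 times more tokens

growth = large / small       # equals 100**BETA; k cancels out
sqrt_growth = 100 ** 0.5     # what square-root growth would give
print(round(growth, 1), sqrt_growth)  # 25.1 10.0
```

With 100 times more tokens the predicted vocabulary grows by about 25 times, compared with 10 times under pure square-root growth, which is the sense in which vocabulary grows faster than the square root of N.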
How many words in a sentence?
bias > baa yya as
diased > daa yya asd
biased > baa yya asd