1 of 15

Introduction to Speech & Natural Language Processing

Lecture 2

Lexical Processing in NLP Part 1

Krishnendu Ghosh

2 of 15

Regular Expression (RE)

A formal language for specifying text strings

How can we search for mentions of these cute animals in text?

woodchuck

woodchucks

Woodchuck

Woodchucks

Groundhog

groundhogs

3 of 15

Disjunctions

Letters inside square brackets []

Ranges using the dash [A-Z]

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any one digit

Pattern    Matches                 Examples
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
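A minimal sketch of bracket disjunctions and ranges using Python's built-in re module. The sample string is made up for illustration and is not part of the lecture.

```python
import re

# Illustrative sample text (an assumption, not taken from the slides).
text = "Chapter 1: Woodchuck or woodchuck?"

# [wW] matches either a lowercase or an uppercase w in that one position.
woodchucks = re.findall(r"[wW]oodchuck", text)

# [0-9] matches any single digit; [A-Z] matches any single uppercase ASCII letter.
digits = re.findall(r"[0-9]", text)
uppercase = re.findall(r"[A-Z]", text)

print(woodchucks)  # ['Woodchuck', 'woodchuck']
print(digits)      # ['1']
print(uppercase)   # ['C', 'W']
```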

4 of 15

Negation in Disjunction

Caret as first character in [] negates the list

Note: Caret means negation only when it's first in []

Special characters (., *, +, ?) lose their special meaning inside []

Pattern    Matches                    Examples
[^A-Z]     Not an uppercase letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’        I have no exquisite reason
[^.]       Not a period               Our resident Djinn
[e^]       Either e or ^              Look up ^ now
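The negation rules can be checked with a short sketch in Python's re module. The strings "Sass" and "a.b." are made-up inputs chosen to keep the results easy to read.

```python
import re

# A caret as the first character inside [] negates the set:
# the first character of the string that is NOT an uppercase letter.
first_not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# Every character that is neither 'S' nor 's'.
not_s = re.findall(r"[^Ss]", "Sass")

# Inside [], the period is a literal period, so [^.] means "anything but a period".
not_period = "".join(re.findall(r"[^.]", "a.b."))

# A caret that is not first is a literal caret: [e^] matches 'e' or '^'.
caret_or_e = re.findall(r"[e^]", "Look up ^ now")

print(first_not_upper)  # y
print(not_s)            # ['a']
print(not_period)       # ab
print(caret_or_e)       # ['^']
```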

5 of 15

Convenient aliases

Pattern    Expansion       Matches                    Examples
\d         [0-9]           Any digit                  Fahrenheit 451
\D         [^0-9]          Any non-digit              Blue Moon
\w         [a-zA-Z0-9_]    Any alphanumeric or _      Daiyu
\W         [^\w]           Not alphanumeric or _      Look!
\s         [ \r\t\n\f]     Whitespace (space, tab)    Look up
\S         [^\s]           Not whitespace             Look up
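A small sketch of the aliases in Python. One caveat worth knowing: for ordinary Python strings, \d, \w and \s also match non-ASCII digits, letters and spaces by default, so the re.ASCII flag is passed here to get exactly the expansions listed in the table.

```python
import re

# re.ASCII restricts \d, \w and \s to the ASCII expansions shown in the table.
digits = re.findall(r"\d", "Fahrenheit 451", flags=re.ASCII)
non_word = re.findall(r"\W", "Look!", flags=re.ASCII)
words = re.findall(r"\w+", "Look up", flags=re.ASCII)
whitespace = re.findall(r"\s", "Look up", flags=re.ASCII)
non_space_runs = re.findall(r"\S+", "Look up", flags=re.ASCII)

print(digits)          # ['4', '5', '1']
print(non_word)        # ['!']
print(words)           # ['Look', 'up']
print(whitespace)      # [' ']
print(non_space_runs)  # ['Look', 'up']
```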

6 of 15

More Disjunction

Groundhog is another name for woodchuck!

The pipe symbol | for disjunction

Pattern                      Matches
groundhog|woodchuck          woodchuck
yours|mine                   yours
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    Woodchuck
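The pipe can be combined with bracket disjunctions, as in this sketch with Python's re module. The word "cabbage" is an arbitrary test string used only to show that a|b|c and [abc] behave the same.

```python
import re

text = "Groundhog is another name for woodchuck!"

# Either animal name, with either capitalization of the first letter.
animals = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)

# a|b|c and [abc] match the same single characters.
with_pipe = re.findall(r"a|b|c", "cabbage")
with_brackets = re.findall(r"[abc]", "cabbage")

print(animals)                     # ['Groundhog', 'woodchuck']
print(with_pipe == with_brackets)  # True
```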

7 of 15

Wildcards, optionality, repetition

Pattern        Matches                       Examples
beg.n          Any char                      begin begun beg3n beg n
woodchucks?    Optional s                    woodchuck woodchucks
to*            0 or more of previous char    t to too tooo
to+            1 or more of previous char    to too tooo toooo
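A sketch that checks the wildcard, optionality and repetition operators in Python. re.fullmatch is used so that a candidate counts only when the whole string fits the pattern; the candidate lists are illustrative.

```python
import re

# '.' matches any single character (except a newline by default).
wildcard = re.findall(r"beg.n", "begin begun beg3n beg n")

# '?' makes the preceding character optional.
optional_s = [w for w in ["woodchuck", "woodchucks", "woodchuckss"]
              if re.fullmatch(r"woodchucks?", w)]

# '*' allows zero or more of the preceding character; '+' requires at least one.
candidates = ["t", "to", "too", "tooo"]
star = [w for w in candidates if re.fullmatch(r"to*", w)]
plus = [w for w in candidates if re.fullmatch(r"to+", w)]

print(wildcard)    # ['begin', 'begun', 'beg3n', 'beg n']
print(optional_s)  # ['woodchuck', 'woodchucks']
print(star)        # ['t', 'to', 'too', 'tooo']
print(plus)        # ['to', 'too', 'tooo']
```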

8 of 15

REs play a surprisingly large role

Widely used in both academia and industry

Part of most text-processing tasks, including the text formatting and pre-processing steps of large neural language model pipelines

Very useful for data analysis of any text data


9 of 15

Simple Application: ELIZA

Early NLP system that imitated a Rogerian psychotherapist (Joseph Weizenbaum, 1966).

Uses pattern matching to match inputs such as “I need X”

and translate them into responses such as “What would it mean to you if you got X?”
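The "I need X" rule can be sketched with a capture group in Python. This is only an illustration of the idea, not Weizenbaum's original implementation; the function name respond and the fallback reply are assumptions made for this example.

```python
import re

def respond(utterance: str) -> str:
    """Apply one ELIZA-style rule: capture X in 'I need X' and reuse it in the reply."""
    # (.+?) captures X lazily; [.!?]*$ drops any trailing punctuation.
    match = re.search(r"\bI need (.+?)[.!?]*$", utterance, flags=re.IGNORECASE)
    if match:
        return f"What would it mean to you if you got {match.group(1)}?"
    # Hypothetical fallback for inputs the single rule does not cover.
    return "Please go on."

print(respond("I need a vacation."))
# What would it mean to you if you got a vacation?
```

A fuller system would hold an ordered list of such pattern and response pairs and apply the first one that matches.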


10 of 15

Simple Application: ELIZA

Men are all alike.

IN WHAT WAY

They're always bugging us about something or other.

CAN YOU THINK OF A SPECIFIC EXAMPLE

Well, my boyfriend made me come here.

YOUR BOYFRIEND MADE YOU COME HERE

He says I'm depressed much of the time.

I AM SORRY TO HEAR YOU ARE DEPRESSED

11 of 15

How many words in a sentence?

"I do uh main- mainly business data processing"

Fragments, filled pauses

"Seuss’s cat in the hat is different from other cats!"

Lemma: same stem, part of speech, rough word sense

cat and cats = same lemma

Wordform: the full inflected surface form

cat and cats = different wordforms

12 of 15

How many words in a sentence?

they lay back on the San Francisco grass and looked at the stars and their …

  • Type: an element of the vocabulary.
  • Token: an instance of that type in running text.

How many?

15 tokens (or 14)

13 types (or 12) (or 11?)
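The counts above can be reproduced with a few lines of Python, assuming simple whitespace tokenization and the sentence written with "and" spelled out both times. The underscore join for "San Francisco" is just one illustrative way to treat the city name as a single token.

```python
sentence = "they lay back on the San Francisco grass and looked at the stars and their"

# Tokens are instances in running text; types are the distinct vocabulary items.
tokens = sentence.split()
types = set(tokens)
print(len(tokens), len(types))  # 15 13

# Treating "San Francisco" as one word gives the alternative counts.
merged_tokens = sentence.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)
print(len(merged_tokens), len(merged_types))  # 14 12
```

The count of 11 types would additionally require grouping "they" and "their" under one lemma, which plain string comparison cannot do.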

13 of 15

How many words in a corpus?

N = number of tokens

V = vocabulary = set of types, |V| is size of vocabulary

Heaps' Law = Herdan's Law: |V| = kN^β, where k is a positive constant and often 0.67 < β < 0.75

i.e., vocabulary size grows faster than the square root of the number of word tokens

Corpus                             Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
COCA                               440 million    2 million
Google N-grams                     1 trillion     13+ million
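A worked sketch of what the Heaps' Law relation |V| = kN^β implies: doubling the number of tokens multiplies the predicted vocabulary by 2^β, whatever k is. The value k = 10 and the corpus size below are hypothetical, chosen only to make the code runnable; β = 0.5 is included as the square-root baseline.

```python
def heaps_vocab(n_tokens: float, k: float, beta: float) -> float:
    """Predicted vocabulary size |V| = k * N**beta."""
    return k * n_tokens ** beta

k = 10.0       # hypothetical constant; it cancels out in the ratio below
n = 1_000_000  # hypothetical corpus size in tokens

# Growth factor of |V| when the corpus doubles, for several values of beta.
growth = {
    beta: heaps_vocab(2 * n, k, beta) / heaps_vocab(n, k, beta)
    for beta in (0.5, 0.67, 0.75)
}

for beta, factor in growth.items():
    print(f"beta = {beta}: doubling N multiplies |V| by {factor:.3f}")
```

With β between 0.67 and 0.75 the factor is larger than the square-root case, which is the sense in which vocabulary grows faster than the square root of the token count.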

14 of 15

How many words in a sentence?

bias > baa yya as

diased > daa yya asd

biased > baa yya asd

15 of 15