1 of 29

Fundamentals of SNLP

Lecture 3

Lexicalized PCFG; POS Tagging

Krishnendu Ghosh

Source: Chris Manning PPTs

2 of 29

Lexicalization of PCFGs

Introduction

Christopher Manning

3 of 29

(Head) Lexicalization of PCFGs [Magerman 1995; Collins 1997; Charniak 1997]

The head word of a phrase gives a good representation of the phrase’s structure and meaning
Puts the properties of words back into a PCFG

5 of 29

(Head) Lexicalization of PCFGs [Magerman 1995; Collins 1997; Charniak 1997]

Word-to-word affinities are useful for certain ambiguities

PP attachment is now (partly) captured in a local PCFG rule.

Think about: What useful information isn’t captured?

Also useful for: coordination scope, verb complement patterns

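[Figure: two lexicalized parse trees. In “announce rates for January”, the PP headed by “for” attaches to the NP headed by “rates”. In “announce rates in January”, the PP headed by “in” attaches to the VP headed by “announce”.]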

6 of 29

Lexicalized parsing was seen as the parsing breakthrough of the late 1990s

Eugene Charniak, 2000 JHU workshop: “To do better, it is necessary to condition probabilities on the actual words of the sentence. This makes the probabilities much tighter:

p(VP → V NP NP) = 0.00151
p(VP → V NP NP | said) = 0.00001
p(VP → V NP NP | gave) = 0.01980 ”

Michael Collins, 2003 COLT tutorial: “Lexicalized Probabilistic Context-Free Grammars … perform vastly better than PCFGs (88% vs. 73% accuracy)”

7 of 29

Lexicalization of PCFGs

The model of Charniak (1997)

8 of 29

Charniak (1997)

A very straightforward model of a lexicalized PCFG
Probabilistic conditioning is “top-down” like a regular PCFG

But actual parsing is bottom-up, somewhat like the CKY algorithm we saw

9 of 29

Charniak (1997) example

10 of 29

Lexicalization models argument selection by sharpening rule expansion probabilities

The probability of different verbal complement frames (i.e., “subcategorizations”) depends on the verb:

Local Tree	come	take	think	want
VP → V	9.5%	2.6%	4.6%	5.7%
VP → V NP	1.1%	32.1%	0.2%	13.9%
VP → V PP	34.5%	3.1%	7.1%	0.3%
VP → V SBAR	6.6%	0.3%	73.0%	0.2%
VP → V S	2.2%	1.3%	4.8%	70.8%
VP → V NP S	0.1%	5.7%	0.0%	0.3%
VP → V PRT NP	0.3%	5.8%	0.0%	0.0%
VP → V PRT PP	6.1%	1.5%	0.2%	0.0%

“monolexical” probabilities
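
As a rough illustration of where such a table comes from, the sketch below estimates P(rule | head verb) by relative frequency. The observations are made-up toy data, not treebank counts, and `rule_probs_by_head` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def rule_probs_by_head(observations):
    """Relative-frequency estimate of P(rule | head word).

    observations: iterable of (head_word, rule) pairs, one per VP seen.
    """
    counts = defaultdict(Counter)
    for head, rule in observations:
        counts[head][rule] += 1
    probs = {}
    for head, rule_counts in counts.items():
        total = sum(rule_counts.values())
        probs[head] = {rule: n / total for rule, n in rule_counts.items()}
    return probs

# Toy, invented observations (assumption: not real corpus counts).
observations = (
    [("think", "VP -> V SBAR")] * 3
    + [("think", "VP -> V PP")] * 1
    + [("take", "VP -> V NP")] * 3
    + [("take", "VP -> V")] * 1
)

probs = rule_probs_by_head(observations)
for head in sorted(probs):
    for rule, p in sorted(probs[head].items()):
        print(f"P({rule} | {head}) = {p:.2f}")
```

Conditioning on the head verb splits one pooled distribution over VP expansions into one distribution per verb, which is what makes the probabilities sharper.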

11 of 29

Lexicalization sharpens probabilities: Predicting heads

“Bilexical probabilities”

P(prices | n-plural) = .013
P(prices | n-plural, NP) = .013
P(prices | n-plural, NP, S) = .025
P(prices | n-plural, NP, S, v-past) = .052
P(prices | n-plural, NP, S, v-past, fell) = .146

12 of 29

Charniak (1997) linear interpolation/shrinkage
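
A minimal sketch of linear interpolation in general, not Charniak's exact formula: the most specific estimate rests on sparse counts, so it is mixed with estimates from less specific contexts. The inputs reuse the P(prices | …) values from the “Predicting heads” slide purely as example numbers; the weights are hypothetical and would normally be tuned on held-out data.

```python
def interpolate(estimates, weights):
    """Weighted average of probability estimates; weights must sum to 1."""
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))

# Estimates of P(prices | context), most specific context first.
estimates = [0.146, 0.052, 0.025, 0.013]
# Hypothetical interpolation weights (assumption, chosen for illustration).
weights = [0.4, 0.3, 0.2, 0.1]

smoothed = interpolate(estimates, weights)
print(f"interpolated estimate = {smoothed:.4f}")
```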

13 of 29

Charniak (1997) shrinkage example

14 of 29

Part-of-speech tagging

A simple but useful form of linguistic analysis

15 of 29

Parts of Speech

Perhaps starting with Aristotle in the West (384–322 BCE), there was the idea of having parts of speech

a.k.a. lexical categories, word classes, “tags”, POS

The idea, still with us, that there are 8 parts of speech comes from Dionysius Thrax of Alexandria (c. 100 BCE)

But actually his 8 aren’t exactly the ones we are taught today

Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection

16 of 29

Open class (lexical) words

Nouns
Proper: IBM, Italy
Common: cat / cats, snow
Verbs
Main: see, registered
Adjectives: old, older, oldest
Adverbs: slowly

Closed class (functional)

Modals: can, had
Prepositions: to, with
Particles: off, up
Determiners: the, some
Conjunctions: and, or
Pronouns: he, its
… more

Numbers: 122,312, one
Interjections: Ow, Eh

17 of 29

Open vs. Closed classes

Closed:

determiners: a, an, the
pronouns: she, he, I
prepositions: on, under, over, near, by, …
Why “closed”?

Open:

Nouns, Verbs, Adjectives, Adverbs.

18 of 29

POS Tagging

Words often have more than one POS: back

The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word.

19 of 29

POS Tagging

Input: Plays well with others
Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
Output: Plays/VBZ well/RB with/IN others/NNS
Uses:

Text-to-speech (how do we pronounce “lead”?)
Can write regexps like (Det) Adj* N+ over the output for phrases, etc.
As input to or to speed up a full parser
If you know the tag, you can back off to it in other tasks
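
The regexp use case can be sketched as follows, assuming Penn Treebank tags. Each tag is collapsed to a one-letter code so that the pattern (Det) Adj* N+ becomes an ordinary regular expression over a string, and match offsets map directly back to token positions. The helper names and the one-letter coding are illustrative choices, not part of any standard tool.

```python
import re

def tag_code(tag):
    """Collapse a Penn Treebank tag to a one-letter code."""
    if tag == "DT":
        return "D"          # determiner
    if tag.startswith("JJ"):
        return "A"          # JJ, JJR, JJS
    if tag.startswith("NN"):
        return "N"          # NN, NNS, NNP, NNPS
    return "O"              # anything else

# (Det) Adj* N+ over the one-letter codes.
NP_PATTERN = re.compile(r"D?A*N+")

def noun_phrases(tagged):
    """Return the word sequences whose tags match (Det) Adj* N+."""
    codes = "".join(tag_code(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]

tagged = [("The", "DT"), ("old", "JJ"), ("back", "NN"), ("door", "NN"),
          ("opened", "VBD"), ("slowly", "RB")]
phrases = noun_phrases(tagged)
print(phrases)
```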

Penn Treebank POS tags

20 of 29

POS tagging performance

How many tags are correct? (Tag accuracy)

About 97% currently
But baseline is already 90%

Baseline is performance of stupidest possible method

Tag every word with its most frequent tag
Tag unknown words as nouns

Partly easy because

Many words are unambiguous
You get points for them (the, a, etc.) and for punctuation marks!
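
The baseline described above can be sketched in a few lines. The training sentences are a tiny made-up sample built from the “back” examples; `train_baseline` and `tag_baseline` are hypothetical names.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_baseline(words, most_frequent_tag, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, most_frequent_tag.get(w, unknown_tag)) for w in words]

# Toy training data (assumption: illustrative only).
training = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("back", "VB"), ("the", "DT"), ("bill", "NN")],
]

model = train_baseline(training)
output = tag_baseline(["back", "the", "zebra"], model)
print(output)
```

Note that this tagger always labels “back” as NN, so it gets “Promised to back the bill” wrong; errors of this kind are what the remaining accuracy gap is made of.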

21 of 29

Deciding on the correct part of speech can be difficult even for people

Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG

All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN

Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

22 of 29

How difficult is POS tagging?

About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech
But they tend to be very common words. E.g., that

I know that he is honest = IN
Yes, that play was nice = DT
You can’t go that far = RB

40% of the word tokens are ambiguous
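
The gap between the type figure and the token figure can be reproduced on a toy sample: one ambiguous but frequent word is enough to make a large share of tokens ambiguous. The sample below is invented and far too small to match the Brown corpus numbers.

```python
from collections import defaultdict

def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word].add(tag)
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    token_rate = sum(1 for w, _ in tagged_tokens if w in ambiguous) / len(tagged_tokens)
    return type_rate, token_rate

# Toy sample (assumption: illustrative only).
sample = [("that", "IN"), ("he", "PRP"), ("is", "VBZ"), ("honest", "JJ"),
          ("that", "DT"), ("play", "NN"), ("that", "RB"), ("far", "RB")]

type_rate, token_rate = ambiguity_rates(sample)
print(f"ambiguous types: {type_rate:.1%}, ambiguous tokens: {token_rate:.1%}")
```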

23 of 29

Sources of information

What are the main sources of information for POS tagging?

Knowledge of neighboring words

Bill	saw	that	man	yesterday
NNP	NN	DT	NN	NN
VB	VB(D)	IN	VB	NN

Knowledge of word probabilities

man is rarely used as a verb….

The latter proves the most useful, but the former also helps

24 of 29

More and Better Features → Feature-based tagger

Can do surprisingly well just looking at a word by itself:

Feature type	Example word: feature → tag
Word	the: the → DT
Lowercased word	Importantly: importantly → RB
Prefixes	unfathomable: un- → JJ
Suffixes	Importantly: -ly → RB
Capitalization	Meridian: CAP → NNP
Word shapes	35-year: d-x → JJ

Then build a maxent (or whatever) model to predict tag

Maxent P(t|w): 93.7% overall / 82.6% unknown
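
A sketch of how the word-internal features listed above might be extracted before being handed to a classifier. The feature naming scheme, the affix length limit, and the shape encoding (digits as d, lowercase as x, uppercase as X, repeats collapsed) are assumptions chosen to reproduce the slide's examples, not a specific tagger's feature set.

```python
def word_shape(word):
    """Collapse a word to its shape, e.g. '35-year' -> 'd-x'."""
    shape = []
    for ch in word:
        if ch.isdigit():
            c = "d"
        elif ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        else:
            c = ch  # keep punctuation such as '-' as is
        if not shape or shape[-1] != c:
            shape.append(c)  # collapse runs of the same class
    return "".join(shape)

def word_features(word, max_affix=3):
    """Features computed from the word alone (no context)."""
    feats = {"word=" + word, "lower=" + word.lower(), "shape=" + word_shape(word)}
    if word[:1].isupper():
        feats.add("CAP")
    # Prefixes and suffixes of length 1..max_affix, shorter than the word.
    for n in range(1, min(max_affix, len(word) - 1) + 1):
        feats.add("prefix=" + word[:n])
        feats.add("suffix=" + word[-n:])
    return feats

for w in ["the", "Importantly", "unfathomable", "Meridian", "35-year"]:
    print(w, sorted(word_features(w)))
```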

25 of 29

Overview: POS Tagging Accuracies

Rough accuracies (all words / unknown words):

Most freq tag: ~90% / ~50%

Trigram HMM: ~95% / ~55%
Maxent P(t|w): 93.7% / 82.6%
TnT (HMM++): 96.2% / 86.0%
MEMM tagger: 96.9% / 86.9%
Bidirectional dependencies: 97.2% / 90.0%
Upper bound: ~98% (human agreement)

Most errors on unknown words

26 of 29

How to improve supervised results?

Build better features!

They	left	as	soon	as	he	arrived	.
PRP	VBD	IN	RB	IN	PRP	VBD	.

The first “as” is tagged IN but should be RB. We could fix this with a feature that looked at the next word

Intrinsic	flaws	remained	undetected	.
NNP	NNS	VBD	VBN	.

“Intrinsic” is tagged NNP but should be JJ. We could fix this by linking capitalized words to their lowercase versions

27 of 29

Tagging Without Sequence Information

Baseline: predict t₀ from w₀ alone

Three Words: predict t₀ from w₋₁, w₀, w₁

Model	Features	Token accuracy	Unknown-word accuracy	Sentence accuracy
Baseline	56,805	93.69%	82.61%	26.74%
3Words	239,767	96.57%	86.78%	48.27%

Using words only in a straight classifier works as well as a basic (HMM or discriminative) sequence model!!
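
A minimal sketch of the input to the Three Words model: the classifier sees the current word and its two neighbours but no neighbouring tags. The feature keys and the boundary symbols `<s>` and `</s>` are hypothetical conventions.

```python
def three_word_features(words, i):
    """Previous, current, and next word for position i, with boundary padding."""
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    return {"w-1": prev_word, "w0": words[i], "w+1": next_word}

words = "They left as soon as he arrived .".split()
first_as = three_word_features(words, 2)
second_as = three_word_features(words, 4)
print(first_as)
print(second_as)
```

The two occurrences of “as” share the same w0 feature but differ in their neighbours (next word “soon” versus “he”), which gives a classifier the evidence it needs to tag them differently.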

28 of 29

Summary of POS Tagging

For tagging, the change from generative to discriminative model does not by itself result in great improvement

One profits from models for specifying dependence on overlapping features of the observation such as spelling, suffix analysis, etc.

An MEMM allows integration of rich features of the observations, but can suffer strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words

This additional power (of the MEMM, CRF, Perceptron models) has been shown to result in improvements in accuracy

The higher accuracy of discriminative models comes at the price of much slower training

29 of 29

Colab Link