1 of 29

Fundamentals of SNLP

Lecture 3

Lexicalized PCFG; POS Tagging

Krishnendu Ghosh

Source: Chris Manning PPTs

2 of 29

Lexicalization of PCFGs

Introduction

Christopher Manning

3 of 29

(Head) Lexicalization of PCFGs [Magerman 1995; Collins 1997; Charniak 1997]

  • The head word of a phrase gives a good representation of the phrase’s structure and meaning
  • Puts the properties of words back into a PCFG

5 of 29

(Head) Lexicalization of PCFGs [Magerman 1995; Collins 1997; Charniak 1997]

  • Word-to-word affinities are useful for certain ambiguities
    • PP attachment is now (partly) captured in a local PCFG rule.
      • Think about: What useful information isn’t captured?

    • Also useful for: coordination scope, verb complement patterns

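[Figure: two parse trees contrasting PP attachment. In “announce RATES FOR January” the PP attaches to the NP (rates … for); in “ANNOUNCE rates IN January” the PP attaches to the VP (announce … in). Capitals mark the word pair whose affinity decides the attachment.]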

6 of 29

Lexicalized parsing was seen as the parsing breakthrough of the late 1990s

  • Eugene Charniak, 2000 JHU workshop: “To do better, it is necessary to condition probabilities on the actual words of the sentence. This makes the probabilities much tighter:

    • p(VP → V NP NP) = 0.00151
    • p(VP → V NP NP | said) = 0.00001
    • p(VP → V NP NP | gave) = 0.01980”

  • Michael Collins, 2003 COLT tutorial: “Lexicalized Probabilistic Context-Free Grammars … perform vastly better than PCFGs (88% vs. 73% accuracy)”

7 of 29

Lexicalization of PCFGs

The model of Charniak (1997)

8 of 29

Charniak (1997)

  • A very straightforward model of a lexicalized PCFG
  • Probabilistic conditioning is top-down like a regular PCFG
    • But actual parsing is bottom-up, somewhat like the CKY algorithm we saw

9 of 29

Charniak (1997) example

10 of 29

Lexicalization models argument selection by sharpening rule expansion probabilities

  • The probability of different verbal complement frames (i.e., subcategorizations) depends on the verb:

Local Tree       come     take     think    want
VP → V           9.5%     2.6%     4.6%     5.7%
VP → V NP        1.1%     32.1%    0.2%     13.9%
VP → V PP        34.5%    3.1%     7.1%     0.3%
VP → V SBAR      6.6%     0.3%     73.0%    0.2%
VP → V S         2.2%     1.3%     4.8%     70.8%
VP → V NP S      0.1%     5.7%     0.0%     0.3%
VP → V PRT NP    0.3%     5.8%     0.0%     0.0%
VP → V PRT PP    6.1%     1.5%     0.2%     0.0%

“monolexical” probabilities
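
Numbers like those in the table can be read as relative-frequency estimates of P(rule | head verb). A minimal sketch, assuming a list of (head verb, VP expansion) pairs read off a treebank; the function name `rule_probs_by_head` and the toy observations are hypothetical and are not data from Charniak (1997):

```python
from collections import Counter, defaultdict

def rule_probs_by_head(observations):
    """Relative-frequency estimate of P(rule | head) from (head, rule) pairs."""
    counts = defaultdict(Counter)
    for head, rule in observations:
        counts[head][rule] += 1
    probs = {}
    for head, rule_counts in counts.items():
        total = sum(rule_counts.values())
        probs[head] = {rule: n / total for rule, n in rule_counts.items()}
    return probs

# Toy observations, made up for illustration only.
observations = (
    [("think", "VP -> V SBAR")] * 3
    + [("think", "VP -> V")]
    + [("take", "VP -> V NP")] * 2
)
probs = rule_probs_by_head(observations)
print(probs["think"])
```

With real treebank counts, each column of the table corresponds to one entry of `probs`.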

11 of 29

Lexicalization sharpens probabilities: Predicting heads

“Bilexical probabilities”

  • P(prices | n-plural) = .013
  • P(prices | n-plural, NP) = .013
  • P(prices | n-plural, NP, S) = .025
  • P(prices | n-plural, NP, S, v-past) = .052
  • P(prices | n-plural, NP, S, v-past, fell) = .146

12 of 29

Charniak (1997) linear interpolation/shrinkage
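
As a generic illustration of linear interpolation (shrinkage), and not Charniak's exact formulation: estimates conditioned on more context are sharper but sparser, so they are mixed with coarser estimates using weights that sum to one. The sketch below reuses the P(prices | …) values from the head-prediction slide; the weights and the name `interpolate` are assumptions chosen for illustration.

```python
def interpolate(estimates, weights):
    """Linearly interpolate probability estimates; weights must sum to 1."""
    if len(estimates) != len(weights):
        raise ValueError("need one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))

# P(prices | ...) from the coarsest to the most specific conditioning context.
estimates = [0.013, 0.013, 0.025, 0.052, 0.146]
# Hypothetical weights; such weights are typically tuned on held-out data.
weights = [0.1, 0.1, 0.2, 0.2, 0.4]
smoothed = interpolate(estimates, weights)
print(round(smoothed, 4))
```

Putting more weight on the specific estimates keeps the sharpening effect of lexicalization, while the coarse estimates guard against zero or unreliable counts.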

13 of 29

Charniak (1997) shrinkage example

14 of 29

Part-of-speech tagging

A simple but useful form of linguistic analysis

15 of 29

Parts of Speech

  • Perhaps starting with Aristotle in the West (384–322 BCE), there was the idea of having parts of speech
    • a.k.a. lexical categories, word classes, “tags”, POS
  • The idea, still with us, that there are 8 parts of speech comes from Dionysius Thrax of Alexandria (c. 100 BCE)
    • But actually his 8 aren’t exactly the ones we are taught today
      • Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
      • School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection

16 of 29

Open class (lexical) words

  • Nouns
    • Proper: IBM, Italy
    • Common: cat / cats, snow
  • Verbs
    • Main: see, registered
  • Adjectives: old, older, oldest
  • Adverbs: slowly
  • Numbers: 122,312, one
  • … more

Closed class (functional)

  • Modals: can, had
  • Prepositions: to, with
  • Particles: off, up
  • Determiners: the, some
  • Conjunctions: and, or
  • Pronouns: he, its
  • Interjections: Ow, Eh
  • … more

17 of 29

Open vs. Closed classes

  • Open vs. Closed classes
    • Closed:
      • determiners: a, an, the
      • pronouns: she, he, I
      • prepositions: on, under, over, near, by, …
      • Why closed?
    • Open:
      • Nouns, Verbs, Adjectives, Adverbs.

18 of 29

POS Tagging

  • Words often have more than one POS: back
    • The back door = JJ
    • On my back = NN
    • Win the voters back = RB
    • Promised to back the bill = VB
  • The POS tagging problem is to determine the POS tag for a particular instance of a word.

19 of 29

POS Tagging

  • Input: Plays well with others
  • Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
  • Output: Plays/VBZ well/RB with/IN others/NNS
  • Uses:
    • Text-to-speech (how do we pronounce “lead”?)
    • Can write regexps like (Det) Adj* N+ over the output for phrases, etc.
    • As input to or to speed up a full parser
    • If you know the tag, you can back off to it in other tasks
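
The regexp use case can be sketched as follows. This is one possible encoding, assuming Penn Treebank tags: each tag is mapped to a one-letter class so that the pattern (Det) Adj* N+ becomes the ordinary regular expression `D?A*N+`. The helper names `tag_class` and `noun_phrases` are hypothetical.

```python
import re

def tag_class(tag):
    """Map a Penn Treebank tag to a one-letter class for pattern matching."""
    if tag == "DT":
        return "D"
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "A"
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "N"
    return "O"                 # anything else

def noun_phrases(tagged):
    """Return word spans whose tags match (Det) Adj* N+."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"D?A*N+", classes)
    ]

tagged = [("The", "DT"), ("back", "JJ"), ("door", "NN"), ("opened", "VBD"),
          ("with", "IN"), ("a", "DT"), ("loud", "JJ"), ("creak", "NN")]
print(noun_phrases(tagged))  # ['The back door', 'a loud creak']
```

Because one character stands for one token, match offsets in the class string index directly into the token list.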

Penn Treebank POS tags

20 of 29

POS tagging performance

  • How many tags are correct? (Tag accuracy)
    • About 97% currently
    • But baseline is already 90%
      • Baseline is performance of stupidest possible method
        • Tag every word with its most frequent tag
        • Tag unknown words as nouns
    • Partly easy because
      • Many words are unambiguous
      • You get points for them (the, a, etc.) and for punctuation marks!
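
The baseline described above fits in a few lines. A minimal sketch, assuming training data given as lists of (word, tag) pairs; the function names and the toy training sentences are hypothetical:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Record the most frequent tag of every word seen in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_baseline(words, most_frequent, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, most_frequent.get(w, unknown_tag)) for w in words]

# Toy training data, made up for illustration only.
train = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("my", "PRP$"), ("back", "NN"), ("hurts", "VBZ")],
]
model = train_baseline(train)
print(tag_baseline(["the", "back", "fence"], model))
```

Here "back" is tagged NN because that is its majority tag in the toy data, and the unseen word "fence" falls back to NN.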

21 of 29

Deciding on the correct part of speech can be difficult even for people

  • Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG

  • All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN

  • Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

22 of 29

How difficult is POS tagging?

  • About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech
  • But they tend to be very common words. E.g., that
    • I know that he is honest = IN
    • Yes, that play was nice = DT
    • You can’t go that far = RB
  • 40% of the word tokens are ambiguous
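
The gap between the type figure and the token figure arises because the few ambiguous types are frequent. A small sketch of how both rates could be computed from a tagged corpus; the function name `ambiguity_rates` is hypothetical, and apart from the three tags given for "that", the tags in the toy corpus are assumptions:

```python
from collections import defaultdict

def ambiguity_rates(tagged_tokens):
    """Return (fraction of ambiguous word types, fraction of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word].add(tag)
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    token_rate = sum(1 for w, _ in tagged_tokens if w in ambiguous) / len(tagged_tokens)
    return type_rate, token_rate

# The three example sentences for "that" (tags other than those of "that" are assumed).
corpus = [
    ("I", "PRP"), ("know", "VBP"), ("that", "IN"), ("he", "PRP"), ("is", "VBZ"), ("honest", "JJ"),
    ("Yes", "UH"), (",", ","), ("that", "DT"), ("play", "NN"), ("was", "VBD"), ("nice", "JJ"),
    ("You", "PRP"), ("ca", "MD"), ("n't", "RB"), ("go", "VB"), ("that", "RB"), ("far", "RB"),
]
type_rate, token_rate = ambiguity_rates(corpus)
print(type_rate, token_rate)
```

In this toy corpus only 1 of 16 types is ambiguous, yet it accounts for 3 of 18 tokens.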

23 of 29

Sources of information

  • What are the main sources of information for POS tagging?
    • Knowledge of neighboring words
      • Bill saw that man yesterday
      • NNP NN DT NN NN
      • VB VB(D) IN VB NN
    • Knowledge of word probabilities
      • man is rarely used as a verb….
  • The latter proves the most useful, but the former also helps

24 of 29

More and Better Features → Feature-based tagger

  • Can do surprisingly well just looking at a word by itself:
    • Word (the): the → DT
    • Lowercased word (Importantly): importantly → RB
    • Prefixes (unfathomable): un- → JJ
    • Suffixes (Importantly): -ly → RB
    • Capitalization (Meridian): CAP → NNP
    • Word shapes (35-year): d-x → JJ
  • Then build a maxent (or whatever) model to predict tag
    • Maxent P(t|w): 93.7% overall / 82.6% unknown
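
The word-internal features listed above could be extracted along these lines before being handed to a classifier. This is a sketch under assumptions: the feature names, the fixed affix length of 2, and the collapsed word-shape encoding (digits → d, lowercase → x, uppercase → X) are illustrative choices, not the exact feature set behind the reported numbers.

```python
from itertools import groupby

def word_shape(word):
    """Collapsed shape: digits -> d, lowercase -> x, uppercase -> X, rest kept."""
    mapped = []
    for ch in word:
        if ch.isdigit():
            mapped.append("d")
        elif ch.islower():
            mapped.append("x")
        elif ch.isupper():
            mapped.append("X")
        else:
            mapped.append(ch)
    # Collapse runs of the same symbol, e.g. "dd-xxxx" -> "d-x".
    return "".join(key for key, _ in groupby(mapped))

def word_features(word, affix_len=2):
    """Features computed from the word alone, with no context."""
    return {
        "word": word,
        "lower": word.lower(),
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
        "is_capitalized": word[:1].isupper(),
        "shape": word_shape(word),
    }

print(word_features("35-year"))
```

A dictionary of this kind can be fed to any feature-based classifier that predicts the tag from the word in isolation.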

25 of 29

Overview: POS Tagging Accuracies

  • Rough accuracies (overall / unknown words):
    • Most freq tag: ~90% / ~50%
    • Trigram HMM: ~95% / ~55%
    • Maxent P(t|w): 93.7% / 82.6%
    • TnT (HMM++): 96.2% / 86.0%
    • MEMM tagger: 96.9% / 86.9%
    • Bidirectional dependencies: 97.2% / 90.0%
    • Upper bound: ~98% (human agreement)

Most errors on unknown words

26 of 29

How to improve supervised results?

  • Build better features!

They  left  as  soon  as  he   arrived  .
PRP   VBD   IN  RB    IN  PRP  VBD      .
(tagger error: the first “as” is tagged IN; the correct tag is RB)

    • We could fix this with a feature that looked at the next word

Intrinsic  flaws  remained  undetected  .
NNP        NNS    VBD       VBN         .
(tagger error: “Intrinsic” is tagged NNP; the correct tag is JJ)

    • We could fix this by linking capitalized words to their lowercase versions

27 of 29

Tagging Without Sequence Information

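[Figure: two models. Baseline: tag t0 predicted from the word w0 alone. Three Words: tag t0 predicted from w-1, w0, and w1.]

Model      Features   Token accuracy   Unknown-word accuracy   Sentence accuracy
Baseline   56,805     93.69%           82.61%                  26.74%
3Words     239,767    96.57%           86.78%                  48.27%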

Using words only in a straight classifier works as well as a basic (HMM or discriminative) sequence model!!
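
The input to such a three-word classifier might look like the sketch below. It assumes hypothetical boundary symbols `<S>` and `</S>` and a hypothetical function name `window_features`; it shows only the feature extraction, not the classifier itself.

```python
def window_features(words, i, pad_left="<S>", pad_right="</S>"):
    """Features for tagging words[i]: the word itself plus its two neighbours."""
    return {
        "w-1": words[i - 1] if i > 0 else pad_left,
        "w0": words[i],
        "w+1": words[i + 1] if i + 1 < len(words) else pad_right,
    }

words = "They left as soon as he arrived .".split()
print(window_features(words, 2))  # first "as": next word is "soon"
print(window_features(words, 4))  # second "as": next word is "he"
```

The two occurrences of "as" share the same `w0` but differ in `w+1`, which is the kind of evidence a next-word feature gives a classifier without any tag-sequence model.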

28 of 29

Summary of POS Tagging

For tagging, the change from generative to discriminative model does not by itself result in great improvement

One profits from models for specifying dependence on overlapping features of the observation such as spelling, suffix analysis, etc.

An MEMM allows integration of rich features of the observations, but can suffer strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words

This additional power (of the MEMM, CRF, and perceptron models) has been shown to result in improvements in accuracy

The higher accuracy of discriminative models comes at the price of much slower training

29 of 29