�Part-Of-Speech Tagging,�Sequence Labeling, and�Hidden Markov Models (HMMs)
1
1
2
The/? grand/? jury/? commented/? on/? a/? number/? of/? other/? topics/? ./..
Word/ POS-tag
3
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./..
Part Of Speech Tagging
4
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./..
English POS Tagsets
5
Penn Treebank Tagset
6
English Parts of Speech
7
English Parts of Speech (cont.)
8
Closed vs. Open Class
9
Ambiguity in POS Tagging
10
POS Tagging Process
11
Overview
12
POS Tagging Approaches
13
Classification Learning
14
Beyond Classification Learning
15
Sequence Labeling Problem
16
foo bar blam zonk zonk bar blam
Information Extraction
people organizations places
make model year mileage price
17
Semantic Role Labeling
agent patient source destination instrument
18
Bioinformatics
extron intron
19
Part-of-speech (POS) tagging with Hidden Markov Model (HMM)
20
Collect Data
Labelled
Done!
Create Look up table
Tag our tagset sentence
HMM Learning
21
Evaluating Taggers
22
Data
<Corpora type="Monolingual-POS-TAGGED" Language="Hindi">
<Sentence id=1>
पूर्ण_JJ प्रतिबंध_NN हटाओ_VFM :_SYM इराक_NNP
</Sentence>
<Sentence id=2>
संयुक्त_NNC राष्ट्र_NN ।_SYM
</Sentence>
<Sentence id=3>
इराक_NNP के_PREP विदेश_NNC मंत्री_NN ने_PREP अमरीका_NNP के_PREP उस_PRP प्रस्ताव_NN का_PREP मजाक_NVB उड़ाया_VFM है_VAUX ,_PUNC जिसमें_PRP अमरीका_NNP ने_PREP संयुक्त_NNC राष्ट्र_NN के_PREP प्रतिबंधों_NN को_PREP इराकी_JJ नागरिकों_NN के_PREP लिए_PREP कम_INTF हानिकारक_JJ बनाने_VNN के_PREP लिए_PREP कहा_VFM है_VAUX ।_PUNC
</Sentence>
<Sentence id=4>
विदेश_NNC मंत्री_NN का_PREP कहना_VFM है_VAUX कि_CC चूंकि_CC बगदाद_NNP संयुक्त_NNC राष्ट्र_NN की_PREP मांगों_NN का_PREP पालन_NVB करते_VFM हुए_VAUX अपने_PRP भारी_JJ विनाशकारी_JJ हथियारों_NN को_PREP नष्ट_JVB कर_VFM रहा_VAUX है_VAUX ।_PUNC
</Sentence>
<Sentence id=5>
लिहाजा_CC प्रतिबंधों_NN को_PREP पूर्ण_JJ रूप_NN से_PREP उठा_VFM दिया_VAUX जाना_VAUX चाहिए_VAUX ।_PUNC
</Sentence>
23
24
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
25
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
26
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
this won't work all the time!
Look up tables won't work if a single word can have multiple tags i.e. it won't consider ant context
27
Context?
How do we consider the context… ?
28
Bigram ...
29
30
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
Solve?
31
32
33
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
34
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
35
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
36
37
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
38
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
39
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
40
41
42
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
43
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
44
reference: https://www.youtube.com/watch?v=IGt8HPRARS0
Problems with Sequence Labeling as Classification
45
Probabilistic Sequence Models
46
Markov Model / Markov Chain
47
Sample Markov Model for POS
48
0.95
0.9
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
start
0.1
0.5
0.4
Det
Noun
PropNoun
Verb
Sample Markov Model for POS
49
0.95
0.9
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
start
0.1
0.5
0.4
Det
Noun
PropNoun
Verb
P(PropNoun Verb Det Noun) = 0.4*0.8*0.25*0.95*0.1=0.0076
Hidden Markov Model
50
Sample HMM for POS
51
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
start
0.1
0.5
0.4
Sample HMM Generation
52
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
start
0.1
0.5
0.4
Sample HMM Generation
53
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
start
0.1
0.5
0.4
Sample HMM Generation
54
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John
start
0.1
0.5
0.4
Sample HMM Generation
55
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John
start
0.1
0.5
0.4
Sample HMM Generation
56
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John bit
start
0.1
0.5
0.4
Sample HMM Generation
57
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John bit
start
0.1
0.5
0.4
Sample HMM Generation
58
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John bit the
start
0.1
0.5
0.4
Sample HMM Generation
59
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John bit the
start
0.1
0.5
0.4
Sample HMM Generation
60
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John bit the apple
start
0.1
0.5
0.4
Sample HMM Generation
61
PropNoun
John
Mary
Alice
Jerry
Tom
Noun
cat
dog
car
pen
bed
apple
Det
a
the
the
the
that
a
the
a
Verb
bit
ate
saw
played
hit
0.95
0.9
gave
0.05
stop
0.5
0.1
0.8
0.1
0.1
0.25
0.25
John bit the apple
start
0.1
0.5
0.4
Formal Definition of an HMM
62
HMM Generation Procedure
63
Set initial state q1=s0
For t = 1 to T
Transit to another state qt+1=sj based on transition
distribution aij for state qt
Pick an observation ot=vk based on being in state qt using
distribution bqt(k)
Three Useful HMM Tasks
64
HMM: Observation Likelihood
65
Sequence Classification
66
Austin
Boston
?
?
P(O | Austin) > P(O | Boston) ?
ah s t e n
O
Most Likely Sequence
67
Ordinary English
dice precedent core
vice president Gore
O1
O2
?
?
P(O2 | OrdEnglish) > P(O1 | OrdEnglish) ?
HMM: Observation Likelihood�Naïve Solution
68
HMM: Observation Likelihood�Efficient Solution
69
Forward Trellis
70
s1
s2
sN
∙
∙
∙
∙
∙
∙
s0
sF
∙
∙
∙
∙
∙
∙
∙
∙
∙
∙ ∙ ∙
∙ ∙ ∙
∙ ∙ ∙
∙ ∙ ∙
t1
t2
t3
tT-1
tT
Forward Probabilities
71
Forward Step
72
s1
s2
sN
∙
∙
∙
sj
αt-1(i)
αt(i)
a1j
a2j
aNj
a2j
Computing the Forward Probabilities
73
Forward Computational Complexity
74
Most Likely State Sequence (Decoding)
75
John gave the dog an apple.
Most Likely State Sequence
76
John gave the dog an apple.
Det Noun PropNoun Verb
Most Likely State Sequence
77
John gave the dog an apple.
Det Noun PropNoun Verb
Most Likely State Sequence
78
John gave the dog an apple.
Det Noun PropNoun Verb
Most Likely State Sequence
79
John gave the dog an apple.
Det Noun PropNoun Verb
Most Likely State Sequence
80
John gave the dog an apple.
Det Noun PropNoun Verb
Most Likely State Sequence
81
John gave the dog an apple.
Det Noun PropNoun Verb
HMM: Most Likely State Sequence�Efficient Solution
82
Viterbi Scores
83
Computing the Viterbi Scores
84
Analogous to Forward algorithm except take max instead of sum
Computing the Viterbi Backpointers
85
Final state in the most probable state sequence. Follow
backpointers to initial state to construct full sequence.
Viterbi Backpointers
86
s1
s2
sN
∙
∙
∙
∙
∙
∙
s0
sF
∙
∙
∙
∙
∙
∙
∙
∙
∙
∙ ∙ ∙
∙ ∙ ∙
∙ ∙ ∙
∙ ∙ ∙
t1
t2
t3
tT-1
tT
Viterbi Backtrace
87
s1
s2
sN
∙
∙
∙
∙
∙
∙
s0
sF
∙
∙
∙
∙
∙
∙
∙
∙
∙
∙ ∙ ∙
∙ ∙ ∙
∙ ∙ ∙
∙ ∙ ∙
t1
t2
t3
tT-1
tT
Most likely Sequence: s0 sN s1 s2 …s2 sF
HMM Learning
88
Supervised HMM Training
89
Supervised
HMM
Training
John ate the apple
A dog bit Mary
Mary hit the dog
John gave Mary the cat.
.
.
.
Training Sequences
Det Noun PropNoun Verb
Supervised Parameter Estimation
90
Learning and Using HMM Taggers
91
Evaluating Taggers
92
Unsupervised �Maximum Likelihood Training
93
ah s t e n
a s t i n
oh s t u n
eh z t en
.
.
.
Training Sequences
HMM
Training
Austin
Maximum Likelihood Training
94
Bayes Theorem
Simple proof from definition of conditional probability:
QED:
(Def. cond. prob.)
(Def. cond. prob.)
Maximum Likelihood vs. �Maximum A Posteriori (MAP)
HMM: Maximum Likelihood Training�Efficient Solution
97
EM Algorithm
98
EM
99
Unlabeled Examples
+
−
+
−
+
−
+
−
−
+
Assign random probabilistic labels to unlabeled data
Initialize:
EM
100
Prob. Learner
+
−
+
−
+
−
+
−
−
+
Give soft-labeled training data to a probabilistic learner
Initialize:
EM
101
Prob. Learner
Prob.
Classifier
+
−
+
−
+
−
+
−
−
+
Produce a probabilistic classifier
Initialize:
EM
102
Prob. Learner
Prob.
Classifier
Relabel unlabled data using the trained classifier
+
−
+
−
+
−
+
−
−
+
E Step:
EM
103
Prob. Learner
+
−
+
−
+
−
+
−
−
+
Prob.
Classifier
Continue EM iterations until probabilistic labels on unlabeled data converge.
Retrain classifier on relabeled data
M step:
Sketch of Baum-Welch (EM) Algorithm �for Training HMMs
104
Assume an HMM with N states.
Randomly set its parameters λ=(A,B)
(making sure they represent legal distributions)
Until converge (i.e. λ no longer changes) do:
E Step: Use the forward/backward procedure to
determine the probability of various possible
state sequences for generating the training data
M Step: Use these probability estimates to
re-estimate values for all of the parameters λ
See textbook for detailed equations for E and M steps
EM Properties
Semi-Supervised Learning
Semi-Supervised EM
107
Training Examples
-
-
+
+
+
Unlabeled Examples
Prob. Learner
Prob.
Classifier
+
−
+
−
+
−
+
−
−
+
Semi-Supervised EM
108
Training Examples
-
-
+
+
+
Prob. Learner
+
−
+
−
+
−
+
−
−
+
Prob.
Classifier
Semi-Supervised EM
109
Training Examples
-
-
+
+
+
Prob. Learner
+
−
+
−
+
−
+
−
−
+
Prob.
Classifier
Semi-Supervised EM
110
Training Examples
-
-
+
+
+
Unlabeled Examples
Prob. Learner
Prob.
Classifier
+
−
+
−
+
−
+
−
−
+
Semi-Supervised EM
111
Training Examples
-
-
+
+
+
Prob. Learner
+
−
+
−
+
−
+
−
−
+
Prob.
Classifier
Continue retraining iterations until probabilistic labels on unlabeled data converge.
Semi-Supervised Results
Conclusions