Neural network based dependency parsing
Deniz Yuret, July 26, 2017
Outline
Parsing
Phrase Structure
Dependency
Graph Based
Transition Based
Phrase structure parsing
From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.
Dependency parsing
From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.
Graph based dependency parsing
From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.
Transition based dependency parsing
[Figure sequence: step-by-step derivation for "Economic news had little effect on financial markets", showing the stack, buffer, and arc set after each transition.]
Transitions applied, in order: SH, LA-ATT, SH, LA-SBJ, SH, SH, LA-ATT, SH, SH, SH, LA-ATT, SH, RA-PC, RA-ATT, RA-OBJ
Resulting arcs: news -ATT-> Economic, had -SBJ-> news, effect -ATT-> little, markets -ATT-> financial, on -PC-> markets, effect -ATT-> on, had -OBJ-> effect
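The slides do not spell out the transition system, but the sequence shown is consistent with an arc-hybrid reading: SH shifts the next buffer word onto the stack, LA attaches the stack top to the buffer front, and RA attaches the stack top to the item below it. A minimal sketch replaying the derivation, with all names mine:

```python
def parse(words, transitions):
    """Replay a transition sequence (arc-hybrid reading) and collect arcs."""
    stack, buf, arcs = [], list(words), []
    for t in transitions:
        if t == "SH":                        # shift: move next buffer word onto the stack
            stack.append(buf.pop(0))
        elif t.startswith("LA-"):            # left-arc: buffer front becomes head of stack top
            arcs.append((buf[0], t[3:], stack.pop()))
        elif t.startswith("RA-"):            # right-arc: item below top becomes head of top
            dep = stack.pop()
            arcs.append((stack[-1], t[3:], dep))
    return arcs

sentence = "Economic news had little effect on financial markets".split()
moves = ["SH", "LA-ATT", "SH", "LA-SBJ", "SH", "SH", "LA-ATT", "SH", "SH",
         "SH", "LA-ATT", "SH", "RA-PC", "RA-ATT", "RA-OBJ"]
arcs = parse(sentence, moves)
# arcs contains e.g. ('news', 'ATT', 'Economic') and ('had', 'OBJ', 'effect')
```

After the final RA-OBJ, only "had" remains on the stack, which is the root of the tree.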
Parsing model: decides correct transition based on state
[Figure: the parser state (stack, buffer, arcs) for the example sentence feeds a feature extractor, whose output goes to a decision module that scores the possible transitions: SH, LA-ATT, LA-SBJ, RA-ATT, RA-SBJ, ...]
Why neural networks?
Linear (pre-2014) vs nonlinear parsing models
[Figure: linear vs. nonlinear decision boundaries]
Linear models cannot handle feature conjunctions
[Figure: XOR-like configuration over input 1 and input 2 that no linear separator can classify]
Linear models cannot handle feature conjunctions
Example: VERB ∈ {eat, drink}, OBJECT ∈ {water, cake}. Whether a verb-object pair is natural (drink water, eat cake) depends on the conjunction of the two features, not on either feature alone.
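The point can be made concrete with a toy perceptron: on the XOR-like verb-object data, a linear model over the two basic features can never exceed 3/4 accuracy, while adding an explicit conjunction feature makes the data linearly separable. A sketch (the feature encoding is mine):

```python
# Features: (is_drink, is_water). Label 1 for the natural pairings.
data = [((0, 0), 1),   # eat  cake  -> natural
        ((0, 1), 0),   # eat  water -> odd
        ((1, 0), 0),   # drink cake -> odd
        ((1, 1), 1)]   # drink water -> natural

def perceptron(samples, epochs=100):
    """Train a perceptron; return the learned classifier."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:                  # standard perceptron update
                w = [wi + (y - pred) * xi for wi, xi in zip(w, x)]
                b += y - pred
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

linear = perceptron(data)
acc_linear = sum(linear(x) == y for x, y in data) / 4     # XOR-like: at most 3/4

# Add the conjunction feature is_drink AND is_water -> now separable.
conj = [((x[0], x[1], x[0] * x[1]), y) for x, y in data]
nonlin = perceptron(conj)
acc_conj = sum(nonlin(x) == y for x, y in conj) / 4       # reaches 4/4
```

This is exactly what the ad-hoc ZN72 conjunction templates do by hand, and what a hidden layer learns automatically.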
Basic features for a linear parser model (Zhang & Nivre 2011)
[Figure: parser state with positions labeled S0 (stack top), N0, N1, N2 (front of buffer), and their children S0l, N0l]
stack word: S0="had"
buffer words: N0="effect", N1="on", N2="financial"
parents/children: S0l="news", N0l="little", S0h=null, S0l2=null, S0r=null
word/postag: S0w="had", S0p="VBD"
distance(S0,N0): d=2
valence: S0vl=1, S0vr=0
dependency label: S0ll="SBJ", N0ll="ATT"
label set: S0sl={SBJ}, S0sr={}
39 basic features (ZN39)
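A hedged sketch of how such features could be computed from a parser state (the state representation and helper names are mine, and only a subset of the 39 templates is shown):

```python
def extract_features(stack, buf, arcs, pos, idx):
    """Illustrative subset of ZN39-style basic features.
    arcs: (head, label, dependent) triples; pos: word -> POS tag;
    idx: word -> sentence position (for the distance feature)."""
    s0, n0 = stack[-1], buf[0]
    def left_children(h):
        return [d for hh, _, d in arcs if hh == h and idx[d] < idx[h]]
    def right_children(h):
        return [d for hh, _, d in arcs if hh == h and idx[d] > idx[h]]
    def label_of(d):
        return next((l for _, l, dd in arcs if dd == d), None)
    lc_s0, rc_s0 = left_children(s0), right_children(s0)
    return {
        "S0w": s0, "S0p": pos[s0],
        "N0w": n0, "N1w": buf[1], "N2w": buf[2],
        "S0l": lc_s0[0] if lc_s0 else None,           # leftmost child of S0
        "N0l": (left_children(n0) or [None])[0],
        "d": idx[n0] - idx[s0],                       # S0-N0 distance
        "S0vl": len(lc_s0), "S0vr": len(rc_s0),       # valence
        "S0ll": label_of(lc_s0[0]) if lc_s0 else None,
        "S0sl": {label_of(c) for c in lc_s0},         # label set
    }

sentence = "Economic news had little effect on financial markets".split()
idx = {w: i for i, w in enumerate(sentence)}
pos = {"had": "VBD"}  # only the tag this example needs
arcs = [("news", "ATT", "Economic"), ("had", "SBJ", "news"), ("effect", "ATT", "little")]
f = extract_features(["had"], ["effect", "on", "financial", "markets"], arcs, pos, idx)
```

On this state the sketch reproduces the values from the slide: S0w="had", N0l="little", d=2, S0ll="SBJ".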
Nonlinearity through ad-hoc feature conjunctions
Yue Zhang and Joakim Nivre. Transition-based Dependency Parsing with Rich Non-local Features. ACL 2011.
72 conjunction features (ZN72), derived from the 39 basic features (ZN39)
[Figure: accuracy of a linear model with conjunction features vs. a linear model with basic features only]
Neural networks can handle feature conjunctions and nonlinearity, but...
Neural nets are impractical for high-dimensional sparse inputs: the first layer scales linearly with the input dimension in both time and space (for a fixed number of hidden units).
Solution: using dense embeddings for input features
Sparse word vectors (one-hot): a single 1 in a vocabulary-sized vector of zeros, e.g.
dog = [0, 0, 0, 0, 1, 0, 0, ..., 0]
cat = [0, 0, 0, 0, 0, 0, 0, ..., 1, ..., 0]
Dense word vectors: low-dimensional real-valued vectors, e.g.
dog = [-0.0406, -0.2791, 0.0816, 0.7720, -0.0614, -0.5068, 0.0521, -0.0310, -0.0039, 0.2308, ...]
cat = [-0.0565, -0.2136, 0.1207, 0.8170, -0.0041, -0.6328, 0.1004, -0.0002, 0.0066, 0.3970, ...]
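The reason dense inputs help is that multiplying a one-hot vector by an embedding matrix reduces to a row lookup, so the vocabulary-sized matrix-vector product is never actually needed. A numpy sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 50_000, 300                 # vocabulary size, embedding dimension (illustrative)
E = rng.standard_normal((V, D))    # embedding matrix, one row per word

word_id = 12345
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

dense_via_matmul = one_hot @ E     # O(V*D): multiplies a mostly-zero vector
dense_via_lookup = E[word_id]      # O(D): row lookup, the same vector
```

Downstream layers then see a D-dimensional dense input instead of a V-dimensional sparse one.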
Model
Problem: decide correct move based on state
[Figure: the parser state (stack, buffer, arcs) for the example sentence feeds a feature extractor, whose output goes to a decision module that scores the possible transitions: SH, LA-ATT, LA-SBJ, RA-ATT, RA-SBJ, ...]
Model components
Character based LSTM generates word vectors
Word vector for “Economic”
(*) Single layer LSTM with 350 hidden units preinitialized in a language model.
Word based BLSTM generates context vectors
Word vector for “Economic”
Context vector for “Economic” = hf1 ⨁ hb9
(*) Single layer BLSTM with 300+300 hidden units preinitialized in a language model.
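How the forward and backward states combine into a context vector can be sketched with a plain tanh RNN standing in for the LSTM cell (all sizes and names here are illustrative, not the model's actual 300+300 configuration):

```python
import numpy as np

def rnn_states(xs, W, U, b):
    """Run a simple tanh RNN over xs, returning the hidden state at each step.
    (Stand-in for the LSTM on the slide; the concatenation logic is the same.)"""
    h, out = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        out.append(h)
    return out

rng = np.random.default_rng(0)
d_in, d_hid, n_words = 8, 5, 9      # toy sizes for a 9-token sentence
xs = [rng.standard_normal(d_in) for _ in range(n_words)]
Wf, Uf, bf = (rng.standard_normal((d_hid, d_in)),
              rng.standard_normal((d_hid, d_hid)), rng.standard_normal(d_hid))
Wb, Ub, bb = (rng.standard_normal((d_hid, d_in)),
              rng.standard_normal((d_hid, d_hid)), rng.standard_normal(d_hid))

hf = rnn_states(xs, Wf, Uf, bf)               # forward: hf[i] summarizes words 0..i
hb = rnn_states(xs[::-1], Wb, Ub, bb)[::-1]   # backward: hb[i] summarizes words i..n-1
context = [np.concatenate([f, b]) for f, b in zip(hf, hb)]
# context[0] corresponds to the slide's hf1 (+) hb9 for the first word
```

Each context vector thus sees the whole sentence: its left context through the forward state and its right context through the backward state.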
Feature extractor describes current state
BUFFER
STACK
(*) Each state represented by a 4664 dimensional vector.
Decision module decides the next transition
4664 input dimensions
2048 hidden units
softmax layer with 73 transitions
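With the dimensions from the slide, the decision module could be sketched as follows (weights are random here, and the activation is an assumption: Chen & Manning used a cube activation, which the slide does not specify, so plain tanh is used):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n_trans = 4664, 2048, 73     # dimensions from the slide

# Parameters (random in this sketch; trained in the real model).
W1, b1 = rng.standard_normal((d_hid, d_in)) * 0.01, np.zeros(d_hid)
W2, b2 = rng.standard_normal((n_trans, d_hid)) * 0.01, np.zeros(n_trans)

def decide(state_vec):
    """Map a parser-state vector to a distribution over the 73 transitions."""
    h = np.tanh(W1 @ state_vec + b1)          # hidden layer
    logits = W2 @ h + b2
    p = np.exp(logits - logits.max())         # numerically stable softmax
    return p / p.sum()

probs = decide(rng.standard_normal(d_in))
best = int(np.argmax(probs))                  # index of the predicted transition
```

At parse time the highest-probability legal transition is applied and the state is updated, repeating until the buffer is empty.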
Experiments
Dataset
CoNLL 2017 Results (all treebanks LAS)
[Table: LAS scores per treebank]
Relative contributions of part-of-speech (p), word vector (v), context vector (c)
Relative contributions of part-of-speech (p), word vector (v), context vector (c)
Our BiLSTM language model word vectors perform better than FB vectors.
Relative contributions of part-of-speech (p), word vector (v), context vector (c)
Both POS tags and context vectors have significant contributions on top of word vectors.
Relative contributions of part-of-speech (p), word vector (v), context vector (c)
Context vectors provide an independent contribution on top of POS tags.
Related work
Embedding generation
[Figure: taxonomy of embedding vectors]
Binary (one-hot), e.g. [1, 0, 0], vs. dense continuous, e.g. [0.2, 0.1, 0.12]
Initialization: random, or transferred from a related task
During training: fixed, or finetuned
Feature extraction
From: Miguel Ballesteros. 2015. Transition based dependency parsing. Algorithms for NLP Lecture Slides.
Decision module
Decision module - MLP
[Figure: feed-forward network applied independently at each step, mapping each input x1..x4 to an output y1..y4]
Introduced in Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740-750. Also used in our model.
xt: parser state, yt: transition
Decision module - LSTM
[Figure: recurrent network mapping the input sequence x1..x4 to outputs y1..y4, with hidden state carried across steps]
Adhiguna Kuncoro, Yuichiro Sawai, Kevin Duh, and Yuji Matsumoto. 2016. Dependency parsing with LSTMs: An empirical evaluation. CoRR abs/1604.06529.
Decision module - Stack LSTM
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory.
Decision module - Stack LSTM
[Figure sequence: stack LSTM operations on inputs x1..x4 and outputs y1..y4. PUSH computes a new state on top of the state under the stack pointer and advances the pointer; POP moves the pointer back to the previous state without discarding anything, so a later PUSH branches from the new top.]
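The pointer mechanics above can be sketched with a persistent list of states; a toy additive update stands in for the LSTM cell (class and method names are mine):

```python
class StackRNN:
    """Persistent stack of hidden states with a movable pointer.
    PUSH computes a new state from the state under the pointer; POP just
    moves the pointer back, so earlier states are never recomputed."""
    def __init__(self, step, h0):
        self.step = step            # h_new = step(h_top, x): stand-in for the LSTM cell
        self.states = [h0]          # all states ever computed are kept
        self.parent = [None]        # back-pointers forming a tree of states
        self.ptr = 0                # stack pointer = index of current top

    def push(self, x):
        h = self.step(self.states[self.ptr], x)
        self.states.append(h)
        self.parent.append(self.ptr)
        self.ptr = len(self.states) - 1
        return h

    def pop(self):
        self.ptr = self.parent[self.ptr]   # move pointer back; nothing is discarded
        return self.states[self.ptr]

# Toy cell: accumulate inputs (a real stack LSTM applies an LSTM update here).
s = StackRNN(step=lambda h, x: h + x, h0=0)
s.push(1); s.push(2)        # top summarizes [1, 2]
s.pop()                     # pointer back to the state after pushing 1
s.push(5)                   # branches from that state
top = s.states[s.ptr]
```

Because popped states survive, the summary at the top of the stack always reflects exactly the items currently on it, which is what the parser's decision module reads.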
Decision module - Stack LSTM Parser
From: Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory.
Training variations
Training variations
Summary
Questions?
References