1 of 72

Neural network based dependency parsing

Deniz Yuret, July 26, 2017

2 of 72

Outline

  • Parsing
  • Neural networks
  • A neural network parser
  • Experiments
  • Related work

3 of 72

Parsing

4 of 72

Parsing

Phrase Structure

Dependency

Graph Based

Transition Based

5 of 72

Phrase structure parsing

From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.

6 of 72

Dependency parsing

From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.

7 of 72

Graph based dependency parsing

From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.

8 of 72

Transition based dependency parsing

  • Transition System: An abstract machine with a set of configurations (states) and transitions. We use the ArcHybrid transition system with:
  • Configurations: (S, B, A)
    • S: stack of tree fragments, initially empty.
    • B: buffer of words, initially containing the whole sentence.
    • A: set of dependency arcs (head, relation, modifier), initially empty.
  • Transitions: (SH, RA, LA)
    • SH: shifts the first word in the buffer to the top of the stack.
    • RA: right-arc adds a dependency arc between the top two words of the stack.
    • LA: left-arc adds a dependency arc between the top word of the stack and the first word of the buffer.

    SH:   (S, w|B, A) => (S|w, B, A)
    RA_L: (S|u|v, B, A) => (S|u, B, A ∪ {(u,L,v)})
    LA_L: (S|u, v|B, A) => (S, v|B, A ∪ {(v,L,u)})
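The three transitions above can be sketched directly in code. This is a minimal illustration of the ArcHybrid rules, not the actual parser implementation; a configuration is a (stack, buffer, arcs) triple and arcs are (head, label, modifier) tuples.

```python
# Minimal sketch of the ArcHybrid transition system.
# Configuration: (stack, buffer, arcs); arcs are (head, label, modifier).

def shift(stack, buffer, arcs):
    # SH: (S, w|B, A) => (S|w, B, A)
    return stack + [buffer[0]], buffer[1:], arcs

def right_arc(stack, buffer, arcs, label):
    # RA_L: (S|u|v, B, A) => (S|u, B, A ∪ {(u, L, v)})
    u, v = stack[-2], stack[-1]
    return stack[:-1], buffer, arcs | {(u, label, v)}

def left_arc(stack, buffer, arcs, label):
    # LA_L: (S|u, v|B, A) => (S, v|B, A ∪ {(v, L, u)})
    u, v = stack[-1], buffer[0]
    return stack[:-1], buffer, arcs | {(v, label, u)}

# Example on the fragment "Economic news": shift, then left-arc.
stack, buffer, arcs = [], ["Economic", "news"], set()
stack, buffer, arcs = shift(stack, buffer, arcs)            # push "Economic"
stack, buffer, arcs = left_arc(stack, buffer, arcs, "ATT")  # news -ATT-> Economic
print(arcs)
```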

9 of 72

Transition based dependency parsing

[Figure: initial parser configuration for "Economic news had little effect on financial markets" — full buffer, empty stack, no arcs]

10 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

11 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_ATT]

12 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

13 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_SBJ]

14 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

15 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

16 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_ATT]

17 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

18 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

19 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

20 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_ATT]

21 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

22 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: RA_PC]

23 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: RA_ATT]

24 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: RA_OBJ]

25 of 72

Parsing model: decides correct transition based on state

[Figure: the current parser state (stack, buffer, arcs) passes through feature extraction to a decision module that scores the candidate transitions SH, LA_ATT, LA_SBJ, RA_ATT, RA_SBJ, ...]

26 of 72

Why neural networks?

27 of 72

Linear (pre-2014) vs nonlinear parsing models

Linear

  • Perceptron
  • Logistic regression
  • Naive Bayes

Nonlinear

  • Kernel perceptron
  • Neural net
  • SVM

28 of 72

Linear models cannot handle feature conjunctions

[Figure: an XOR-style pattern over two inputs that no linear separator can capture]
29 of 72

Linear models cannot handle feature conjunctions

[Figure: verb-object combinations — "eat cake" and "drink water" make sense while "eat water" and "drink cake" do not; separating them requires a conjunction of the VERB and OBJECT features]

30 of 72

Basic features for a linear parser model (Zhang & Nivre 2011)

[Figure: parser configuration with positions S0 (stack top) and N0, N1, N2 (buffer front) marked]

  • stack word: S0="had"
  • buffer words: N0="effect", N1="on", N2="financial"
  • parents/children: S0l="news", N0l="little", S0h=null, S0l2=null, S0r=null
  • word/postag: S0w="had", S0p="VBD"
  • distance(S0, N0): d=2
  • valence: S0vl=1, S0vr=0
  • dependency label: S0ll="SBJ", N0ll="ATT"
  • label set: S0sl={SBJ}, S0sr={}

39 basic features (ZN39)

31 of 72

Nonlinearity through ad-hoc feature conjunctions

Yue Zhang and Joakim Nivre. Transition-based Dependency Parsing with Rich Non-local Features. ACL 2011.

72 conj. features (ZN72)

derived from

39 basic features (ZN39)

32 of 72

Linear model with conj. features

33 of 72

Linear model with conj. features vs linear model with basic features

34 of 72

Neural networks can handle feature conjunctions and nonlinearity, but...

Neural nets are impractical for high-dimensional inputs: they scale linearly in the input dimension, in both time and space, assuming a fixed number of hidden units.

  • Example: the Penn Treebank has a ~10^5 word vocabulary. Using one-hot vectors, a 10 word feature vector would have ~10^6 dimensions. With 10^4 hidden units we get a dense weight matrix with 10^10 entries.
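The arithmetic in the example above is easy to check:

```python
# Back-of-the-envelope check of the dimensionality argument.
# Numbers follow the slide: ~10^5 vocabulary, a 10-word one-hot
# feature vector, 10^4 hidden units.

vocab = 10**5
n_features = 10          # words in the feature vector
hidden = 10**4

input_dim = vocab * n_features   # concatenated one-hot blocks
weights = input_dim * hidden     # dense first-layer weight matrix

print(input_dim)   # 1000000 input dimensions
print(weights)     # 10000000000 weights
```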

35 of 72

Solution: using dense embeddings for input features

Sparse word vectors

  • Dims ~ 10^5–10^6
  • Nonzeros ~ O(1)

Dense word vectors

  • Dims ~ 10^1–10^2
  • Nonzeros = Dims

[Figure: one-hot vectors for "dog" and "cat" — a single 1 among ~10^5 zeros — vs. their dense embedding vectors, where every one of ~10^2 dimensions is nonzero]
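The switch from one-hot to dense features amounts to a table lookup plus concatenation. This sketch uses a tiny random embedding table purely for illustration; the talk's model instead pretrains its vectors with a language model.

```python
# Replacing one-hot features with dense embeddings (toy example).
import random

random.seed(0)
dim = 8                                   # tiny stand-in for ~10^2 dims
vocab = ["dog", "cat", "news", "markets"]
embedding = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def featurize(words):
    # Concatenate the dense vectors of the feature words: the network
    # input is now len(words) * dim numbers instead of
    # len(words) * |V| mostly-zero numbers.
    out = []
    for w in words:
        out.extend(embedding[w])
    return out

x = featurize(["news", "markets"])
print(len(x))   # 16
```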

36 of 72

Model

37 of 72

Problem: decide correct move based on state

markets

financial

on

little

effect

had

news

Economic

BUFFER

STACK&ARCS

SH

ATT

SBJ

ATT

LAATT

LASBJ

RAATT

RASBJ

...

...

Feature extraction

Decision Module

38 of 72

Model components

  • Character based LSTM extracts word vectors.
  • Word based BLSTM extracts context vectors.
  • Feature extractor describes the current state.
  • Decision module (MLP) decides the next transition.

39 of 72

Character based LSTM generates word vectors

Word vector for “Economic”

(*) Single layer LSTM with 350 hidden units preinitialized in a language model.
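The idea of reading a word character by character and taking the LSTM's final hidden state as the word vector can be sketched as follows. The tiny dimensions and random weights are placeholders; the slide's model uses 350 hidden units pretrained in a language model.

```python
# Sketch: compose a word vector from its characters with an LSTM.
import numpy as np

rng = np.random.default_rng(0)
chars = "abcdefghijklmnopqrstuvwxyzE"   # toy character inventory
cdim, hdim = 4, 6                        # char embedding / hidden size
cemb = {c: rng.normal(size=cdim) for c in chars}

# One stacked weight matrix for the four gate blocks; input is [h; x].
W = rng.normal(scale=0.1, size=(4 * hdim, hdim + cdim))
b = np.zeros(4 * hdim)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def word_vector(word):
    h, c = np.zeros(hdim), np.zeros(hdim)
    for ch in word:
        z = W @ np.concatenate([h, cemb[ch]]) + b
        i, f, o, g = np.split(z, 4)      # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h                              # final hidden state = word vector

v = word_vector("Economic")
print(v.shape)   # (6,)
```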

40 of 72

Word based BLSTM generates context vectors

Word vector for “Economic”

Context vector for "Economic" = h^f_1 ⊕ h^b_9 (the forward and backward hidden states covering the word's position, concatenated)

(*) Single layer BLSTM with 300+300 hidden units preinitialized in a language model.
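The bidirectional context vector is just two recurrent passes plus a concatenation. In this sketch a plain tanh RNN stands in for the LSTM to keep it short; the slide's model uses a 300+300-unit BLSTM.

```python
# Sketch: context vectors from a forward and a backward pass.
import numpy as np

rng = np.random.default_rng(1)
wdim, hdim, n_words = 5, 3, 9
words = rng.normal(size=(n_words, wdim))   # word vectors (e.g. from the char LSTM)

Wx = rng.normal(scale=0.1, size=(hdim, wdim))
Wh = rng.normal(scale=0.1, size=(hdim, hdim))

def run(seq):
    h, out = np.zeros(hdim), []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return out

fwd = run(words)                 # h^f_1 ... h^f_9
bwd = run(words[::-1])[::-1]     # h^b_1 ... h^b_9, restored to word order
context = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(context[0].shape)   # (6,) — context vector for the first word
```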

41 of 72

Feature extractor describes current state

BUFFER

STACK

(*) Each state represented by a 4664 dimensional vector.

42 of 72

Decision module decides the next transition

4664 input dimensions

2048 hidden units

softmax layer with 73 transitions
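With the slide's sizes, the decision module is a single-hidden-layer network; this sketch uses random weights and assumes a ReLU activation (the activation choice is not stated on the slide).

```python
# Sketch of the decision MLP: 4664-dim state vector, 2048 hidden
# units, softmax over 73 transitions.
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid, n_trans = 4664, 2048, 73

W1 = rng.normal(scale=0.01, size=(d_hid, d_in))
b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.01, size=(n_trans, d_hid))
b2 = np.zeros(n_trans)

def decide(state):
    h = np.maximum(0, W1 @ state + b1)   # hidden layer (ReLU assumed)
    scores = W2 @ h + b2
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax over transitions
    return p

p = decide(rng.normal(size=d_in))
print(p.shape)   # (73,)
```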

43 of 72

Experiments

44 of 72

Dataset

  • CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
  • Dependency parsing of 81 treebanks in 49 languages.
  • All treebanks use standardized annotation:
    • 17 universal part-of-speech tags.
    • 37 universal dependency relations.
  • The Koc University system ranked 7th out of 33 participants (official results page, model source code).

45 of 72

CoNLL 2017 Results (all treebanks LAS)

[Figure: LAS scores of the participating systems over all treebanks]

46 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

47 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

Our BiLSTM language model word vectors perform better than FB vectors.

48 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

Both POS tags and context vectors make significant contributions on top of word vectors.

49 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

Context vectors provide an independent contribution on top of POS tags.

50 of 72

Related work

51 of 72

Embedding generation

[Figure: embedding choices — binary (one-hot) vs. dense continuous embedding vectors; dense vectors are either randomly initialized or transferred from a related task; transferred vectors are either finetuned or kept fixed]

52 of 72

Feature extraction

From: Miguel Ballesteros. 2015. Transition based dependency parsing. Algorithms for NLP Lecture Slides.

53 of 72

Decision module

  • MLP

  • LSTM

  • Stack LSTM

54 of 72

Decision module - MLP

[Figure: feedforward network mapping each parser state xt independently to a transition yt]

Introduced in Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750. Also used in our model.

xt: parser state, yt: transition

55 of 72

Decision module - LSTM

[Figure: recurrent decision module — each transition yt depends on the sequence of states x1 ... xt]

Adhiguna Kuncoro, Yuichiro Sawai, Kevin Duh, and Yuji Matsumoto. 2016. Dependency parsing with LSTMs: An empirical evaluation. CoRR abs/1604.06529.

56 of 72

Decision module - Stack LSTM

  • Augments an LSTM with a stack pointer and two constant-time operations:

  • Push: read an input, add the new state to the top of the stack.

  • Pop: move the stack pointer back.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory.
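The push/pop mechanics can be sketched by keeping every computed state together with a parent pointer: PUSH computes a new state from the state under the pointer, POP just moves the pointer back, and both are O(1). A tanh RNN cell stands in for the LSTM cell here to keep the sketch short.

```python
# Sketch of the stack LSTM mechanism (tanh RNN cell as a stand-in).
import numpy as np

rng = np.random.default_rng(3)
xdim, hdim = 4, 5
Wx = rng.normal(scale=0.1, size=(hdim, xdim))
Wh = rng.normal(scale=0.1, size=(hdim, hdim))

class StackRNN:
    def __init__(self):
        self.h = [np.zeros(hdim)]   # h[0] = empty-stack state
        self.parent = [0]           # parent[i] = state below state i
        self.ptr = 0                # stack pointer

    def push(self, x):
        self.h.append(np.tanh(Wx @ x + Wh @ self.h[self.ptr]))
        self.parent.append(self.ptr)
        self.ptr = len(self.h) - 1

    def pop(self):
        self.ptr = self.parent[self.ptr]   # move pointer back, O(1)

    def top(self):
        return self.h[self.ptr]

s = StackRNN()
s.push(rng.normal(size=xdim))   # PUSH x1
s.push(rng.normal(size=xdim))   # PUSH x2
s.pop()                          # pointer back to the state after x1
print(s.top().shape)   # (5,)
```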

57 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — PUSH reads x1 and puts a new state on top; the stack pointer marks the top]

58 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after processing x1 and x2]

59 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — PUSH]

60 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after processing x1, x2, x3]

61 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — POP moves the pointer back]

62 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after the POP]

63 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — PUSH]

64 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after pushing x4]

65 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — POP]

66 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after the second POP]

67 of 72

Decision module - Stack LSTM Parser

[Figure: the full stack LSTM parser architecture]

From: Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory.

68 of 72

Training variations

  • Static vs Dynamic Oracle
    • A static oracle follows the gold transition sequence during training.
    • A dynamic oracle trains on configurations reached by the model's own predicted moves.
    • In both cases the log probability of the gold moves is maximized.
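The two regimes differ only in which move advances the parser during training; the loss is the same. This skeleton uses toy stand-ins for the oracle and the model, so the names `oracle` and `model_scores` are placeholders, not the actual training code.

```python
# Skeleton of static vs. dynamic oracle training.
import math

MOVES = ["SH", "LA", "RA"]

def oracle(config):            # toy gold move (depends on step index)
    return MOVES[config % 3]

def model_scores(config):      # toy model: uniform distribution
    return {m: 1 / 3 for m in MOVES}

def train_step(config, dynamic):
    gold = oracle(config)
    probs = model_scores(config)
    loss = -math.log(probs[gold])          # maximize log p(gold move)
    if dynamic:
        move = max(probs, key=probs.get)   # advance with the model's prediction
    else:
        move = gold                        # advance with the gold move
    return loss, move

loss, move = train_step(0, dynamic=False)
print(move)   # SH — the static oracle always advances with the gold move
```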

69 of 72

Training variations

  • Global Normalization
    • Local models normalize the scores into a probability distribution at each decision.
    • Global models keep unnormalized scores at each decision and normalize once over the whole sequence of decisions.
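One way to write the contrast, with s(y_t | x, y_<t) denoting the model's raw score for transition y_t (this notation is assumed here, not taken from the slides):

```latex
% Local normalization: a softmax at every step
p_{\text{local}}(y_{1:T} \mid x) = \prod_{t=1}^{T}
  \frac{\exp s(y_t \mid x, y_{<t})}
       {\sum_{y'} \exp s(y' \mid x, y_{<t})}

% Global normalization: one partition function over whole sequences
p_{\text{global}}(y_{1:T} \mid x) =
  \frac{\exp \sum_{t=1}^{T} s(y_t \mid x, y_{<t})}
       {\sum_{y'_{1:T}} \exp \sum_{t=1}^{T} s(y'_t \mid x, y'_{<t})}
```

The global form avoids the label bias of locally normalized models at the cost of summing (or beam-approximating) over whole transition sequences.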

70 of 72

Summary

  • Introduced transition based dependency parsing.
  • Compared pre-2014 linear models with current neural network based models.
  • Described the CoNLL 2017 "Koc University" model, where word embeddings, context embeddings, and parser decisions are all computed by neural networks.
  • Discussed alternatives in embeddings, features, decision modules, and training variations.

71 of 72

Questions?

72 of 72

References