1 of 72

Neural network based dependency parsing

Deniz Yuret, July 26, 2017

2 of 72

Outline

  • Parsing
  • Neural networks
  • A neural network parser
  • Experiments
  • Related work

3 of 72

Parsing

4 of 72

Parsing

Phrase Structure

Dependency

Graph Based

Transition Based

5 of 72

Phrase structure parsing

From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.

6 of 72

Dependency parsing

From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.

7 of 72

Graph based dependency parsing

From: S. Kübler, R. McDonald and J. Nivre. 2009. Dependency parsing. Morgan & Claypool, US.

8 of 72

Transition based dependency parsing

  • Transition System: An abstract machine with a set of configurations (states) and transitions. We use the ArcHybrid transition system with:
  • Configurations: (S, B, A)
    • S: stack of tree fragments, initially empty.
    • B: buffer of words, initially containing the whole sentence.
    • A: set of dependency arcs (head, relation, modifier), initially empty.
  • Transitions: (SH, RA, LA)
    • SH: shifts the first word in the buffer to the top of the stack.
    • RA: right-arc adds a dependency arc between the top two words of the stack.
    • LA: left-arc adds a dependency arc between the top word of the stack and the first word of the buffer.

    SH:   (S, w|B, A) => (S|w, B, A)
    RA_L: (S|u|v, B, A) => (S|u, B, A ∪ {(u,L,v)})
    LA_L: (S|u, v|B, A) => (S, v|B, A ∪ {(v,L,u)})
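The three transitions above can be sketched directly in code. This is a minimal illustration of the ArcHybrid rules, not the actual parser implementation; a configuration is a (stack, buffer, arcs) triple and arcs are (head, label, modifier) tuples.

```python
# Minimal sketch of the ArcHybrid transition system.
# Configuration: (stack, buffer, arcs); arcs are (head, label, modifier).

def shift(stack, buffer, arcs):
    # SH: (S, w|B, A) => (S|w, B, A)
    return stack + [buffer[0]], buffer[1:], arcs

def right_arc(stack, buffer, arcs, label):
    # RA_L: (S|u|v, B, A) => (S|u, B, A ∪ {(u, L, v)})
    u, v = stack[-2], stack[-1]
    return stack[:-1], buffer, arcs | {(u, label, v)}

def left_arc(stack, buffer, arcs, label):
    # LA_L: (S|u, v|B, A) => (S, v|B, A ∪ {(v, L, u)})
    u, v = stack[-1], buffer[0]
    return stack[:-1], buffer, arcs | {(v, label, u)}

# Example on the fragment "Economic news": shift, then left-arc.
stack, buffer, arcs = [], ["Economic", "news"], set()
stack, buffer, arcs = shift(stack, buffer, arcs)            # push "Economic"
stack, buffer, arcs = left_arc(stack, buffer, arcs, "ATT")  # news -ATT-> Economic
print(arcs)
```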

9 of 72

Transition based dependency parsing

[Figure: initial parser configuration for "Economic news had little effect on financial markets" — full buffer, empty stack, no arcs]

10 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

11 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_ATT]

12 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

13 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_SBJ]

14 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

15 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

16 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_ATT]

17 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

18 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

19 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

20 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: LA_ATT]

21 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: SH]

22 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: RA_PC]

23 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: RA_ATT]

24 of 72

Transition based dependency parsing

[Figure: parser configuration; next transition: RA_OBJ]

25 of 72

Parsing model: decides correct transition based on state

[Figure: the current parser state (stack, buffer, arcs) passes through feature extraction to a decision module that scores the candidate transitions SH, LA_ATT, LA_SBJ, RA_ATT, RA_SBJ, ...]

26 of 72

Why neural networks?

27 of 72

Linear (pre-2014) vs nonlinear parsing models

Linear

  • Perceptron
  • Logistic regression
  • Naive Bayes

Nonlinear

  • Kernel perceptron
  • Neural net
  • SVM

28 of 72

Linear models cannot handle feature conjunctions

[Figure: an XOR-style pattern over two inputs that no linear separator can capture]
29 of 72

Linear models cannot handle feature conjunctions

[Figure: verb-object combinations — "eat cake" and "drink water" make sense while "eat water" and "drink cake" do not; separating them requires a conjunction of the VERB and OBJECT features]

30 of 72

Basic features for a linear parser model (Zhang & Nivre 2011)

[Figure: parser configuration with positions S0 (stack top) and N0, N1, N2 (buffer front) marked]

  • stack word: S0="had"
  • buffer words: N0="effect", N1="on", N2="financial"
  • parents/children: S0l="news", N0l="little", S0h=null, S0l2=null, S0r=null
  • word/postag: S0w="had", S0p="VBD"
  • distance(S0, N0): d=2
  • valence: S0vl=1, S0vr=0
  • dependency label: S0ll="SBJ", N0ll="ATT"
  • label set: S0sl={SBJ}, S0sr={}

39 basic features (ZN39)

31 of 72

Nonlinearity through ad-hoc feature conjunctions

Yue Zhang and Joakim Nivre. Transition-based Dependency Parsing with Rich Non-local Features. ACL 2011.

72 conj. features (ZN72)

derived from

39 basic features (ZN39)

32 of 72

Linear model with conj. features

33 of 72

Linear model with conj. features vs linear model with basic features

34 of 72

Neural networks can handle feature conjunctions and nonlinearity, but...

Neural nets are impractical for high-dimensional inputs: they scale linearly in the input dimension, in both time and space, assuming a fixed number of hidden units.

  • Example: the Penn Treebank has a ~10^5 word vocabulary. Using one-hot vectors, a 10 word feature vector would have ~10^6 dimensions. With 10^4 hidden units we get a dense weight matrix with 10^10 entries.
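The arithmetic in the example above is easy to check:

```python
# Back-of-the-envelope check of the dimensionality argument.
# Numbers follow the slide: ~10^5 vocabulary, a 10-word one-hot
# feature vector, 10^4 hidden units.

vocab = 10**5
n_features = 10          # words in the feature vector
hidden = 10**4

input_dim = vocab * n_features   # concatenated one-hot blocks
weights = input_dim * hidden     # dense first-layer weight matrix

print(input_dim)   # 1000000 input dimensions
print(weights)     # 10000000000 weights
```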

35 of 72

Solution: using dense embeddings for input features

Sparse word vectors

  • Dims ~ 10^5–10^6
  • Nonzeros ~ O(1)

Dense word vectors

  • Dims ~ 10^1–10^2
  • Nonzeros = Dims

[Figure: one-hot vectors for "dog" and "cat" — a single 1 among ~10^5 zeros — vs. their dense embedding vectors, where every one of ~10^2 dimensions is nonzero]
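The switch from one-hot to dense features amounts to a table lookup plus concatenation. This sketch uses a tiny random embedding table purely for illustration; the talk's model instead pretrains its vectors with a language model.

```python
# Replacing one-hot features with dense embeddings (toy example).
import random

random.seed(0)
dim = 8                                   # tiny stand-in for ~10^2 dims
vocab = ["dog", "cat", "news", "markets"]
embedding = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def featurize(words):
    # Concatenate the dense vectors of the feature words: the network
    # input is now len(words) * dim numbers instead of
    # len(words) * |V| mostly-zero numbers.
    out = []
    for w in words:
        out.extend(embedding[w])
    return out

x = featurize(["news", "markets"])
print(len(x))   # 16
```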

36 of 72

Model

37 of 72

Problem: decide correct move based on state

markets

financial

on

little

effect

had

news

Economic

BUFFER

STACK&ARCS

SH

ATT

SBJ

ATT

LAATT

LASBJ

RAATT

RASBJ

...

...

Feature extraction

Decision Module

38 of 72

Model components

  • Character based LSTM extracts word vectors.
  • Word based BLSTM extracts context vectors.
  • Feature extractor describes the current state.
  • Decision module (MLP) decides the next transition.

39 of 72

Character based LSTM generates word vectors

Word vector for “Economic”

(*) Single layer LSTM with 350 hidden units preinitialized in a language model.
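The idea of reading a word character by character and taking the LSTM's final hidden state as the word vector can be sketched as follows. The tiny dimensions and random weights are placeholders; the slide's model uses 350 hidden units pretrained in a language model.

```python
# Sketch: compose a word vector from its characters with an LSTM.
import numpy as np

rng = np.random.default_rng(0)
chars = "abcdefghijklmnopqrstuvwxyzE"   # toy character inventory
cdim, hdim = 4, 6                        # char embedding / hidden size
cemb = {c: rng.normal(size=cdim) for c in chars}

# One stacked weight matrix for the four gate blocks; input is [h; x].
W = rng.normal(scale=0.1, size=(4 * hdim, hdim + cdim))
b = np.zeros(4 * hdim)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def word_vector(word):
    h, c = np.zeros(hdim), np.zeros(hdim)
    for ch in word:
        z = W @ np.concatenate([h, cemb[ch]]) + b
        i, f, o, g = np.split(z, 4)      # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h                              # final hidden state = word vector

v = word_vector("Economic")
print(v.shape)   # (6,)
```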

40 of 72

Word based BLSTM generates context vectors

Word vector for “Economic”

Context vector for "Economic" = h^f_1 ⊕ h^b_9 (the forward and backward hidden states covering the word's position, concatenated)

(*) Single layer BLSTM with 300+300 hidden units preinitialized in a language model.
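The bidirectional context vector is just two recurrent passes plus a concatenation. In this sketch a plain tanh RNN stands in for the LSTM to keep it short; the slide's model uses a 300+300-unit BLSTM.

```python
# Sketch: context vectors from a forward and a backward pass.
import numpy as np

rng = np.random.default_rng(1)
wdim, hdim, n_words = 5, 3, 9
words = rng.normal(size=(n_words, wdim))   # word vectors (e.g. from the char LSTM)

Wx = rng.normal(scale=0.1, size=(hdim, wdim))
Wh = rng.normal(scale=0.1, size=(hdim, hdim))

def run(seq):
    h, out = np.zeros(hdim), []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return out

fwd = run(words)                 # h^f_1 ... h^f_9
bwd = run(words[::-1])[::-1]     # h^b_1 ... h^b_9, restored to word order
context = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(context[0].shape)   # (6,) — context vector for the first word
```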

41 of 72

Feature extractor describes current state

BUFFER

STACK

(*) Each state represented by a 4664 dimensional vector.

42 of 72

Decision module decides the next transition

4664 input dimensions

2048 hidden units

softmax layer with 73 transitions
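With the slide's sizes, the decision module is a single-hidden-layer network; this sketch uses random weights and assumes a ReLU activation (the activation choice is not stated on the slide).

```python
# Sketch of the decision MLP: 4664-dim state vector, 2048 hidden
# units, softmax over 73 transitions.
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid, n_trans = 4664, 2048, 73

W1 = rng.normal(scale=0.01, size=(d_hid, d_in))
b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.01, size=(n_trans, d_hid))
b2 = np.zeros(n_trans)

def decide(state):
    h = np.maximum(0, W1 @ state + b1)   # hidden layer (ReLU assumed)
    scores = W2 @ h + b2
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax over transitions
    return p

p = decide(rng.normal(size=d_in))
print(p.shape)   # (73,)
```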

43 of 72

Experiments

44 of 72

Dataset

  • CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
  • Dependency parsing of 81 treebanks in 49 languages.
  • All treebanks use standardized annotation:
    • 17 universal part-of-speech tags.
    • 37 universal dependency relations.
  • The Koc University system ranked 7th out of 33 participants (official results page, model source code).

45 of 72

CoNLL 2017 Results (all treebanks LAS)

[Figure: LAS scores of the participating systems over all treebanks]

46 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

47 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

Our BiLSTM language model word vectors perform better than FB vectors.

48 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

Both POS tags and context vectors make significant contributions on top of word vectors.

49 of 72

Relative contributions of part-of-speech (p), word vector (v), context vector (c)

Context vectors provide an independent contribution on top of POS tags.

50 of 72

Related work

51 of 72

Embedding generation

[Figure: embedding choices — binary (one-hot) vs. dense continuous embedding vectors; dense vectors are either randomly initialized or transferred from a related task; transferred vectors are either finetuned or kept fixed]

52 of 72

Feature extraction

From: Miguel Ballesteros. 2015. Transition based dependency parsing. Algorithms for NLP Lecture Slides.

53 of 72

Decision module

  • MLP

  • LSTM

  • Stack LSTM

54 of 72

Decision module - MLP

[Figure: feedforward network mapping each parser state xt independently to a transition yt]

Introduced in Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750. Also used in our model.

xt: parser state, yt: transition

55 of 72

Decision module - LSTM

[Figure: recurrent decision module — each transition yt depends on the sequence of states x1 ... xt]

Adhiguna Kuncoro, Yuichiro Sawai, Kevin Duh, and Yuji Matsumoto. 2016. Dependency parsing with LSTMs: An empirical evaluation. CoRR abs/1604.06529.

56 of 72

Decision module - Stack LSTM

  • Augments an LSTM with a stack pointer and two constant-time operations:

  • Push: read an input, add the new state to the top of the stack.

  • Pop: move the stack pointer back.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory.
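The push/pop mechanics can be sketched by keeping every computed state together with a parent pointer: PUSH computes a new state from the state under the pointer, POP just moves the pointer back, and both are O(1). A tanh RNN cell stands in for the LSTM cell here to keep the sketch short.

```python
# Sketch of the stack LSTM mechanism (tanh RNN cell as a stand-in).
import numpy as np

rng = np.random.default_rng(3)
xdim, hdim = 4, 5
Wx = rng.normal(scale=0.1, size=(hdim, xdim))
Wh = rng.normal(scale=0.1, size=(hdim, hdim))

class StackRNN:
    def __init__(self):
        self.h = [np.zeros(hdim)]   # h[0] = empty-stack state
        self.parent = [0]           # parent[i] = state below state i
        self.ptr = 0                # stack pointer

    def push(self, x):
        self.h.append(np.tanh(Wx @ x + Wh @ self.h[self.ptr]))
        self.parent.append(self.ptr)
        self.ptr = len(self.h) - 1

    def pop(self):
        self.ptr = self.parent[self.ptr]   # move pointer back, O(1)

    def top(self):
        return self.h[self.ptr]

s = StackRNN()
s.push(rng.normal(size=xdim))   # PUSH x1
s.push(rng.normal(size=xdim))   # PUSH x2
s.pop()                          # pointer back to the state after x1
print(s.top().shape)   # (5,)
```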

57 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — PUSH reads x1 and puts a new state on top; the stack pointer marks the top]

58 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after processing x1 and x2]

59 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — PUSH]

60 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after processing x1, x2, x3]

61 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — POP moves the pointer back]

62 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after the POP]

63 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — PUSH]

64 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after pushing x4]

65 of 72

Decision module - Stack LSTM

[Figure: stack LSTM — POP]

66 of 72

Decision module - Stack LSTM

[Figure: stack LSTM after the second POP]

67 of 72

Decision module - Stack LSTM Parser

[Figure: the full stack LSTM parser architecture]

From: Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory.

68 of 72

Training variations

  • Static vs Dynamic Oracle
    • A static oracle follows the gold transition sequence during training.
    • A dynamic oracle trains on configurations reached by the model's own predicted moves.
    • In both cases the log probability of the gold moves is maximized.
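The two regimes differ only in which move advances the parser during training; the loss is the same. This skeleton uses toy stand-ins for the oracle and the model, so the names `oracle` and `model_scores` are placeholders, not the actual training code.

```python
# Skeleton of static vs. dynamic oracle training.
import math

MOVES = ["SH", "LA", "RA"]

def oracle(config):            # toy gold move (depends on step index)
    return MOVES[config % 3]

def model_scores(config):      # toy model: uniform distribution
    return {m: 1 / 3 for m in MOVES}

def train_step(config, dynamic):
    gold = oracle(config)
    probs = model_scores(config)
    loss = -math.log(probs[gold])          # maximize log p(gold move)
    if dynamic:
        move = max(probs, key=probs.get)   # advance with the model's prediction
    else:
        move = gold                        # advance with the gold move
    return loss, move

loss, move = train_step(0, dynamic=False)
print(move)   # SH — the static oracle always advances with the gold move
```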

69 of 72

Training variations

  • Global Normalization
    • Local models normalize the scores into a probability distribution at each decision.
    • Global models keep unnormalized scores at each decision and normalize once over the whole sequence of decisions.
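One way to write the contrast, with s(y_t | x, y_<t) denoting the model's raw score for transition y_t (this notation is assumed here, not taken from the slides):

```latex
% Local normalization: a softmax at every step
p_{\text{local}}(y_{1:T} \mid x) = \prod_{t=1}^{T}
  \frac{\exp s(y_t \mid x, y_{<t})}
       {\sum_{y'} \exp s(y' \mid x, y_{<t})}

% Global normalization: one partition function over whole sequences
p_{\text{global}}(y_{1:T} \mid x) =
  \frac{\exp \sum_{t=1}^{T} s(y_t \mid x, y_{<t})}
       {\sum_{y'_{1:T}} \exp \sum_{t=1}^{T} s(y'_t \mid x, y'_{<t})}
```

The global form avoids the label bias of locally normalized models at the cost of summing (or beam-approximating) over whole transition sequences.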

70 of 72

Summary

  • Introduced transition based dependency parsing.
  • Compared pre-2014 linear models with current neural network based models.
  • Described the CoNLL 2017 "Koc University" model, where word embeddings, context embeddings, and parser decisions are all computed by neural networks.
  • Discussed alternatives in embeddings, features, decision modules, and training variations.

71 of 72

Questions?

72 of 72

References