1 of 47

NERD for NexGenTV

Lorenzo Canale


2 of 47

What's NexGenTV?

Motivation

  • Exploit the convergence of TV and Internet to enable Broadcasters to produce and deliver Second Screen Applications and facilitate Social TV.

Objective

  • Enrich viewers’ TV experience by providing relevant additional information and interactive content on their mobile devices.
  • Leverage the latest multimedia content analytics and social media monitoring to fuel a second-screen authoring platform.
  • Provide an end-to-end platform, from the broadcast stream to the front-end application on mobile devices.


3 of 47

NexGenTV Framework


4 of 47

NERD Goal


- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...

- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.

TRANSCRIPT (the two utterances above)

NAMED ENTITIES

Surface form     Start char  End char  Type    Link (Wikidata)
D. Pujadas       2           11        PERSON  Q2405595
...              ...         ...       ...     ...
Emmanuel Macron  232         246       PERSON  Q3052772

  • Named entity recognition / type recognition (NER)
  • Named entity disambiguation (NED)

5 of 47

NERD extractors on the Web


Key idea => build an ensemble method that takes the extractors' responses as input and produces a final list of named entities (NEs). This list has to outperform each single extractor's NE list in terms of F1 score, for both NER and NED.

6 of 47

How to combine the extractor responses?

NER issues

    • each extractor e_i has its own set of types T_ei
    • it is necessary to define a set T of final types and to map each extractor type to the corresponding type in T (type alignment)
    • e.g. considering two extractors e_1 and e_2:

T_e1          T_e2     T
PERSON        PERSON   PERSON
LOCATION      PLACE    PLACE
ORGANIZATION  COMPANY  ORGANIZATION
OCCUPATION    JOBTYPE  ROLE
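
The type alignment in the example above can be sketched as a per-extractor lookup table. This is a minimal Python sketch, not the method of any cited system: the mappings come from the example table, while the names `ALIGNMENT` and `align_type` and the choice of returning `None` for an unmapped type are assumptions made for illustration.

```python
from typing import Optional

# Per-extractor mappings from extractor types to the final type set T,
# copied from the example table for extractors e1 and e2.
ALIGNMENT = {
    "e1": {
        "PERSON": "PERSON",
        "LOCATION": "PLACE",
        "ORGANIZATION": "ORGANIZATION",
        "OCCUPATION": "ROLE",
    },
    "e2": {
        "PERSON": "PERSON",
        "PLACE": "PLACE",
        "COMPANY": "ORGANIZATION",
        "JOBTYPE": "ROLE",
    },
}


def align_type(extractor: str, extractor_type: str) -> Optional[str]:
    """Map an extractor-specific type to its type in T.

    Returns None when the extractor or the type is unknown (assumed behavior).
    """
    return ALIGNMENT.get(extractor, {}).get(extractor_type)


if __name__ == "__main__":
    for extractor, extractor_type in [("e1", "LOCATION"), ("e2", "COMPANY")]:
        print(extractor, extractor_type, "->", align_type(extractor, extractor_type))
```

A hand-written table like this only works when the mapping is known in advance; a learned alignment replaces the dictionary with a trained model.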

7 of 47

How to combine the extractor responses?

NED issues

    • each extractor e_i links the entities to a specific Knowledge Base K_ei
    • it is necessary to define a common Knowledge Base K and to replace the links to other KBs with the corresponding links in K

Example: for the surface form "Giuseppe Verdi", the links returned in K_e1 (DBpedia) and K_e2 (Freebase) are both replaced by the corresponding link in K (Wikidata).

8 of 47

State of the art : FOX

  • FOX integrates four NER tools: Stanford Named Entity Recognizer, Illinois Named Entity Tagger, Ottawa Baseline Information Extraction and Apache OpenNLP Name Finder
  • The FOX ensemble method addresses only NER, not NED
  • FOX implements 15 ensemble learning algorithms
  • The authors showed that the Multilayer Perceptron (MLP) approach gave the best results
  • The authors did not mention how they performed the type alignment


9 of 47

State of the art : FOX

[Figure: FOX pipeline. The text is sent to four extractors: Stanford Named Entity Recognizer (S), Illinois Named Entity Tagger (I), Ottawa Baseline Information Extraction (O) and Apache OpenNLP Name Finder (A). For each extractor: entities extractor types list -> types mapper -> entities normalized types list -> tokenizer -> tokens normalized types list -> one-hot encoder. The four one-hot encoded outputs are the input features for the MLP.]

10 of 47

State of the art : FOX

Entities extractor types list

Surface form     Start char  End char  Extractor type
D. Pujadas       2           11        POLITICIAN
...              ...         ...       ...
Emmanuel Macron  232         246       POLITICIAN

Entities normalized types list

Surface form     Start char  End char  Normalized type
D. Pujadas       2           11        PERSON
...              ...         ...       ...
Emmanuel Macron  232         246       PERSON

Tokens normalized types list

Token     Normalized type  Features vector
D         PERSON           0001
.         PERSON           0001
Pujadas   PERSON           0001
...       ...              ...
Emmanuel  PERSON           0001
Macron    PERSON           0001

Normalized types representation

Type          Features vector
PERSON        0001
ORGANIZATION  0010
PLACE         0100
ROLE          1000
Nan           0000
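
The normalized types representation above can be reproduced with a few lines of Python. This is an illustrative sketch only: the bit strings come from the slide, while the names `TYPES`, `one_hot` and `encode_tokens` and the rule that any unknown type is treated like "Nan" are assumptions.

```python
# Types in the order implied by the slide: the i-th type sets the i-th bit
# counted from the right (PERSON -> 0001, ..., ROLE -> 1000).
TYPES = ["PERSON", "ORGANIZATION", "PLACE", "ROLE"]


def one_hot(normalized_type):
    """Return the bit string for a normalized type.

    "Nan" (no type) and, by assumption, any unknown type map to all zeros.
    """
    bits = ["0"] * len(TYPES)
    if normalized_type in TYPES:
        bits[len(TYPES) - 1 - TYPES.index(normalized_type)] = "1"
    return "".join(bits)


def encode_tokens(token_types):
    """Turn a tokens normalized types list into (token, features vector) pairs."""
    return [(token, one_hot(normalized_type)) for token, normalized_type in token_types]


if __name__ == "__main__":
    tokens = [("D", "PERSON"), (".", "PERSON"), ("Pujadas", "PERSON"), (":", "Nan")]
    for token, vector in encode_tokens(tokens):
        print(token, vector)
```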

11 of 47

State of the art : NERD

  • NERD is a web service for comparing and evaluating many different web extractors
  • It defines a new way to perform type alignment: Inductive Entity Type Alignment

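[Figure: Inductive Entity Type Alignment. The types ground truth list (PERSON, ..., PLACE) and the types extractor list (JOURNALIST, ..., LOCATION) are given to a Naive Bayes or k-Nearest Neighbour classifier, which produces the learned mapping.]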

12 of 47

State of the art : NERD-ML

  • NERD-ML is a tool related to NERD that combines the extractors. It applies three different machine learning algorithms:
    • Naive Bayes (NB)
    • k-Nearest Neighbor (k-NN)
    • Support Vector Machines (SVM)


13 of 47

Feature engineering

For my ensemble method I identified 4 different types of features:

  1. surface features
  2. type features
  3. entity features
  4. score features

[Figure: the text is sent to the extractors (Alchemy, Dandelion, DBpedia Spotlight, TextRazor, Babelfy, MeaningCloud, Adel, OpenCalais). The four feature types (surface, type, entity and score features) are built from the text and the extractors' responses.]

14 of 47

1) Surface features (FastText)

  • FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
  • Word embedding approach
    • skipgram model => each word is represented as a bag of character n-grams
    • a vector representation is associated with each character n-gram
    • each word is represented as the sum of these representations
  • Strengths
    • it considers the morphology of words
    • it is fast
    • it can compute representations for words that did not appear in the training data
    • it represents each word in a few dimensions
    • it offers precomputed vectors and models trained on Wikipedia for a specific language
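
The subword idea described above can be illustrated without the FastText library. The following is a toy sketch, not FastText's implementation: the n-gram range 3 to 6 is chosen to mirror FastText's default `minn`/`maxn` settings, and `toy_ngram_vector` is a hypothetical stand-in that derives a deterministic pseudo-vector from a hash, whereas the real n-gram vectors are learned during training.

```python
import hashlib


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of "<word>", where < and > mark the word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def toy_ngram_vector(ngram, dim=4):
    """Hypothetical stand-in for a learned n-gram embedding (NOT trained)."""
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]


def word_vector(word, dim=4):
    """Represent a word as the sum of the vectors of its character n-grams."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        for k, value in enumerate(toy_ngram_vector(gram, dim)):
            total[k] += value
    return total


if __name__ == "__main__":
    print(char_ngrams("Macron", 3, 3))
    # A word never seen before still gets a vector, built from its n-grams.
    print(word_vector("Macroniste"))
```

Because "Macron" and "Macroniste" share most of their n-grams, their summed vectors share most of their terms, which is how morphology enters the representation.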


15 of 47

1) Surface features

[Figure: two FastText models, both trained with default parameters.
Model 1: subtitles corpus -> tokenizer -> corpus tokens -> FastText -> FastText model 1.
Model 2: French Wikipedia -> tokenizer -> Wikipedia tokens -> FastText -> FastText model 2.
Legend: vector 1, vector 2.]

16 of 47

1) Surface features

[Figure: the text is passed to FastText model 1 and FastText model 2, and each model computes a vector. Legend: a real-valued vector associated with a specific word/token; the token embedding computed using the precomputed French model; the token embedding computed by training FastText on the whole subtitles corpus.]

17 of 47

1) Surface features

TRANSCRIPT: the same two utterances as on the NERD Goal slide.

TOKENS: each token of the transcript (D, ..., Macron) is associated with its surface features.

18 of 47

2) Type features

[Figure: type taxonomy with two levels (Level 1, Level 2). The legend symbols were lost; the legend entries read:]

  => entity: (surface form, start, end, type, link)
  => extractor
  => features vector representing the type for the named entity and the extractor
  => the number of different types for the level

PERSON -> 0010000
...
MOUNTAIN -> 1001000
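
One possible reading of the two examples above is that the vector concatenates one one-hot block per taxonomy level, and that a Level 2 type also activates its Level 1 parent. The sketch below encodes that reading. It is an assumption, not the documented encoding: the taxonomy, the level sizes (3 and 4), the parent links and all type names other than PERSON and MOUNTAIN are hypothetical, chosen only so that the two slide examples are reproduced.

```python
# Hypothetical two-level taxonomy for a single extractor.
# Only PERSON and MOUNTAIN appear on the slide; everything else is assumed.
LEVEL_1 = ["PLACE", "ORGANIZATION", "PERSON"]
LEVEL_2 = ["MOUNTAIN", "CITY", "COMPANY", "POLITICIAN"]
PARENT = {
    "MOUNTAIN": "PLACE",
    "CITY": "PLACE",
    "COMPANY": "ORGANIZATION",
    "POLITICIAN": "PERSON",
}


def type_features(entity_type):
    """Concatenate a one-hot block for Level 1 with a one-hot block for Level 2."""
    level_1 = [0] * len(LEVEL_1)
    level_2 = [0] * len(LEVEL_2)
    if entity_type in LEVEL_2:
        level_2[LEVEL_2.index(entity_type)] = 1
        # A Level 2 type also switches on its Level 1 parent (assumed).
        level_1[LEVEL_1.index(PARENT[entity_type])] = 1
    elif entity_type in LEVEL_1:
        level_1[LEVEL_1.index(entity_type)] = 1
    return "".join(str(bit) for bit in level_1 + level_2)


if __name__ == "__main__":
    for entity_type in ["PERSON", "MOUNTAIN"]:
        print(entity_type, "->", type_features(entity_type))
```

Under this reading, the vector length is the sum of the numbers of types at each level, which is why it differs from one extractor to another.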

19 of 47

2) Type features

TRANSCRIPT: the same two utterances as on the NERD Goal slide.

NAMED ENTITIES

Surface form     Type    Type features
D. Pujadas       PERSON  0010000
...              ...     ...
Emmanuel Macron  PERSON  0010000

20 of 47

2) Type features

TRANSCRIPT: the same two utterances as on the NERD Goal slide.

The length of the type features vector depends on the extractor!

NAMED ENTITIES

Surface form     Type    Type features
D. Pujadas       PERSON  001
...              ...     ...
Emmanuel Macron  PERSON  001

21 of 47

3) Entity features (structural graph)

I define the structural graph G as the graph formed by the Wikidata triples that have as predicate one of these properties:

  • instance of (P31)

  • subclass of (P279)

  • part of (P361)

Examples of triples (subject, predicate, object):

  • (human, subclass of, person)
  • (Richard Wagner, instance of, human)
  • (Earth’s core, part of, Earth)
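
The definition of the structural graph amounts to a filter on the predicate of each triple. The sketch below is illustrative only: the function name `build_structural_graph`, the adjacency-list representation and the use of labels instead of Wikidata identifiers for subjects and objects are assumptions. The extra triple with P106 (occupation) is included to show a non-structural triple being dropped.

```python
# Wikidata properties that define the structural graph G.
STRUCTURAL_PROPERTIES = {
    "P31": "instance of",
    "P279": "subclass of",
    "P361": "part of",
}


def build_structural_graph(triples):
    """Keep only triples with a structural predicate.

    Returns adjacency lists: subject -> list of (predicate, object).
    """
    graph = {}
    for subject, predicate, obj in triples:
        if predicate in STRUCTURAL_PROPERTIES:
            graph.setdefault(subject, []).append((predicate, obj))
    return graph


if __name__ == "__main__":
    triples = [
        ("Richard Wagner", "P31", "human"),
        ("human", "P279", "person"),
        ("Earth's core", "P361", "Earth"),
        ("Richard Wagner", "P106", "composer"),  # occupation: not structural
    ]
    for subject, edges in build_structural_graph(triples).items():
        for predicate, obj in edges:
            print(subject, "-", STRUCTURAL_PROPERTIES[predicate], "->", obj)
```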

22 of 47

3) Entity features (structural graph)

[Figure: the structural graph, with subclass of, instance of and part of edges; the node Q35120 is labeled.]

Number of nodes = 1869003

23 of 47

3) Entity features (similarity)

[Figure: similarity between two entities: Q1511 (Richard Wagner, German composer, composer) and Q62095 (Johann Andreas Wagner, German scientist, scientist). Similarity values shown: 0, 0.57, 0.17, 0, 0.9.]

24 of 47

4) Score features

Some extractors return scores (relevance, confidence) associated with the named entities.

TRANSCRIPT: the same two utterances as on the NERD Goal slide.

NAMED ENTITIES

Surface form     Score relevance  Score confidence
D. Pujadas       0.8              0.77
...              ...              ...
Emmanuel Macron  0.8              0.9

25 of 47

NER task: ENNTR


26 of 47

Ensemble for NER

Ensemble Neural Network for Type Recognition (ENNTR)

Which is the right type?

27 of 47

Ensemble for NER (Input layers)

  • IT (input layers related to type features)


28 of 47

Ensemble for NER (Input layers)

  • IK (input layers related to score features)

  • IU (input layers related to entity features)


29 of 47

Ensemble for NER (Alignment block)

  • MK is formed by H neurons
  • MK activation is linear
  • alignment block strengths
    • it helps avoid local minima
    • it does not favor inputs with high dimensionality
    • it aligns the extractor types with the ground truth (GT) types

[Figure: alignment block]
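
The dimensionality point can be illustrated with a plain linear layer: whatever the size of an input, it is projected onto the same number H of neurons. The following pure-Python sketch only shows that mechanism. The value H = 2, the weights and the function name `linear_layer` are hypothetical; in a trained network the weights are learned, and this is not the author's implementation.

```python
def linear_layer(inputs, weights, biases):
    """Dense layer with linear (identity) activation.

    weights has shape [len(inputs)][len(biases)];
    output[j] = sum_i inputs[i] * weights[i][j] + biases[j].
    """
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


H = 2  # hypothetical number of neurons in each alignment layer

# Two inputs of different dimensionality (3 and 7), e.g. the type features
# returned by two different extractors.
short_input = [0, 0, 1]
long_input = [0, 0, 1, 0, 0, 0, 0]

# Hypothetical weights, standing in for learned parameters.
w_short = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
w_long = [[0.1 * (i + 1), 0.0] for i in range(7)]

aligned_short = linear_layer(short_input, w_short, [0.0] * H)
aligned_long = linear_layer(long_input, w_long, [0.0] * H)

if __name__ == "__main__":
    # Both inputs now have the same size H, regardless of their original length.
    print(len(aligned_short), len(aligned_long))
```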

30 of 47

Ensemble for NER (Alignment block for IT)


31 of 47

Ensemble for NER (Ensemble block)

  • The O activations are linear
  • Each neuron in O corresponds to a ground truth type
  • To establish the type of a specific token, the type associated with the highest neuron output value is chosen, provided that value is greater than a threshold of 0.5
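
The decision rule in the last bullet can be written as an argmax followed by a threshold check. This is a minimal sketch: the function name `choose_type`, the example output values and the use of `None` for "no type assigned" are assumptions.

```python
def choose_type(outputs, types, threshold=0.5):
    """Pick the type whose output neuron has the highest value.

    Returns None when even the highest value does not exceed the threshold.
    """
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return types[best] if outputs[best] > threshold else None


if __name__ == "__main__":
    ground_truth_types = ["PERSON", "ORGANIZATION", "PLACE", "ROLE"]
    # Hypothetical output values of the O layer for two tokens.
    print(choose_type([0.91, 0.05, 0.10, 0.02], ground_truth_types))
    print(choose_type([0.30, 0.40, 0.10, 0.20], ground_truth_types))
```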


32 of 47

Evaluation for NER (metrics)

  • token-based scores

[Figure: two lists of token types compared token by token: PERSON vs PERSON, ..., Nan vs PERSON.]
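
Token-based scores can be computed by comparing the two type lists position by position. The slide does not spell out the counting rules, so the sketch below fixes one common convention as an assumption: a token counts as a true positive when both lists give the same type, a wrong or spurious predicted type counts as a false positive, a missed or wrong ground truth type counts as a false negative, and tokens marked "Nan" in both lists are ignored.

```python
def token_scores(gold, predicted, no_type="Nan"):
    """Token-level precision, recall and F1 under the convention stated above."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p != no_type and p == g:
            tp += 1
        else:
            if p != no_type:
                fp += 1  # predicted a type that is wrong or not in the gold list
            if g != no_type:
                fn += 1  # gold type missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["PERSON", "PERSON", "Nan", "PLACE"]
    predicted = ["PERSON", "PERSON", "PERSON", "Nan"]
    print(token_scores(gold, predicted))
```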

33 of 47

Evaluation for NER (metrics)

  • brat-based scores

Brat annotation file example:

T1 Person 0 17 Marvin Lee Minsky
T2 Role 33 44 eye surgeon
T3 Role 45 51 father
T4 Person 53 58 Henry
T5 Role 76 82 mother
T6 Person 84 90 Fannie
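
The annotation lines in the example can be read into structured records with a small parser. This sketch handles only contiguous text-bound annotations (IDs starting with "T") such as those shown; other brat annotation kinds and discontinuous spans are skipped or unsupported. The names `BratEntity` and `parse_brat` are hypothetical.

```python
from typing import NamedTuple


class BratEntity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat(lines):
    """Parse text-bound brat annotations: ID, type, start offset, end offset, text."""
    entities = []
    for line in lines:
        line = line.strip()
        if not line.startswith("T"):
            continue  # skip blank lines and non-entity annotations
        # Splitting on any whitespace works for tab- or space-separated fields.
        ann_id, label, start, end, text = line.split(None, 4)
        entities.append(BratEntity(ann_id, label, int(start), int(end), text))
    return entities


EXAMPLE = """\
T1 Person 0 17 Marvin Lee Minsky
T2 Role 33 44 eye surgeon
T3 Role 45 51 father
T4 Person 53 58 Henry
T5 Role 76 82 mother
T6 Person 84 90 Fannie
"""

if __name__ == "__main__":
    for entity in parse_brat(EXAMPLE.splitlines()):
        print(entity)
```

In this example the end offset is exclusive: for every annotation, end minus start equals the length of the annotated text.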

34 of 47

Evaluation for NER (OKE2016)


35 of 47

Evaluation for NER (NexGenTV corpus)

36 of 47

Evaluation for NER (features influence)

Feature type  Influence
type          40%
surface       23%
entity        20%
score         17%

37 of 47

NED task: ENND


38 of 47

Ensemble for NED

Ensemble Neural Network for Disambiguation (ENND)

Is the candidate entity the right one?

Voting mechanism

39 of 47

Ensemble for NED (Input layers)

  • IU (input layers related to entity features)


40 of 47

Ensemble for NED (Input layers)

  • IT (input layers related to type features)


41 of 47

Ensemble for NED


42 of 47

Ensemble for NED (Voting mechanism)

CANDIDATES for token x

Candidate  Output neuron value
C1,x       o1,x
...        ...
CN,x       oN,x

The candidate with the highest output value, oMax,x, is a valid candidate if oMax,x > 0.5; otherwise it is a rejected candidate.
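
The voting step for one token can be sketched as follows. This is an illustration under assumptions: the function name `vote`, the representation of candidates as (identifier, output value) pairs, the example values and the use of `None` for a rejected candidate are all hypothetical.

```python
def vote(candidates, threshold=0.5):
    """Select the candidate with the highest output neuron value for a token.

    candidates: list of (candidate_id, output_value) pairs.
    Returns the candidate id if its value exceeds the threshold, else None.
    """
    if not candidates:
        return None
    best_candidate, best_output = max(candidates, key=lambda pair: pair[1])
    return best_candidate if best_output > threshold else None


if __name__ == "__main__":
    # Hypothetical output values for two candidate entities of one token.
    print(vote([("Q1511", 0.87), ("Q62095", 0.02)]))  # valid candidate
    print(vote([("Q1511", 0.40), ("Q62095", 0.02)]))  # rejected candidate
```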

43 of 47

Evaluation for NED (metrics)

  • ENND scores

  • Disambiguation scores

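[Figure: the ENND output values are compared with the expected values (0.87 vs 1, ..., 0.02 vs 0) using categorical cross entropy; the entity links are compared pairwise (Q34 vs Q3, ..., Q12 vs Q12).]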

44 of 47

Evaluation for NED (scores)

  • OKE2016

  • NexGenTV corpus


45 of 47

Summary

NER

  • The ensemble method outperforms the single extractors (F1)
  • How? By improving the precision of the single extractors' output NE lists
  • The most useful features are the type ones
  • Without the type features, the ensemble method reaches different minima and still outperforms the single extractors

NED

  • The ensemble method outperforms the single extractors (F1)
  • How? By improving the precision of the single extractors' output NE lists
  • The most useful features are the entity ones
  • The type features also improve the final F1 score by 10%

46 of 47

Future work

  • add new extractors to the ensemble (e.g. spaCy)
  • increase the NexGenTV corpus training data
  • try using Wikidata embeddings rather than entity similarity features
  • use part-of-speech tag features
  • use LSTM/BiLSTM for surface features


47 of 47

Thank you!
