1 of 47

NERD for NexGenTV

Lorenzo Canale

2 of 47

What's NexGenTV?

Motivation

Exploit the convergence of TV and Internet to enable Broadcasters to produce and deliver Second Screen Applications and facilitate Social TV.

Objective

Enrich viewers’ TV experience by providing relevant additional information and interactive content on their mobile devices.
Leverage the latest multimedia content analytics and social media monitoring to fuel a second-screen authoring platform.
Provide an end-to-end platform, from the broadcast stream to the front-end application on mobile devices.

3 of 47

NexGenTV Framework

4 of 47

NERD Goal

- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...

- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.

TRANSCRIPT

Surface form	Start char	End char	Type	Link (Wikidata)
D. Pujadas	2	11	PERSON	Q2405595
......	…...	…...	…...	…...
Emmanuel Macron	232	246	PERSON	Q3052772

NAMED ENTITIES

Named entity recognition / type recognition (NER)

Named entity disambiguation (NED)

5 of 47

NERD extractors on the Web

Key idea: build an ensemble method that takes the extractors' responses as input and produces a final list of NEs. This list has to outperform the NE lists of the individual extractors in terms of F1 score, for both NER and NED.

6 of 47

How to combine the extractor responses?

NER issues

each extractor e_i has its own set of types Te_i
it is necessary to define a set T of final types and to map each extractor type to the corresponding type in T (type alignment)
e.g. considering two extractors e₁ and e₂:

Te₁	Te₂	T
PERSON	PERSON	PERSON
LOCATION	PLACE	PLACE
ORGANIZATION	COMPANY	ORGANIZATION
OCCUPATION	JOBTYPE	ROLE
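
The alignment table above can be read as a per-extractor lookup. The sketch below is only an illustration: the extractor names `e1` and `e2`, the `ALIGNMENT` dictionary and the `align_type` helper are hypothetical, and the table holds just the four example rows.

```python
# Hypothetical alignment tables copied from the example rows above;
# real extractors expose many more types.
ALIGNMENT = {
    "e1": {"PERSON": "PERSON", "LOCATION": "PLACE",
           "ORGANIZATION": "ORGANIZATION", "OCCUPATION": "ROLE"},
    "e2": {"PERSON": "PERSON", "PLACE": "PLACE",
           "COMPANY": "ORGANIZATION", "JOBTYPE": "ROLE"},
}


def align_type(extractor, extractor_type):
    """Map an extractor-specific type to the shared set T (None if unmapped)."""
    return ALIGNMENT.get(extractor, {}).get(extractor_type.upper())


print(align_type("e1", "LOCATION"))
print(align_type("e2", "JobType"))
```

A hand-written table like this does not scale to many extractors, which is why learned alignments are discussed later in the deck.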

7 of 47

How to combine the extractor responses?

NED issues

each extractor e_i links the entities to a specific Knowledge Base Ke_i
it is necessary to define a common Knowledge Base K and to replace the links to other KBs with links to K

Surface form	Ke₁ (DBpedia)	Ke₂ (Freebase)	K (Wikidata)
Giuseppe Verdi	http://dbpedia.org/resource/Giuseppe_Verdi	http://rdf.freebase.com/ns/m.03d6q	http://www.wikidata.org/entity/Q7317
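
Replacing links with links to the common KB can be sketched as a lookup. The slides do not say how the correspondence between KBs was obtained, so the `TO_WIKIDATA` table and the `to_common_kb` helper below are hypothetical and contain only the example row.

```python
# Hypothetical lookup table holding only the example row above. How the
# real correspondence between KBs is built is not stated in the slides.
TO_WIKIDATA = {
    "http://dbpedia.org/resource/Giuseppe_Verdi": "http://www.wikidata.org/entity/Q7317",
    "http://rdf.freebase.com/ns/m.03d6q": "http://www.wikidata.org/entity/Q7317",
}
WIKIDATA_PREFIX = "http://www.wikidata.org/entity/"


def to_common_kb(link):
    """Return the Wikidata link for an extractor link, or None if unknown."""
    if link.startswith(WIKIDATA_PREFIX):
        return link  # already expressed in the common KB
    return TO_WIKIDATA.get(link)


print(to_common_kb("http://rdf.freebase.com/ns/m.03d6q"))
```

Once every extractor answer points to the same KB, two extractors agree on an entity exactly when their normalized links are equal.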

8 of 47

State of the art : FOX

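FOX integrates four NER tools: Stanford Named Entity Recognizer (S), Illinois Named Entity Tagger (I), Ottawa Baseline Information Extraction (O) and Apache OpenNLP Name Finder (A)

The FOX ensemble method only addresses NER, not NED
FOX implements 15 ensemble learning algorithms
The authors showed that the Multilayer Perceptron approach gave the best results
The authors did not mention how they performed the type alignment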

9 of 47

State of the art : FOX

[Figure: FOX pipeline. The text is sent to four extractors: Stanford Named Entity Recognizer (S), Illinois Named Entity Tagger (I), Ottawa Baseline Information Extraction (O), Apache OpenNLP Name Finder (A). For each extractor: entities extractor types list → Types mapper → entities normalized types list → Tokenizer → tokens normalized types list → One-Hot encoder → input features for MLP.]

10 of 47

State of the art : FOX

Entities extractor types list

Surface form	Start char	End char	Extractor type
D. Pujadas	2	11	POLITICIAN
......	…...	…...	…...
Emmanuel Macron	232	246	POLITICIAN

Entities normalized types list

Surface form	Start char	End char	Normalized type
D. Pujadas	2	11	PERSON
......	…...	…...	…...
Emmanuel Macron	232	246	PERSON

Tokens normalized types list

Token	Normalized type	Features vector
D	PERSON	0001
.	PERSON	0001
Pujadas	PERSON	0001
......	......	......
Emmanuel	PERSON	0001
Macron	PERSON	0001

Normalized types representation

Type	Features vector
PERSON	0001
ORGANIZATION	0010
PLACE	0100
ROLE	1000
Nan	0000
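
The last two steps of this pipeline (entity spans to tokens, then one-hot encoding) can be sketched as follows. This is not FOX's code: the regular-expression tokenizer, the `tokens_with_types` helper and the short example text with its offsets are assumptions made for illustration. Only the bit order of the types is taken from the table on this slide.

```python
import re

# Bit order taken from the "Normalized types representation" table:
# ROLE=1000, PLACE=0100, ORGANIZATION=0010, PERSON=0001, no type=0000.
TYPES = ["ROLE", "PLACE", "ORGANIZATION", "PERSON"]


def one_hot(type_name):
    """Encode a normalized type (or None) as a 4-character bit string."""
    return "".join("1" if t == type_name else "0" for t in TYPES)


def tokens_with_types(text, entities):
    """entities: list of (start, end_inclusive, normalized_type) spans."""
    rows = []
    # Assumed tokenizer: runs of word characters or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        token_type = None
        for start, end, entity_type in entities:
            if start <= match.start() and match.end() - 1 <= end:
                token_type = entity_type
                break
        rows.append((match.group(), one_hot(token_type)))
    return rows


text = "- D. Pujadas : Emmanuel Macron veut"
entities = [(2, 11, "PERSON"), (15, 29, "PERSON")]  # offsets for this short text only
for token, vector in tokens_with_types(text, entities):
    print(token, vector)
```

Each extractor yields one such vector per token, and the vectors of all extractors for the same token form the input of the MLP.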

11 of 47

State of the art : NERD

NERD is a web service for comparing and evaluating many different web extractors
It defines a new way to perform type alignment: Inductive Entity Type Alignment

[Figure: Inductive Entity Type Alignment. The types ground truth list (PERSON, ..., PLACE) and the types extractor list (JOURNALIST, ..., LOCATION) are fed to Naive Bayes or k-Nearest Neighbour, which outputs the learned mapping.]
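
A learned mapping of this kind can be illustrated with simple co-occurrence counts. The sketch below is not NERD's implementation: it assumes that each training pair holds the type an extractor gave to an entity and the ground truth type of the same entity, and it keeps the most frequent ground truth type for each extractor type. With the extractor type as the only feature and no smoothing, a Naive Bayes classifier would pick the same answer.

```python
from collections import Counter, defaultdict


def learn_mapping(pairs):
    """pairs: iterable of (extractor_type, ground_truth_type) for aligned entities."""
    counts = defaultdict(Counter)
    for extractor_type, gt_type in pairs:
        counts[extractor_type][gt_type] += 1
    # Keep the ground truth type seen most often with each extractor type.
    return {ext: c.most_common(1)[0][0] for ext, c in counts.items()}


# Hypothetical training pairs.
pairs = [
    ("JOURNALIST", "PERSON"),
    ("JOURNALIST", "PERSON"),
    ("JOURNALIST", "ROLE"),
    ("LOCATION", "PLACE"),
]
print(learn_mapping(pairs))
```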

12 of 47

State of the art : NERD-ML

NERD-ML is a tool related to NERD that combines the extractors. It applies three different machine learning algorithms:

Naive Bayes (NB)
k-Nearest Neighbor (k-NN)
Support Vector Machines (SVM)

13 of 47

Features engineering

For my ensemble method I identified 4 different types of features:

surface features
type features
entity features
score features

[Figure: Text; Extractors: Alchemy, Dandelion, DBpedia Spotlight, TextRazor, Babelfy, Meaning Cloud, Adel, Open Calais; Extractors responses; features: surface features, type features, entity features, score features]

14 of 47

Surface features (FastText)

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
Word Embedding approach

skipgram model => each word is represented as a bag of character n-grams
a vector representation is associated with each character n-gram
words are represented as the sum of these representations

Strengths

it takes the morphology of words into account
it is fast
it can compute representations for words that did not appear in the training data
it represents each word with a low-dimensional vector
it offers precomputed vectors and models trained on Wikipedia for a specific language
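
The subword idea described above can be sketched without the FastText library. In this toy version the n-gram vectors are pseudo-random stand-ins for learned ones, and `DIM`, `BUCKETS` and the helper names are hypothetical. The n-gram lengths 3 to 6 and the `<`/`>` boundary markers follow FastText's defaults.

```python
import random
import zlib

DIM = 8          # toy dimension; real models use far more
BUCKETS = 1000   # toy hash-table size


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' plus the whole marked word itself."""
    marked = f"<{word}>"
    grams = {marked}
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return sorted(grams)


def ngram_vector(gram):
    """Deterministic pseudo-random vector standing in for a learned one."""
    rng = random.Random(zlib.crc32(gram.encode("utf-8")) % BUCKETS)
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def word_vector(word):
    """Sum of the n-gram vectors, as in the description above."""
    vectors = [ngram_vector(g) for g in char_ngrams(word)]
    return [sum(column) for column in zip(*vectors)]


print(char_ngrams("Macron", 3, 3))
print(len(word_vector("Pujadas")))
```

Because the vector is built from n-grams, a word never seen in training still gets a representation as long as some of its n-grams are known.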

15 of 47

Surface features

[Figure: two FastText models.
Model 1: Subtitles corpus → Tokenizer → Corpus tokens → Fasttext (default parameters) → Fasttext model 1 → vector 1
Model 2: French Wikipedia → Tokenizer → Wikipedia tokens → Fasttext (default parameters) → Fasttext model 2 (precomputed) → vector 2]

https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

16 of 47

Surface features

[Figure: Text → Compute vector, using Fasttext model 1 and Fasttext model 2. Legend: real-valued vector associated with a specific word/token; token embedding computed using the precomputed French model; token embedding computed by training FastText on the whole subtitles corpus]

17 of 47

Surface features

- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...

- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.

Token	Surface features
D
......	…...
Macron

TRANSCRIPT

TOKENS

18 of 47

2) Type features

Level 1

Level 2

=> entity: (surface form, start, end, type, link)

=> extractor

=> features vector representing the type for the named entity and the extractor

=> the number of different types for the level

PERSON -> 0010000

….

MOUNTAIN -> 1001000
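
One reading of the two examples above is that the vector concatenates one one-hot block per level of the extractor's type hierarchy. The sketch below follows that reading. The taxonomy in `LEVELS` is hypothetical and was chosen only so that the output matches the 7-bit examples.

```python
# Hypothetical two-level taxonomy for one extractor: 3 level-1 types and
# 4 level-2 types, chosen only to reproduce the 7-bit examples above.
LEVELS = [
    ["PLACE", "ORGANIZATION", "PERSON"],            # level 1
    ["MOUNTAIN", "CITY", "COMPANY", "POLITICIAN"],  # level 2
]


def type_features(type_path):
    """type_path: one type name per level, e.g. ["PLACE", "MOUNTAIN"]; may be shorter."""
    bits = []
    for depth, level_types in enumerate(LEVELS):
        chosen = type_path[depth] if depth < len(type_path) else None
        bits.extend("1" if t == chosen else "0" for t in level_types)
    return "".join(bits)


print(type_features(["PERSON"]))
print(type_features(["PLACE", "MOUNTAIN"]))
```

Under this reading the vector length is the sum of the number of types at each level, so it differs from one extractor to another.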

19 of 47

2) Type features

- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...

- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.

Surface form	Type	Type features
D. Pujadas	PERSON	0010000
......	…...	…...
Emmanuel Macron	PERSON	0010000

TRANSCRIPT

NAMED ENTITIES

20 of 47

2) Type features

- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...

- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.

The length of the type features depends on the extractor!

Surface form	Type	Type features
D. Pujadas	PERSON	001
......	…...	…...
Emmanuel Macron	PERSON	001

NAMED ENTITIES

TRANSCRIPT

21 of 47

3) Entity features (structural graph)

I define the structural graph G as the graph formed by the Wikidata triples that have as predicate one of these properties:

instance of (P31)

subclass of (P279)

part of (P361)

Examples:

human → subclass of → person
Richard Wagner → instance of → human
Earth’s core → part of → Earth
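
Building the structural graph amounts to filtering triples by predicate. The sketch below assumes the triples are available as (subject, property, object) tuples. For readability it uses labels instead of Wikidata Q-identifiers, and the `structural_graph` helper and the occupation triple are illustrative.

```python
# The three properties named above: instance of, subclass of, part of.
STRUCTURAL_PROPERTIES = {"P31", "P279", "P361"}


def structural_graph(triples):
    """Keep structural triples only, indexed as subject -> [(property, object)]."""
    graph = {}
    for subject, prop, obj in triples:
        if prop in STRUCTURAL_PROPERTIES:
            graph.setdefault(subject, []).append((prop, obj))
    return graph


# Labels are used instead of Q-identifiers to keep the example readable.
triples = [
    ("Richard Wagner", "P31", "human"),
    ("human", "P279", "person"),
    ("Earth's core", "P361", "Earth"),
    ("Richard Wagner", "P106", "composer"),  # occupation: not structural, dropped
]
print(structural_graph(triples))
```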

22 of 47

3) Entity features (structural graph)

Number of nodes = 1869003

[Figure: structural graph with subclass of, instance of and part of edges; labelled node: Q35120]

23 of 47

3) Entity features (similarity)

[Figure: similarity between two entities. Q1511: Richard Wagner, German composer, composer. Q62095: Johann Andreas Wagner, German scientist, scientist. Values shown in the figure: 0, 0.57, 0.17, 0, 0.9]

24 of 47

4) Score features

Some extractors return scores (relevance, confidence) associated with the named entities.

- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...

- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.

Surface form	Score relevance	Score confidence
D. Pujadas	0.8	0.77
......	…...	…...
Emmanuel Macron	0.8	0.9

NAMED ENTITIES

TRANSCRIPT

25 of 47

NER task: ENNTR

26 of 47

Ensemble for NER

Ensemble Neural Network for Type Recognition (ENNTR)

Which is the right type?

27 of 47

Ensemble for NER (Input layers)

IT (input layers related to type features)

28 of 47

Ensemble for NER (Input layers)

IK (input layers related to score features)

IU (input layers related to entity features)

29 of 47

Ensemble for NER (Alignment block)

M_K is formed by H neurons
the M_K activation is linear
alignment block strengths:

it helps avoid local minima
it does not favor inputs with high dimensionality
it aligns the extractor types with the GT types

[Figure: alignment block]
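
The effect of the alignment block on input dimensionality can be sketched in plain Python. Everything below is an assumption made for illustration: the value of `H`, the random weights standing in for trained ones, the absence of a bias term and the concatenation of the projections. The slides only state that each M_K has H neurons and a linear activation.

```python
import random

H = 4  # neurons per alignment layer (hypothetical value)


def make_linear_layer(input_dim, output_dim, rng):
    """Random weights standing in for trained ones; no bias, linear activation."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
            for _ in range(output_dim)]


def apply_linear(weights, x):
    """Plain matrix-vector product: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def alignment_block(inputs, layers):
    """Project every input to H values, then concatenate the projections."""
    out = []
    for x, weights in zip(inputs, layers):
        out.extend(apply_linear(weights, x))
    return out


rng = random.Random(0)
inputs = [[0, 0, 1, 0, 0, 0, 0], [0, 0, 1]]  # 7-bit and 3-bit type features
layers = [make_linear_layer(len(x), H, rng) for x in inputs]
aligned = alignment_block(inputs, layers)
print(len(aligned))  # 2 inputs * H = 8
```

Each input contributes exactly H values whatever its original length, which is how the block avoids favoring high-dimensional inputs.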

30 of 47

Ensemble for NER (Alignment block for I_T)

31 of 47

Ensemble for NER (Ensemble block)

The O activations are linear
Each neuron in O corresponds to a ground truth type
To establish the type of a specific token, the type associated with the highest neuron output value is chosen, provided that value is greater than a threshold of 0.5
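
The decision rule above is an argmax followed by a threshold test. In the sketch below, the order of the types in `GT_TYPES` and the `predict_type` helper are hypothetical.

```python
# Hypothetical order of the ground truth types over the neurons of O.
GT_TYPES = ["PERSON", "ORGANIZATION", "PLACE", "ROLE"]


def predict_type(outputs, threshold=0.5):
    """Return the type of the strongest neuron, or None if it is not above the threshold."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return GT_TYPES[best] if outputs[best] > threshold else None


print(predict_type([0.9, 0.1, 0.0, 0.2]))
print(predict_type([0.3, 0.4, 0.1, 0.2]))
```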

32 of 47

Evaluation for NER (metrics)

token based scores

PERSON	PERSON
......	......
Nan	PERSON
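
Token-based precision, recall and F1 can be computed by comparing the two type columns token by token. The column headers of the table above were lost, so the counting convention below is an assumption: a token is a true positive when the predicted type equals the ground truth type, a predicted type that is wrong counts as a false positive, and a ground truth type that is not matched counts as a false negative. `None` stands for Nan.

```python
def token_scores(predicted, gold):
    """predicted, gold: per-token types, with None meaning 'no type' (Nan)."""
    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        if p is not None and p == g:
            tp += 1
        else:
            if p is not None:
                fp += 1  # a type was predicted but it is wrong
            if g is not None:
                fn += 1  # a ground truth type was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical predictions and ground truth for three tokens.
print(token_scores(["PERSON", None, "PLACE"], ["PERSON", "PERSON", "ROLE"]))
```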

33 of 47

Evaluation for NER (metrics)

brat based scores

T1 Person 0 17 Marvin Lee Minsky

T2 Role 33 44 eye surgeon

T3 Role 45 51 father

T4 Person 53 58 Henry

T5 Role 76 82 mother

T6 Person 84 90 Fannie

Brat annotation file example

BratUtils
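
Annotation lines like the ones above can be read with a few lines of Python. This sketch is not BratUtils: `parse_brat_line` is a hypothetical helper that handles only simple text-bound annotations with one continuous span, and it accepts either tabs or spaces between fields.

```python
def parse_brat_line(line):
    """Parse 'ID TYPE START END TEXT' into a dictionary (single continuous span only)."""
    ann_id, ann_type, start, end, text = line.strip().split(None, 4)
    return {"id": ann_id, "type": ann_type,
            "start": int(start), "end": int(end), "text": text}


annotation = parse_brat_line("T1 Person 0 17 Marvin Lee Minsky")
print(annotation)
# In this example the span length equals the length of the surface form.
print(annotation["end"] - annotation["start"] == len(annotation["text"]))
```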

34 of 47

Evaluation for NER (OKE2016)

35 of 47

Evaluation for NER

(NexGenTV corpus)

36 of 47

Evaluation for NER (features influence)

Feature type	Influence
type	40%
surface	23%
entity	20%
score	17%

37 of 47

NED task: ENND

38 of 47

Ensemble for NED

Voting mechanism

Ensemble Neural Network for Disambiguation (ENND)

Is the candidate entity the right one?

39 of 47

Ensemble for NED (Input layers)

IU (input layers related to entity features)

40 of 47

Ensemble for NED (Input layers)

IT (input layers related to type features)

41 of 47

Ensemble for NED

42 of 47

Ensemble for NED (Voting mechanism)

Candidates for token x

Candidate	Output neuron value
C_1,x	o_1,x
......	…...
C_N,x	o_N,x

If o_Max,x > 0.5, the candidate is valid; otherwise the candidate is rejected.
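
The voting step can be sketched as picking the candidate with the highest output value and keeping it only when that value exceeds 0.5. The `vote` helper, the candidate identifiers and the scores below are hypothetical.

```python
def vote(candidates, threshold=0.5):
    """candidates: list of (candidate_id, output_value) pairs for one token."""
    if not candidates:
        return None
    best_id, best_value = max(candidates, key=lambda c: c[1])
    # The best candidate is valid only above the threshold, otherwise it is rejected.
    return best_id if best_value > threshold else None


# Hypothetical candidates and output values for one token.
print(vote([("Q3052772", 0.87), ("Q12", 0.02)]))
print(vote([("Q34", 0.40), ("Q12", 0.10)]))
```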

43 of 47

Evaluation for NED (metrics)

ENND scores

Disambiguation scores

categorical cross entropy


Q34	Q3
......	......
Q12	Q12


0.87	1
......	......
0.02	0

44 of 47

Evaluation for NED (scores)

OKE2016

NexGenTV corpus

45 of 47

Summary

NER

The ensemble method outperforms the single extractors (F1)
How? By improving the precision of the single extractors' output NE lists
The most useful features are the type ones
When the type features are removed, the ensemble method reaches different minima and outperforms the single extractors

NED

The ensemble method outperforms the single extractors (F1)
How? By improving the precision of the single extractors' output NE lists
The most useful features are the entity ones
The type features also improve the final F1 score by 10%

46 of 47

Future work

add new extractors to the ensemble (e.g. spaCy)
increase the NexGenTV corpus training data
try using Wikidata embeddings rather than entity similarity features
use part-of-speech tag features
use LSTM/BiLSTM for surface features

47 of 47

Thank you!
