NERD for NexGenTV
Lorenzo Canale
What's NexGenTV?
Motivation
Objective
NexGenTV Framework
NERD Goal
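- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...
- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.
English gloss (character offsets refer to the French original):
- D. Pujadas: Emmanuel Macron wants to go back and leave the choice to the town halls. For you, this is one of the reforms of the five-year term that you stand by...
- B. Hamon: I admit I nearly fell off my chair when I learned that Emmanuel Macron was proposing to let schools set their own timetables.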
TRANSCRIPT
Surface form | Start char | End char | Type | Link (Wikidata) |
D. Pujadas | 2 | 11 | PERSON | Q2405595 |
...... | …... | …... | …... | …... |
Emmanuel Macron | 232 | 246 | PERSON | Q3052772 |
NAMED ENTITIES
Named entity recognition / type recognition (NER)
Named entity disambiguation (NED)
NERD extractors on the Web
Key idea: build an ensemble method that takes the extractors' responses as input and produces a final list of named entities (NEs). This list should outperform each single extractor's list in terms of F1 score, for both NER and NED.
How to combine the extractor responses?
NER issues
Te1 | Te2 | T |
PERSON | PERSON | PERSON |
LOCATION | PLACE | PLACE |
ORGANIZATION | COMPANY | ORGANIZATION |
OCCUPATION | JOBTYPE | ROLE |
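The table above can be read as a lookup from extractor-specific labels to a shared type set. Below is a minimal Python sketch of that idea, assuming a hand-written dictionary built from the rows above. The names `TYPE_MAP` and `normalize_type` are hypothetical, and "e1" and "e2" simply stand for the two extractors of the table.

```python
from typing import Optional

# Hypothetical mapping from (extractor, extractor-specific type) to the shared type T.
# The entries are copied from the table above; a real mapping needs one entry
# per label that each extractor can return.
TYPE_MAP = {
    ("e1", "PERSON"): "PERSON",
    ("e2", "PERSON"): "PERSON",
    ("e1", "LOCATION"): "PLACE",
    ("e2", "PLACE"): "PLACE",
    ("e1", "ORGANIZATION"): "ORGANIZATION",
    ("e2", "COMPANY"): "ORGANIZATION",
    ("e1", "OCCUPATION"): "ROLE",
    ("e2", "JOBTYPE"): "ROLE",
}


def normalize_type(extractor: str, raw_type: str) -> Optional[str]:
    """Return the shared type for an extractor label, or None if it is unmapped."""
    return TYPE_MAP.get((extractor, raw_type.upper()))
```

A manual table like this has to be maintained for every extractor, which is one reason to look at learned alternatives.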
How to combine the extractor responses?
NED issues
Surface form | Ke1 (Dbpedia) | Ke2 (Freebase) | K (Wikidata) |
Giuseppe Verdi |
State of the art : FOX
State of the art : FOX
Text -> 4 extractors:
- Stanford Named Entity Recognizer (S)
- Illinois Named Entity Tagger (I)
- Ottawa Baseline Information Extraction (O)
- Apache OpenNLP Name Finder (A)
For each extractor: Entities extractor types list -> Types mapper -> Entities normalized types list -> Tokenizer -> Tokens normalized types list -> One-Hot encoder
The four one-hot outputs are the input features for the MLP.
State of the art : FOX
Entities extractor types list:
Surface form | Start char | End char | Extractor type |
D. Pujadas | 2 | 11 | POLITICIAN |
...... | ...... | ...... | ...... |
Emmanuel Macron | 232 | 246 | POLITICIAN |
Entities normalized types list:
Surface form | Start char | End char | Normalized type |
D. Pujadas | 2 | 11 | PERSON |
...... | ...... | ...... | ...... |
Emmanuel Macron | 232 | 246 | PERSON |
Tokens normalized types list:
Token | Normalized type | Features vector |
D | PERSON | 0001 |
. | PERSON | 0001 |
Pujadas | PERSON | 0001 |
...... | ...... | ...... |
Emmanuel | PERSON | 0001 |
Macron | PERSON | 0001 |
Normalized types representation:
Type | Features vector |
PERSON | 0001 |
ORGANIZATION | 0010 |
PLACE | 0100 |
ROLE | 1000 |
Nan | 0000 |
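The last two steps of this slide (entities to tokens, then tokens to feature vectors) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular-expression tokenizer is a stand-in rather than the tokenizer FOX uses, the function names are hypothetical, and only tokens inside an entity are produced (tokens outside any entity would get the all-zero vector).

```python
import re

# Bit order chosen to match the normalized types representation above:
# PERSON is the rightmost bit, ROLE the leftmost.
TYPES = ["ROLE", "PLACE", "ORGANIZATION", "PERSON"]


def one_hot(norm_type):
    """Encode a normalized type as a 4-character bit string.

    A missing or unknown type yields "0000".
    """
    return "".join("1" if t == norm_type else "0" for t in TYPES)


def tokenize(text):
    """Stand-in tokenizer (an assumption): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokens_with_features(entities):
    """Spread each entity's normalized type over its tokens.

    entities: list of (surface form, normalized type) pairs.
    Returns a list of (token, normalized type, features vector) rows.
    """
    rows = []
    for surface, norm_type in entities:
        for token in tokenize(surface):
            rows.append((token, norm_type, one_hot(norm_type)))
    return rows


rows = tokens_with_features([("D. Pujadas", "PERSON"), ("Emmanuel Macron", "PERSON")])
```

With the two example entities, `rows` contains one row for each of the tokens D, ., Pujadas, Emmanuel, and Macron, each carrying the PERSON vector.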
State of the art : NERD
Types extractor list | Types ground truth list |
JOURNALIST | PERSON |
... | ... |
LOCATION | PLACE |
Learned mapping: Naive Bayes or k-Nearest Neighbour
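To make the idea of a learned mapping concrete, here is a deliberately simplified Python sketch. It assumes the extractor type is the only feature; in that special case an unsmoothed Naive Bayes classifier reduces to picking, for each extractor type, the ground-truth type it co-occurs with most often. The function name and the toy training pairs are hypothetical and are not taken from NERD.

```python
from collections import Counter, defaultdict


def learn_mapping(pairs):
    """Learn extractor type -> ground-truth type from aligned examples.

    pairs: iterable of (extractor type, ground-truth type) observed on the
    same entity. Each extractor type is mapped to its most frequent
    ground-truth type.
    """
    counts = defaultdict(Counter)
    for extractor_type, truth_type in pairs:
        counts[extractor_type][truth_type] += 1
    return {ext: c.most_common(1)[0][0] for ext, c in counts.items()}


# Toy data, invented for illustration only.
training_pairs = [
    ("JOURNALIST", "PERSON"),
    ("JOURNALIST", "PERSON"),
    ("JOURNALIST", "ROLE"),
    ("LOCATION", "PLACE"),
]
mapping = learn_mapping(training_pairs)
```

Unlike a hand-written table, this mapping adapts to how an extractor actually uses its labels on the annotated corpus.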
State of the art : NERD-ML
Features engineering
For my ensemble method I identified 4 different types of features:
- surface features
- type features
- entity features
- score features
Text -> Extractors (Alchemy, Dandelion, DBpedia Spotlight, TextRazor, Babelfy, Meaning Cloud, ADEL, Open Calais) -> Extractors responses -> features
Subtitles corpus -> Tokenizer -> Corpus tokens -> FastText (default parameters) -> FastText model 1 -> vector 1
French Wikipedia -> Tokenizer -> Wikipedia tokens -> FastText (default parameters) -> FastText model 2 -> vector 2
Text -> FastText model 1 -> Compute vector
Text -> FastText model 2 -> Compute vector
Each vector is a real-valued vector associated with a specific word/token:
- token embedding computed using the precomputed French model
- token embedding computed by training FastText on the whole subtitles corpus
- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...
- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.
Token | Surface features |
D | |
...... | …... |
Macron | |
TRANSCRIPT
TOKENS
2) Type features
Level 1
Level 2
=> entity: (surface form, start, end, type, link)
=> extractor
=> features vector representing the type for the named entity and the extractor
=> the number of different types for the level
PERSON -> 0010000
….
MOUNTAIN -> 1001000
2) Type features
- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...
- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.
Surface form | Type | Type features |
D. Pujadas | PERSON | 0010000 |
...... | …... | …... |
Emmanuel Macron | PERSON | 0010000 |
TRANSCRIPT
NAMED ENTITIES
2) Type features
- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...
- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.
The length of the type features vector depends on the extractor.
Surface form | Type | Type features |
D. Pujadas | PERSON | 001 |
...... | …... | …... |
Emmanuel Macron | PERSON | 001 |
NAMED ENTITIES
TRANSCRIPT
3) Entity features (structural graph)
I define the structural graph G as the graph formed by the Wikidata triples that have as predicate one of these properties:
- subclass of (e.g. human -> person)
- instance of (e.g. Richard Wagner -> human)
- part of (e.g. Earth’s core -> Earth)
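The definition above amounts to filtering triples by predicate. A minimal Python sketch follows, assuming triples are given as (subject, predicate, object) tuples with Wikidata property IDs as predicates: P31 (instance of), P279 (subclass of), and P361 (part of). Labels are used instead of Q-identifiers for readability, and the function names are hypothetical.

```python
from collections import defaultdict

# Wikidata property IDs of the three structural predicates.
STRUCTURAL_PROPERTIES = {
    "P31": "instance of",
    "P279": "subclass of",
    "P361": "part of",
}


def build_structural_graph(triples):
    """Keep only triples whose predicate is structural.

    Returns adjacency lists: subject -> list of (predicate label, object).
    """
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate in STRUCTURAL_PROPERTIES:
            graph[subject].append((STRUCTURAL_PROPERTIES[predicate], obj))
    return dict(graph)


def graph_nodes(graph):
    """Collect every node that appears as a subject or an object."""
    found = set(graph)
    for edges in graph.values():
        found.update(obj for _, obj in edges)
    return found


triples = [
    ("human", "P279", "person"),
    ("Richard Wagner", "P31", "human"),
    ("Earth's core", "P361", "Earth"),
    ("Richard Wagner", "P106", "composer"),  # occupation: not structural, dropped
]
G = build_structural_graph(triples)
```

Here the occupation triple is discarded, so "composer" never becomes a node of G.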
3) Entity features (structural graph)
Number of nodes = 1869003
subclass of
instance of
part of
Q35120
3) Entity features (similarity)
Q1511: Richard Wagner, German composer (composer)
Q62095: Johann Andreas Wagner, German scientist (scientist)
Similarity values shown: 0, 0.57, 0.17, 0, 0.9
4) Score features
Some extractors return scores (relevance, confidence) associated with the named entities.
- D. Pujadas : Emmanuel Macron veut revenir en arrière, laisser le choix aux mairies. Vous, c'est une des réformes du quinquennat que vous assumez...
- B. Hamon : J'avoue que je suis tombé un peu de ma chaise quand j'ai appris qu'Emmanuel Macron proposait la liberté pour les écoles de fixer les rythmes scolaires.
Surface form | Score relevance | Score confidence |
D. Pujadas | 0.8 | 0.77 |
...... | …... | …... |
Emmanuel Macron | 0.8 | 0.9 |
NAMED ENTITIES
TRANSCRIPT
NER task: ENNTR
Ensemble for NER
Ensemble Neural Network for Type Recognition (ENNTR)
Which is the right type?
Ensemble for NER (Input layers)
Ensemble for NER (Input layers)
Ensemble for NER (Alignment block)
alignment block
Ensemble for NER (Alignment block for IT)
Ensemble for NER (Ensemble block)
Evaluation for NER (metrics)
| |
PERSON | PERSON |
...... | ...... |
Nan | PERSON |
Evaluation for NER (metrics)
T1 Person 0 17 Marvin Lee Minsky
T2 Role 33 44 eye surgeon
T3 Role 45 51 father
T4 Person 53 58 Henry
T5 Role 76 82 mother
T6 Person 84 90 Fannie
Brat annotation file example
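For reference, entity lines like the ones above can be read with a few lines of Python. This sketch assumes the standard brat standoff layout, in which the ID, the "type start end" group, and the covered text are separated by tab characters (the tabs appear as spaces on the slide). It handles only contiguous spans, and the names `TextBound` and `parse_brat` are hypothetical.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat(ann: str) -> List[TextBound]:
    """Parse the text-bound annotations (IDs starting with 'T') of a .ann file."""
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, notes, and blank lines
        ann_id, label_and_span, text = line.split("\t", 2)
        label, start, end = label_and_span.split(" ")
        entities.append(TextBound(ann_id, label, int(start), int(end), text))
    return entities


sample = (
    "T1\tPerson 0 17\tMarvin Lee Minsky\n"
    "T2\tRole 33 44\teye surgeon\n"
    "T4\tPerson 53 58\tHenry\n"
)
entities = parse_brat(sample)
```

In these lines the end offset is exclusive, so `end - start` equals the length of the covered text.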
Evaluation for NER (OKE2016)
Evaluation for NER (NexGenTV corpus)
Evaluation for NER (features influence)
type: 40%
surface: 23%
entity: 20%
score: 17%
NED task: ENND
Ensemble for NED
Voting mechanism
Ensemble Neural Network for Disambiguation (ENND)
Is the candidate entity the right one?
Ensemble for NED (Input layers)
Ensemble for NED (Input layers)
Ensemble for NED
Ensemble for NED (Voting mechanism)
Candidate | Output neuron value |
C1,x | o1,x |
...... | …... |
CN,x | oN,x |
CANDIDATES for token x
oMax,x > 0.5: valid candidate
otherwise: rejected candidate
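The voting rule on this slide (take the candidate with the highest output neuron value and keep it only if that value exceeds 0.5) can be sketched as follows. The function name `vote` and the example scores are hypothetical; only the 0.5 threshold comes from the slide.

```python
def vote(candidates, threshold=0.5):
    """Pick the winning candidate entity for one token.

    candidates: list of (candidate entity, output neuron value) pairs.
    Returns the candidate with the highest value if that value is above
    the threshold, otherwise None (the candidate is rejected).
    """
    if not candidates:
        return None
    best, best_score = max(candidates, key=lambda pair: pair[1])
    return best if best_score > threshold else None


# Example values invented for illustration.
accepted = vote([("Q3052772", 0.87), ("Q1511", 0.02)])
rejected = vote([("Q3052772", 0.40), ("Q1511", 0.30)])
```

Returning None for a rejected candidate lets the caller treat the token as having no link.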
Evaluation for NED (metrics)
categorical cross entropy
| |
Q34 | Q3 |
...... | ...... |
Q12 | Q12 |
| |
0.87 | 1 |
...... | ...... |
0.02 | 0 |
Evaluation for NED (scores)
Summary
NER
NED
Future work
Thank you!