The importance of entities
Meltwater Budapest, April 2016
Babak Rasolzadeh, Director of Data Science Research
What is Meltwater?
Why?
own brand
competitors
leads
partners
product reviews
own industry
What?
Uses Meltwater to find out about new instances of vandalism and break-ins. Often, the victim is in need of services.
Uses Meltwater to help determine how public perception of certain ingredient chemicals will influence adoption & sales.
Uses Meltwater to be alerted when certain patents are about to expire in target markets.
Uses Meltwater to monitor the performance and popularity of news anchors and programs.
Uses Meltwater social listening to estimate and prevent infrastructure attacks.
How?
NLP & Data Science at Meltwater
Unstructured Document Stream
Pipeline
Enrichments
Search/Storage
Enriched Documents
High Performance Indexes
Processing Services
API Layer
APPS
Backup Storage
Raw Documents
15 supported languages in pipeline
(EN, DE, SV, NO, FI, ZH, JP, FR, ES, DA, NL, PT, AR, IT, HI)
Typical enrichments
What else besides NLP?
Diagram labels: documents (DOC3, DOC8), realtime recommender engine, concepts (concept 1, concept 2, concept 3)
"British American Tobacco" or "British American Tobbaco" or (BAT near tobacco) or "英美煙草" or (("Lucky Strike" or "Dunhill" or "Pall Mall") near/15 cigarette*)
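The `near/15` operator in the query above can be read as "both terms occur within 15 tokens of each other", and the trailing `*` as a prefix wildcard. The following is a minimal sketch of that reading, not Meltwater's query engine: the function names (`tokenize`, `positions`, `near`), the regex tokenisation and the restriction to single-token terms are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase and split on non-word characters (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())

def positions(tokens, term):
    """Token indices where a term matches; a trailing * is a prefix wildcard."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        return [i for i, tok in enumerate(tokens) if tok.startswith(prefix)]
    return [i for i, tok in enumerate(tokens) if tok == term]

def near(text, left, right, distance=15):
    """True if some match of `left` is within `distance` tokens of some match of `right`."""
    tokens = tokenize(text)
    left_pos = positions(tokens, left)
    right_pos = positions(tokens, right)
    return any(abs(i - j) <= distance for i in left_pos for j in right_pos)

if __name__ == "__main__":
    print(near("Dunhill launches a new cigarette brand", "Dunhill", "cigarette*"))
    print(near("Dunhill sells luxury leather goods", "Dunhill", "cigarette*"))
```

Multi-word phrases such as "Lucky Strike" would need phrase matching on top of this, which the sketch leaves out.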
Machine Learning Terminology
Challenges with Data Science (NLP) at scale
Pipeline
Enrichments
SV
EN
DE
POS
NER
Knowledge Base Strategy
Entities, entities, entities
don - July 2015
What are Named Entities (NE)?
→ I know this man. He might be Charles.
→ He lives in Stockholm. He is Swedish.
What is Named Entity Recognition (NER)?
John lives in Stockholm. He works at Ericsson.
Categories of {PER, LOC, ORG, MISC, PROD}
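One common way to represent NER output for the example sentence is a BIO tag per token (B = beginning of an entity, I = inside, O = outside). The slides do not say which encoding is used, so the BIO scheme, the `to_bio` helper and the hand-written spans below are assumptions for illustration only.

```python
def to_bio(tokens, spans):
    """Convert entity spans into per-token BIO tags.

    spans: list of (start, end_exclusive, category) over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, category in spans:
        tags[start] = "B-" + category
        for i in range(start + 1, end):
            tags[i] = "I-" + category
    return tags

# The example sentence from the slide, tokenized by hand.
tokens = ["John", "lives", "in", "Stockholm", ".", "He", "works", "at", "Ericsson", "."]
# Hand-labelled spans using the categories listed on the slide.
spans = [(0, 1, "PER"), (3, 4, "LOC"), (8, 9, "ORG")]
tags = to_bio(tokens, spans)

if __name__ == "__main__":
    for token, tag in zip(tokens, tags):
        print(token, tag)
```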
What is NER not?
(i.e. not easy!)
Why NER?
Pepsi spooks Coke with
this Halloween themed ad.
Entity-specific sentiment analysis, a.k.a. ELS
So what about Social…?
How to do NER? (state-of-the-art)
Supervised Learning
Training data
NER pipeline
Gazetteers help
Extensive lists of names for a specific category
Disadvantages
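A gazetteer is typically used as a lookup feature for the tagger rather than as the tagger itself. The sketch below assumes tiny hand-made lists and a hypothetical `gazetteer_features` helper; it also shows one well-known limitation, namely that an ambiguous name such as "Washington" can sit in more than one list.

```python
# Toy gazetteers; real lists would hold many thousands of names per category.
GAZETTEERS = {
    "PER": {"john", "charles", "washington"},
    "LOC": {"stockholm", "london", "washington"},
    "ORG": {"ericsson", "meltwater"},
}

def gazetteer_features(token):
    """Return one boolean feature per category: is the token in that list?"""
    lowered = token.lower()
    return {"in_gaz_" + category: lowered in names
            for category, names in GAZETTEERS.items()}

if __name__ == "__main__":
    print(gazetteer_features("Stockholm"))
    print(gazetteer_features("Washington"))  # fires for both PER and LOC
```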
Brown clustering - motivation
Let’s say we want to estimate the likelihood of the bi-gram "to Shanghai", without having seen this in a training set.
The system can obtain a good estimate if it can cluster "Shanghai" with other city names (like "London", "Beijing"), then make its estimate based on the likelihood of phrases such as "to London", "to Beijing" and "to Denver".
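The motivation above corresponds to a class-based bigram model, P(w2 | w1) ≈ P(c2 | c1) · P(w2 | c2), where c is the cluster of a word. The sketch below assumes a toy corpus and a hand-assigned CITY cluster; Brown clustering would learn the clusters from data, which this example does not do.

```python
from collections import Counter

# Toy corpus in which "to shanghai" never occurs as a bigram.
corpus = ("we fly to london . we fly to beijing . they moved to denver . "
          "shanghai is large . london is large").split()

# Hand-assigned cluster (an assumption; Brown clustering would induce it).
cluster_of = {"london": "CITY", "beijing": "CITY", "denver": "CITY", "shanghai": "CITY"}

def cluster(word):
    """Words without an assigned cluster form their own singleton class."""
    return cluster_of.get(word, word)

def class_bigram_prob(prev, word, tokens):
    """Estimate P(word | prev) as P(class(word) | class(prev)) * P(word | class(word))."""
    classes = [cluster(t) for t in tokens]
    class_unigrams = Counter(classes)
    class_bigrams = Counter(zip(classes, classes[1:]))
    word_counts = Counter(tokens)
    c_prev, c_word = cluster(prev), cluster(word)
    transition = class_bigrams[(c_prev, c_word)] / class_unigrams[c_prev]
    emission = word_counts[word] / class_unigrams[c_word]
    return transition * emission

if __name__ == "__main__":
    print(class_bigram_prob("to", "shanghai", corpus))
```

A plain word-level bigram count would give "to shanghai" a probability of zero on this corpus, while the class-based estimate is non-zero because "to" is followed by other members of the same cluster.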
Brown clustering (1)
Brown clustering (2)
Brown clustering (3)
Hmm...easy?
Disambiguation
What is the entity category of “Washington”?
Different languages
Studying the linguistic properties of a language is important!
Editorial vs. Social
Challenges in Social NER
→ a solution which considers social characteristics of text
Challenges in Social NER
Examples of noisy data
Solution (1)
Adapting existing features to social properties
(The POS tagger used for editorial NER performs really poorly on social documents.)
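One way to adapt features to social text is to rely on surface cues that do not need a POS tagger at all. The slides do not list the actual features, so everything in this sketch (hashtag and mention flags, word shape, collapsing of repeated characters, and the helper names) is a hypothetical illustration of the idea.

```python
import re

def word_shape(token):
    """Map characters to a coarse shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return shape

def normalize(token):
    """Lowercase and collapse characters repeated 3+ times ("Sooooo" -> "soo")."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token.lower())

def social_token_features(token):
    """Surface features for one token of a social-media post."""
    bare = token.lstrip("#@")
    return {
        "is_hashtag": token.startswith("#"),
        "is_mention": token.startswith("@"),
        "is_url": token.lower().startswith(("http://", "https://")),
        "shape": word_shape(bare),
        "normalized": normalize(bare),
    }

if __name__ == "__main__":
    for tok in ["#Stockholm", "@meltwater", "Sooooo", "iPhone6"]:
        print(tok, social_token_features(tok))
```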
Solution (2)
Weight (importance) of each CRF feature
Results
Ritter, A. et al. Named entity recognition in tweets: An experimental study. EMNLP ’11, pages 1524–1534.
What about sentiment….?
Document Level Sentiment - how it works
Inter-annotator agreement ~80%*
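The slide does not say how the ~80% figure was measured. As a sketch under that caveat, the code below computes raw percent agreement between two annotators and, as a commonly used alternative that corrects for chance, Cohen's kappa. The labels are made-up toy data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy annotations for ten documents (hypothetical data).
annotator_1 = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu"]
annotator_2 = ["pos", "neu", "neg", "neu", "neg", "pos", "neu", "pos", "pos", "neu"]

if __name__ == "__main__":
    print(percent_agreement(annotator_1, annotator_2))
    print(cohens_kappa(annotator_1, annotator_2))
```

With three labels, the chance-corrected value is noticeably lower than the raw agreement, which is one reason a classifier accuracy should be read against the human ceiling rather than against 100%.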
Document Level Sentiment - how it works
Machine Learning Magic
Supervised learning
Naive Bayes - BernoulliNB, GaussianNB, MultinomialNB
Support Vector Machines - LinearSVM, RbfSVM
Maximum Entropy Model - GIS, IIS, MEGAM, TADM
MLP - RecurrentNN
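To make the first item in the list concrete, here is a minimal multinomial Naive Bayes classifier written from scratch with Laplace smoothing. It is a sketch, not the production model: the training sentences, the whitespace tokenisation and the function names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Collect per-class document and word counts from labelled documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha, "n_docs": len(docs)}

def predict(model, text):
    """Return the class with the highest log prior + sum of log likelihoods."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_label in model["class_docs"].items():
        total = sum(model["word_counts"][label].values())
        score = math.log(n_label / model["n_docs"])
        for tok in text.lower().split():
            if tok in model["vocab"]:  # ignore words never seen in training
                count = model["word_counts"][label][tok]
                score += math.log((count + model["alpha"])
                                  / (total + model["alpha"] * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set.
docs = ["great product love it", "terrible service very bad",
        "love the great support", "bad terrible experience"]
labels = ["positive", "negative", "positive", "negative"]
model = train_multinomial_nb(docs, labels)

if __name__ == "__main__":
    print(predict(model, "great support"))
    print(predict(model, "terrible and bad"))
```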
Document Level Sentiment - current status
~60-70% (depending on language)
Not too terrible, considering that human performance is at best ~80%...
...but why is it so hard?
Document Level Sentiment - how it’s used
Document Level Sentiment - the problem
Negative
Neutral
Document Level Sentiment - the problem
“Those numbers underline a growing gap between McDonald's and today's fast-food customers. It will only get wider with another year's worth of the same uninspired fare that has made McDonald's customers easy pickings for Panera Bread, Chick-fil-A, Chipotle Mexican Grill and others.”
Negative
Positive
Does not make sense for our industry!
Entity Level Sentiment (ELS)
Entity Level Sentiment - motivation
Idea:
Identify the sentiment towards each particular entity in a text!
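The idea can be sketched as two steps: find the entities (the NER step, taken as given here) and score only the sentences in which each entity appears. The word lists, the sentence-level scoring rule and the function names below are assumptions for illustration; a real system would use a trained model rather than a lexicon.

```python
# Tiny hypothetical polarity lexicons.
POSITIVE = {"great", "impressive", "reliable"}
NEGATIVE = {"disappointing", "recall", "faulty"}

def label(score):
    """Map a numeric score to the three labels used on the slides."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

def entity_level_sentiment(text, entities):
    """Score each entity using only the sentences that mention it.

    `entities` is assumed to come from an upstream NER step.
    """
    scores = {entity: 0 for entity in entities}
    for sentence in text.split("."):
        words = sentence.lower().split()
        polarity = (sum(w in POSITIVE for w in words)
                    - sum(w in NEGATIVE for w in words))
        for entity in entities:
            if entity.lower() in words:
                scores[entity] += polarity
    return {entity: label(score) for entity, score in scores.items()}

text = ("BMW delivered impressive results. Mercedes held its annual meeting. "
        "Toyota announced a faulty airbag recall.")

if __name__ == "__main__":
    print(entity_level_sentiment(text, ["BMW", "Mercedes", "Toyota"]))
```

A single document-level label for this text would hide the fact that it is positive for one company and negative for another, which is the gap the entity-level approach is meant to close.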
Entity Level Sentiment - how it works
NER
BMW: Positive
Mercedes: Neutral
Toyota: Negative
…
Entity Level Sentiment - how it works
Entity1: Positive
Entity2: Neutral
Entity3: Negative
…
E1: Positive
E2: Neutral
E3: Negative
Entity Level Sentiment - use case
Entity Level Sentiment - current status
(~45%)
The holy grail: the Graph Knowledge Base
Entities + Relationships
As the types of entities and their relationships grow, so does the capacity to infer insights that depend on connectivity. Eventually one can answer questions that would not be possible with separate datasets alone!
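A question that depends on connectivity might be "who works for a competitor of this company?", which needs both a competitor relation and an employment relation. The sketch below stores relationships as subject-relation-object triples; the entity names, relation names and helper functions are all made up for illustration and say nothing about the actual KB schema.

```python
from collections import defaultdict

# Hypothetical triples: (subject, relation, object).
TRIPLES = [
    ("AcmeCorp", "competes_with", "Globex"),
    ("Alice", "works_at", "Globex"),
    ("Bob", "works_at", "AcmeCorp"),
    ("Alice", "wrote_about", "WidgetX"),
    ("WidgetX", "made_by", "AcmeCorp"),
]

def build_index(triples):
    """Index triples in both directions so relations can be followed either way."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
        index[(obj, "inv_" + relation)].add(subject)
    return index

def competitor_employees(index, company):
    """Two-hop query: company -> competitors -> people who work there."""
    rivals = index[(company, "competes_with")] | index[(company, "inv_competes_with")]
    people = set()
    for rival in rivals:
        people |= index[(rival, "inv_works_at")]
    return people

index = build_index(TRIPLES)

if __name__ == "__main__":
    print(competitor_employees(index, "AcmeCorp"))
```

Neither the competitor list nor the employment list can answer the question alone; the answer only appears once the two are joined through shared entities.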
KB Architecture
Unstructured Document Stream
Pipeline
Enrichments
Graph Search
Enriched Documents
High Performance Indexes
Processing Services
API Layer
Knowledge Base (Graph)
I/O
External Data Providers
Updates/subscriptions
Lookups
APPS
Backup Storage
Raw Documents
Why is it hard?
Composing the KB
Data Acquisition trade-offs
High volume
High quality
Cheap
Manual data acquisition
Special crawlers,
Smart algorithms
Acquisitions, partnerships
low quality
expensive
low volume
Composing the KB - Scalability
Scalability Requirements - next steps
Companies ~ 100 million worldwide
People ~ 500 million (including media influencers)
Products ~ 500 million
~1 billion entities, plus all the connections between them
→ billions of nodes, trillions of edges!
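The figures above can be checked with a little arithmetic. The entity counts come from the slide; taking "trillions of edges" to mean 10^12 is an assumption made only to get an order of magnitude for the implied average degree.

```python
# Entity counts as stated on the slide.
ENTITY_COUNTS = {
    "companies": 100_000_000,
    "people": 500_000_000,
    "products": 500_000_000,
}

def total_entities(counts):
    """Sum the per-type entity counts."""
    return sum(counts.values())

def average_degree(nodes, edges):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * edges / nodes

if __name__ == "__main__":
    nodes = total_entities(ENTITY_COUNTS)
    assumed_edges = 10**12  # assumption: lower end of "trillions of edges"
    print(nodes)
    print(average_degree(nodes, assumed_edges))
```

Under that assumption each node has on the order of a couple of thousand connections on average, which is what makes storage and traversal the hard part rather than the node count itself.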
Composing the KB - New features
Improve entity search - company NED
Improve entity search - person NED
Robert Gates: 22nd Secretary of Defense
William Henry Gates III: former CEO & cofounder of Microsoft
“Who is Mr. Gates?”
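One simple way to approach named entity disambiguation (NED) is to compare the words around the mention with a short description of each candidate in the knowledge base. The sketch below uses the two candidate descriptions from the slide and plain word overlap as the score; the scoring rule and function names are assumptions, not the method used in the product.

```python
import re

# Candidate entities and descriptions as given on the slide.
CANDIDATES = {
    "Robert Gates": "22nd Secretary of Defense",
    "William Henry Gates III": "former CEO & cofounder of Microsoft",
}

def word_set(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def disambiguate(mention_context, candidates):
    """Pick the candidate whose description shares the most words with the context."""
    context = word_set(mention_context)
    scores = {name: len(context & word_set(description))
              for name, description in candidates.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    print(disambiguate("Mr. Gates stepped down as CEO of Microsoft", CANDIDATES))
    print(disambiguate("Mr. Gates testified on defense spending as Secretary", CANDIDATES))
```

Function words such as "of" add noise to the overlap count; a fuller version would weight terms or use the graph relations around each candidate.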
Emerging competition
Map influencer network
influencer score ~ e.g. PageRank
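PageRank can be computed with a short power iteration. In the sketch below an edge from A to B is assumed to mean "A mentions or follows B"; that interpretation, the toy network and the parameter values are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph: dict mapping every node to the list of nodes it links to.
    Nodes with no outgoing links spread their rank evenly over all nodes.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical influencer network.
network = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}

if __name__ == "__main__":
    scores = pagerank(network)
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))
```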