Understanding Digital Social Trace Data Via Information Extraction
Shubhanshu Mishra1*, NLP Researcher
1 Twitter, Inc.
*Work presented here was done during my PhD at UIUC with multiple collaborators
Content and views expressed in this tutorial are solely the responsibility of the presenter.
PyData MTL - Feb 25, 2021
Agenda
10/2/2020
Information extraction tasks https://shubhanshu.com/phd_thesis
Corpus level
Key-phrase extraction
Taxonomy construction
Topic modelling
Document level
Classification
Token level
Tagging
Disambiguation
Information extraction from Social Media Text
Why is social media data challenging?
Most NLP models are trained on well-edited text, e.g., the CoNLL 2003 NER dataset is derived from Reuters newswire.
Why is social media data challenging?
Social media text often has an inherent structure that provides context, e.g., hashtags, user mentions, URLs, and reply threads.
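This structure can be pulled out with simple patterns. A minimal sketch (the regexes below are simplified illustrations, not the full Twitter tokenization grammar):

```python
import re

# Simplified patterns for tweet structure; real tweet grammars are richer.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def tweet_structure(text):
    """Return the hashtags, user mentions, and URLs found in a tweet."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "urls": URL.findall(text),
    }

tweet = "Watching #AusOpen with @rogerfederer fans https://example.com/live"
print(tweet_structure(tweet))
```

These structural cues (hashtags as topics, mentions as the social network, URLs as external context) are exactly the context that newswire-trained models never see.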
NER performance difference
Source: Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R., Petrak, J., & Bontcheva, K. (2015). Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2), 32–49. https://doi.org/10.1016/j.ipm.2014.10.006
We need social-media-specific models to perform well
Text classification https://github.com/socialmediaie/SocialMediaIE
Sequence tagging https://github.com/socialmediaie/SocialMediaIE
Applications of information extraction
Index documents by entities
DocID | Entity | Entity type | WikiURL |
1 | Roger Federer | Person | URL1 |
2 | | Organization | URL2 |
3 | Katy Perry | Music Artist | URL3 |
Where is the data?
Tagging data
Super sense tagging
Part of speech tagging
Named entity recognition
Chunking
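All of these tagging tasks produce per-token label sequences, commonly in BIO format, which are then decoded into spans. A minimal decoder sketch (my own illustration, not tied to any specific dataset above):

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity_text, entity_type) spans.

    A span starts at B-X (or a stray I-X of a new type) and extends
    over the following I-X tags of the same type.
    """
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and ctype != tag[2:]):
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # "O" tag closes any open span
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["Roger", "Federer", "lives", "in", "Switzerland"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```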
Classification data
Sentiment classification
Abusive content identification
Uncertainty indicator classification
Methods for Extracting Information from Social Media Data
Rules + Models
Multi-task learning
Multilingual learning
Active learning
Rule based Twitter NER Mishra & Diesner (2016). https://github.com/napsternxg/TwitterNER
Mishra, Shubhanshu, & Diesner, Jana (2016). Semi-supervised Named Entity Recognition in noisy-text. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) (pp. 203–212). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclweb.org/anthology/papers/W/W16/W16-3927/
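One core ingredient of rule-based NER systems like this is gazetteer matching. The sketch below is a simplified illustration of greedy longest-match gazetteer lookup, not the actual TwitterNER implementation (its gazetteer contents and rules are far richer):

```python
# Toy gazetteer mapping lowercase token n-grams to entity types;
# entries are illustrative, not from TwitterNER's real gazetteers.
GAZETTEER = {
    ("new", "york"): "geo-loc",
    ("katy", "perry"): "music artist",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def gazetteer_tag(tokens):
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(MAX_LEN, 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                etype = GAZETTEER[key]
                tags[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + etype
                i += n
                break
        else:
            i += 1
    return tags

print(gazetteer_tag(["Concert", "by", "Katy", "Perry", "in", "New", "York"]))
```

In practice such gazetteer matches are combined with regex rules and semi-supervised features, as described in the paper above.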
Evaluating Twitter NER (F1-score) Mishra & Diesner (2016).
Entity type | TD | TDTE |
10-types (overall) | 46.4 | 47.3 |
No-types (overall) | 57.3 | 59.0 |
company | 42.1 | 46.2 |
facility | 37.5 | 34.8 |
geo-loc | 70.1 | 71.0 |
movie | 0.0 | 0.0 |
music artist | 7.6 | 5.8 |
other | 31.7 | 32.4 |
person | 51.3 | 52.2 |
product | 10.0 | 9.3 |
sportsteam | 31.3 | 32.0 |
tvshow | 5.7 | 5.7 |
System Name | Precision | Recall | F1 Score |
Stanford CoreNLP | 0.527 | 0.453 | 0.487 |
Stanford CoreNLP (with Twitter POS tagger) | 0.527 | 0.453 | 0.487 |
TwitterNER | 0.661 | 0.381 | 0.483 |
OSU NLP | 0.524 | 0.405 | 0.457 |
Stanford CoreNLP (with caseless models) | 0.547 | 0.392 | 0.457 |
Stanford CoreNLP (with truecasing) | 0.413 | 0.422 | 0.417 |
MITIE | 0.340 | 0.457 | 0.390 |
spaCy | 0.284 | 0.381 | 0.326 |
Polyglot | 0.273 | 0.327 | 0.298 |
NLTK | 0.149 | 0.332 | 0.206 |

System Name | Precision | Recall | F1 Score |
TwitterNER (with Hege training data) | 0.657 | 0.414 | 0.508 |
TwitterNER (with W-NUT 2017 training data) | 0.675 | 0.405 | 0.506 |
TwitterNER (with Finin training data) | 0.598 | 0.388 | 0.471 |
TwitterNER (with W-NUT 2017 and Hege training data) | 0.652 | 0.428 | 0.517 |
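The F1 scores above are entity-level: a prediction counts only if both the span and its type match the gold annotation. A minimal sketch of that computation (span representation is my own choice for illustration):

```python
def span_f1(gold, pred):
    """Entity-level precision/recall/F1 over sets of (start, end, type) spans.

    A span counts as a true positive only on an exact boundary-and-type match.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 2, "PER"), (4, 5, "LOC")]
pred = [(0, 2, "PER"), (3, 5, "LOC")]  # LOC boundary is off by one
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

The strictness of exact matching is why the recall numbers above are so much lower than on newswire: a tagger that finds the entity but misjudges its boundary gets no credit.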
Multi-task multi-dataset learning Mishra 2019, HT '19
MTL – Multi task Stacked (Layered)
MD – Multi-dataset
MTS – Multi task Shared
S - Single
Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
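A key mechanic in multi-dataset training is interleaving batches from different datasets so a shared encoder sees all tasks while each task keeps its own output head. A minimal round-robin sketch (my own simplification, not the paper's actual training loop):

```python
from itertools import cycle, islice

def round_robin_batches(datasets, batch_size):
    """Yield (task_name, batch) pairs by cycling over named datasets.

    Sketch of multi-dataset training order: batches are interleaved
    across tasks; iteration stops when the smallest dataset runs out.
    """
    iterators = {name: iter(data) for name, data in datasets.items()}
    for name in cycle(datasets):
        batch = list(islice(iterators[name], batch_size))
        if len(batch) < batch_size:
            return
        yield name, batch

datasets = {"ner": list(range(4)), "pos": list(range(100, 104))}
for task, batch in round_robin_batches(datasets, 2):
    print(task, batch)
```

In a real setup each yielded batch would be routed through the shared encoder and the head matching `task`; sampling proportional to dataset size is a common alternative to strict round-robin.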
Evaluating MTL models Mishra 2019, HT '19
Super sense tagging (micro f1)
Part of speech tagging (overall accuracy)
Named entity recognition (micro f1)
Chunking (micro f1)
Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
Training Mishra 2019, HT '19
Label embeddings (POS)
Label embeddings (NER)
Label embeddings (chunking)
Label embeddings (super-sense tagging)
Web-based UI https://github.com/socialmediaie/SocialMediaIE
Sentiment classification results https://github.com/socialmediaie/SocialMediaIE
Uncertainty indicators
Abusive content identification
Label embeddings
Web-based UI https://github.com/socialmediaie/SocialMediaIE
Multilingual transformer models for hate and abusive speech
https://github.com/socialmediaie/TRAC2020 - Fine-tuned models at https://huggingface.co/socialmediaie
2nd in 1/6 sub-tasks: ENG A
3rd in 3/6 sub-tasks: HIN A, B, and IBEN B
4th in 1/6 sub-tasks: ENG B
Computationally faster and cheaper inference.
Social Media IE: https://github.com/socialmediaie/SocialMediaIE
Mishra, S., Prasad, S. & Mishra, S. Exploring Multi-Task Multi-Lingual Learning of Transformer Models for Hate Speech and Offensive Speech Identification in Social Media. SN COMPUT. SCI. 2, 72 (2021). https://doi.org/10.1007/s42979-021-00455-5
Incremental learning of text classifiers with human-in-the-loop
Mishra, Shubhanshu, Jana Diesner, Jason Byrne, and Elizabeth Surbeck. 2015. “Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization.” In Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15, 323–25. New York, New York, USA: ACM Press. https://doi.org/10.1145/2700171.2791022.
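Human-in-the-loop and active learning setups typically route the examples the current model is least sure about to the annotator. A common uncertainty-sampling heuristic, sketched below as an illustration (this is not the selection rule from the cited paper):

```python
def least_confident(probabilities, k=2):
    """Pick indices of the k examples whose top class probability is lowest.

    Standard least-confidence uncertainty sampling: low max-probability
    means the model is unsure, so the example is worth a human label.
    """
    confidences = [max(p) for p in probabilities]
    return sorted(range(len(confidences)), key=lambda i: confidences[i])[:k]

# Predicted class distributions for 4 unlabeled tweets (pos/neg/neutral).
probs = [
    [0.9, 0.05, 0.05],   # confident
    [0.4, 0.35, 0.25],   # uncertain
    [0.34, 0.33, 0.33],  # most uncertain
    [0.7, 0.2, 0.1],
]
print(least_confident(probs, k=2))  # indices to send to the human annotator
```

After the human labels these examples, the classifier is retrained (or updated incrementally) and the loop repeats.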
Active Learning
Mishra et al. (2015)
Data ordered alphabetically and X and Y axes are not shared.
List of social media IE tools
Other models for multi-task learning
Digital Social Trace Data (DSTD)
Information extraction https://shubhanshu.com/phd_thesis/
“Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.”
– (Sarawagi, 2008)
Information extraction from semi-structured data
However, not all data is unstructured. Many datasets of interest have some inherent structure imposed by the data-generating process.
Digital Social Trace Data https://shubhanshu.com/phd_thesis/
Digital Social Trace Data (DSTD) are digital activity traces generated by individuals as part of social interactions, such as interactions on social media websites like Twitter and Facebook, or in scientific publications.
Inspired by Digital Trace Data (Howison et al., 2011)
DSTD properties and examples
Property | Social Media | Scholarly data |
Temporal information associated with each item of the data | Tweets ordered by time | Scholarly papers ordered by time |
Presence of connections between various data items | User authors tweets, tweets are quoted in other tweets | Authors connected to papers, papers cite other papers |
Optionally associated meta-data for data items | Likes, retweets, followers, location | Venue, topics, keywords |
Information extraction tasks https://shubhanshu.com/phd_thesis
Corpus level
Key-phrase extraction
Taxonomy construction
Topic modelling
Document level
Classification
Token level
Tagging
Disambiguation
Improving sentiment classification using user and tweet metadata
Tweet example: what we use vs. what we discard.
Twitter Sentiment Corpora
Sentiment is usually identified as positive, negative, and neutral.
Mishra, S., & Diesner, J. (2018, July 3). Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora. Proceedings of the 29th on Hypertext and Social Media. HT ’18: 29th ACM Conference on Hypertext and Social Media. https://doi.org/10.1145/3209542.3209562
Types of metadata and what they quantify
Quantification | User metadata |
Activity level | # Statuses |
Social Interest of the user | # Friends |
Social status | # Followers |
Account age | # days from account creation to posted tweet |
Profile authenticity | Presence of URL on the profile or if the profile is verified |
Quantification | Tweet metadata |
Topical variety | # hashtags |
Reference to sources | # URLs |
Reference to network | # user mentions |
Part of conversation | Is reply |
Reference to conversation | Is quote |
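The two tables above translate directly into a numeric feature vector per tweet. A minimal sketch; the field names below mimic a Twitter-like payload and are illustrative, and the actual feature set in the paper may differ:

```python
from datetime import datetime

def metadata_features(user, tweet):
    """Build a numeric feature vector from user- and tweet-level metadata."""
    account_age_days = (tweet["created_at"] - user["created_at"]).days
    return [
        user["statuses_count"],           # activity level
        user["friends_count"],            # social interest of the user
        user["followers_count"],          # social status
        account_age_days,                 # account age
        int(bool(user.get("url"))),       # profile authenticity
        len(tweet["hashtags"]),           # topical variety
        len(tweet["urls"]),               # reference to sources
        len(tweet["user_mentions"]),      # reference to network
        int(tweet["is_reply"]),           # part of conversation
        int(tweet["is_quote"]),           # reference to conversation
    ]

user = {"statuses_count": 1200, "friends_count": 300, "followers_count": 4500,
        "created_at": datetime(2015, 1, 1), "url": "https://example.com"}
tweet = {"created_at": datetime(2018, 1, 1), "hashtags": ["#nlp"], "urls": [],
         "user_mentions": ["@acl"], "is_reply": False, "is_quote": True}
print(metadata_features(user, tweet))
```

In a joint model, this vector is concatenated with (or fed alongside) the text representation before the sentiment classifier.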
User metadata vs. sentiment
Using metadata features can improve sentiment classification
Dataset | Model | Acc. | P | R | F1 | KLD |
Airline | meta | 63.9 | 61.1 | 36.8 | 32.8 | 0.663 |
Airline | text | 80.0 | 78.3 | 69.0 | 72.4 | 0.026 |
Airline | joint | 80.3 | 76.6 | 72.0 | 74.0 | 0.005 |
Clarin | meta | 45.7 | 42.1 | 40.9 | 37.8 | 0.238 |
Clarin | text | 64.1 | 64.5 | 62.2 | 62.9 | 0.012 |
Clarin | joint | 64.1 | 64.0 | 63.0 | 63.4 | 0.000 |
GOP | meta | 59.9 | 54.3 | 37.5 | 33.6 | 0.776 |
GOP | text | 66.4 | 63.7 | 51.4 | 53.6 | 0.111 |
GOP | joint | 65.6 | 59.9 | 56.5 | 57.8 | 0.006 |
Healthcare | meta | 56.7 | 36.8 | 39.4 | 35.1 | 0.717 |
Healthcare | text | 64.2 | 71.3 | 49.5 | 51.0 | 0.233 |
Healthcare | joint | 65.6 | 61.6 | 58.3 | 59.5 | 0.007 |
Obama | meta | 39.3 | 37.0 | 35.1 | 32.0 | 0.282 |
Obama | text | 61.5 | 64.8 | 59.7 | 60.9 | 0.030 |
Obama | joint | 62.3 | 63.2 | 61.6 | 62.2 | 0.002 |
SemEval | meta | 47.0 | 31.0 | 36.2 | 33.0 | 0.845 |
SemEval | text | 65.5 | 64.1 | 58.0 | 59.5 | 0.032 |
SemEval | joint | 65.6 | 62.7 | 60.5 | 61.4 | 0.001 |
The boost in F1 is mostly due to better recall; precision is slightly lower.
MESC might be helping with tweets with high OOV rates, where text-only classifiers do not perform well.
Visualizing DSTDs using Social Communication Temporal Graphs
Approach
Social Communication Temporal Graphs (SCTGs) overcome this issue.
Components
Visualizing FB groups
Tweet sentiment over time https://shubhanshu.com/social-comm-temporal-graph/
Visualize temporal networks of social media data in your browser
Inspiration from scholarly data using the DSTD framework
Paper = Set of concepts
Title: Geographic assessment of breast cancer screening by towns, zip codes, and census tracts. [PMID: 18019960]
No. of Authors: 7
Year: 2000
Middle Aged
Mass Screening
Humans
Massachusetts
Female
Ultrasonography, Mammary
Breast Neoplasms
Geographic Information Systems
Cluster Analysis
Temporal profile of a concept
Growth phases plotted against the age of a concept: burn in, accelerated growth, decelerated growth, constant growth (example concept: HIV).
Mishra, Shubhanshu, and Vetle I. Torvik. 2016. “Quantifying Conceptual Novelty in the Biomedical Literature.” D-Lib Magazine : The Magazine of the Digital Library Forum 22 (9–10). https://doi.org/10.1045/september2016-mishra.
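A concept's temporal profile can be computed from its yearly mention counts. The sketch below is a simplified illustration of the idea (age since first appearance, year-over-year growth), not the novelty measure from the paper:

```python
def concept_profile(first_year, counts_by_year):
    """Compute each year's concept age and a coarse growth phase.

    Age is years since the concept first appeared; the phase label
    compares consecutive yearly counts. Labels here are illustrative.
    """
    years = sorted(counts_by_year)
    profile = []
    for prev, year in zip(years, years[1:]):
        delta = counts_by_year[year] - counts_by_year[prev]
        phase = ("accelerated" if delta > 0
                 else "constant" if delta == 0
                 else "decelerated")
        profile.append((year - first_year, phase))
    return profile

# Toy yearly mention counts for a concept (e.g. a MeSH term).
counts = {1985: 10, 1986: 40, 1987: 40, 1988: 25}
print(concept_profile(1985, counts))
```

On real data these yearly counts come from indexing term occurrences across the corpus, and the phase boundaries are estimated from the full curve rather than single-year deltas.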
Conceptual novelty
Conceptual expertise
Mishra, Shubhanshu, Brent D. Fegley, Jana Diesner, and Vetle I. Torvik. 2018. “Expertise as an Aspect of Author Contributions.” In WORKSHOP ON INFORMETRIC AND SCIENTOMETRIC RESEARCH (SIG/MET). Vancouver.
Thank you