1 of 65

Understanding Digital Social Trace Data Via Information Extraction

Shubhanshu Mishra*, NLP Researcher

Twitter, Inc.

*Work presented here was done during my PhD at UIUC with multiple collaborators.

Content and views expressed in this tutorial are solely the responsibility of the presenter.

PyData MTL - Feb 25, 2021

https://shubhanshu.com/phd_thesis/

https://github.com/socialmediaie

2 of 65

Agenda

  • Information extraction for Social Media Text: Multi-task, Human-in-the-loop, Multi-lingual
  • Digital Social Trace Data (DSTD)


3 of 65


Information extraction tasks https://shubhanshu.com/phd_thesis

Corpus level

  • Key-phrase extraction
  • Taxonomy construction
  • Topic modelling

Document level

  Classification
    • Sentiment
    • Hate Speech
    • Sarcasm
    • Topic
    • Spam detection
    • Relation Extraction

Token level

  Tagging
    • Named entity
    • Part of speech

  Disambiguation
    • Word Sense
    • Entity Linking
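
As a quick illustration (not from the original slides), the document- and token-level tasks above can be tried with an off-the-shelf spaCy pipeline, assuming the en_core_web_sm model is installed:

    # Illustrative only: named entities and POS tags with spaCy; assumes
    # `python -m spacy download en_core_web_sm` has been run.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Roger Federer played an exhibition match in Montreal.")

    print([(ent.text, ent.label_) for ent in doc.ents])  # named entities
    print([(tok.text, tok.pos_) for tok in doc])         # part-of-speech tags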

4 of 65

Information extraction from Social Media Text


5 of 65

Why is social media data challenging?

  • Most publicly available models are trained on formally written corpora, such as newswire from Reuters or the New York Times; e.g., the CoNLL 2003 NER dataset is derived from Reuters news.
  • Most annotated NLP datasets are derived from these formal corpora.
  • There are far fewer corpora for social media platforms, e.g., Facebook, Twitter, Reddit.


6 of 65

Why is social media data challenging?

Social media text often has an inherent structure that provides context, along with noisy language, e.g. (see the tokenizer sketch after this list):

  • user mentions
  • hashtags
  • comment threads
  • less formally written language
  • many unseen words
  • typos, etc.
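
A small example of handling such text (assuming NLTK's TweetTokenizer, which is one common choice, not necessarily the talk's):

    # NLTK's TweetTokenizer keeps @mentions, #hashtags, URLs, and emoticons
    # intact; reduce_len shrinks elongated words like "sooooo" to "sooo".
    from nltk.tokenize import TweetTokenizer

    tok = TweetTokenizer(reduce_len=True)
    tweet = "@user sooooo excited for #PyDataMTL!! http://example.com :)"
    print(tok.tokenize(tweet))
    # ['@user', 'sooo', 'excited', 'for', '#PyDataMTL', '!', '!',
    #  'http://example.com', ':)']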


7 of 65

NER performance difference


Source: Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R., Petrak, J., & Bontcheva, K. (2015). Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2), 32–49. https://doi.org/10.1016/j.ipm.2014.10.006

8 of 65

We need social media specific models to perform well


9 of 65


10 of 65


11 of 65

Applications of information extraction

Index documents by entities


DocID | Entity        | Entity type  | WikiURL
1     | Roger Federer | Person       | URL1
2     | Facebook      | Organization | URL2
3     | Katy Perry    | Music Artist | URL3
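
A toy sketch of such an index in pure Python (the data mirrors the table above; no claim about the actual system):

    # Build an inverted index: entity -> documents mentioning it.
    from collections import defaultdict

    rows = [
        (1, "Roger Federer", "Person", "URL1"),
        (2, "Facebook", "Organization", "URL2"),
        (3, "Katy Perry", "Music Artist", "URL3"),
    ]

    index = defaultdict(list)
    for doc_id, entity, etype, url in rows:
        index[entity].append(doc_id)

    print(index["Katy Perry"])  # -> [3]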

12 of 65

Where is the data?

  • MetaCorpus: A list of curated annotated datasets for various social media tasks and social media platforms. https://github.com/socialmediaie/MetaCorpus
  • MetaCorpus benchmark: a selected set of datasets that can be used for benchmarking multi-task learning or NLP for social media data.


13 of 65

Tagging data


Super sense tagging

Part of speech tagging

Named entity recognition

Chunking

14 of 65

Classification data


Sentiment classification

Abusive content identification

Uncertainty indicator classification

15 of 65

Methods for Extracting Information from Social Media Data

Rules + Models

Multi-task learning

Multilingual learning

Active learning


16 of 65

Rule-based Twitter NER, Mishra & Diesner (2016). https://github.com/napsternxg/TwitterNER


Mishra, Shubhanshu, & Diesner, Jana (2016). Semi-supervised Named Entity Recognition in noisy-text. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) (pp. 203–212). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclweb.org/anthology/papers/W/W16/W16-3927/  

17 of 65

Evaluating Twitter NER (F1-score), Mishra & Diesner (2016).


Type         | TD   | TDTE
10-types     | 46.4 | 47.3
No-types     | 57.3 | 59.0
company      | 42.1 | 46.2
facility     | 37.5 | 34.8
geo-loc      | 70.1 | 71.0
movie        | 0.0  | 0.0
music artist | 7.6  | 5.8
other        | 31.7 | 32.4
person       | 51.3 | 52.2
product      | 10.0 | 9.3
sportsteam   | 31.3 | 32.0
tvshow       | 5.7  | 5.7

System Name                                         | Precision | Recall | F1 Score
Stanford CoreNLP                                    | 0.527     | 0.453  | 0.487
Stanford CoreNLP (with Twitter POS tagger)          | 0.527     | 0.453  | 0.487
TwitterNER                                          | 0.661     | 0.381  | 0.483
OSU NLP                                             | 0.524     | 0.405  | 0.457
Stanford CoreNLP (with caseless models)             | 0.547     | 0.392  | 0.457
Stanford CoreNLP (with truecasing)                  | 0.413     | 0.422  | 0.417
MITIE                                               | 0.340     | 0.457  | 0.390
spaCy                                               | 0.284     | 0.381  | 0.326
Polyglot                                            | 0.273     | 0.327  | 0.298
NLTK                                                | 0.149     | 0.332  | 0.206
TwitterNER (with Hege training data)                | 0.657     | 0.414  | 0.508
TwitterNER (with W-NUT 2017 training data)          | 0.675     | 0.405  | 0.506
TwitterNER (with Finin training data)               | 0.598     | 0.388  | 0.471
TwitterNER (with W-NUT 2017 and Hege training data) | 0.652     | 0.428  | 0.517
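
Entity-level scores like these are commonly computed with the seqeval package today (a hedged sketch, not necessarily the evaluation code used in the paper):

    # seqeval computes entity-level (not token-level) precision/recall/F1
    # from BIO-tagged sequences.
    from seqeval.metrics import f1_score, precision_score, recall_score

    y_true = [["B-person", "I-person", "O", "O", "B-geo-loc"]]
    y_pred = [["B-person", "O", "O", "O", "B-geo-loc"]]

    print(precision_score(y_true, y_pred),  # 0.5: one of two predicted spans is exact
          recall_score(y_true, y_pred),     # 0.5: one of two true spans recovered
          f1_score(y_true, y_pred))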

18 of 65

Multi-task multi-dataset learning, Mishra 2019, HT '19


MTL – Multi-task Stacked (Layered)

MD – Multi-dataset

MTS – Multi-task Shared

S – Single

Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

19 of 65

Evaluating MTL models, Mishra 2019, HT '19


Super sense tagging (micro f1)

Part of speech tagging (overall accuracy)

Named entity recognition (micro f1)

Chunking (micro f1)

Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

20 of 65

Training, Mishra 2019, HT '19

  • Sample mini-batches from a task/dataset
  • Compute the loss for the mini-batch
  • The individual loss is the log loss of a conditional random field
  • Update the model, except the ELMo module
  • During an epoch, go through all tasks and datasets
  • Train for a maximum number of epochs
  • Use early stopping to end training
  • Models trained on single datasets have prefix S
  • Models trained on all datasets of the same task have prefix MD
  • Models trained on all datasets have prefix MTS for multi-task models with a shared module, and MTL for stacked modules
  • Models with LR=1e-3 and no L2 regularization have suffix "*"
  • Models trained without NEEL2016 have suffix "#"

A minimal sketch of this training loop follows.
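
This sketch assumes PyTorch; the toy data, the linear encoder, and the cross-entropy loss (standing in for the CRF log loss) are placeholders, not the paper's implementation:

    # Sketch of the multi-dataset multi-task sampling loop described above.
    # Two toy "tasks", each a 3-class tagger over random features; a frozen
    # embedder (ELMo) is modeled by simply excluding it from the optimizer.
    import random
    import torch
    from torch import nn

    torch.manual_seed(0)
    tasks = {
        ("pos", "dataset_a"): [(torch.randn(8, 16), torch.randint(0, 3, (8,)))
                               for _ in range(10)],
        ("ner", "dataset_b"): [(torch.randn(8, 16), torch.randint(0, 3, (8,)))
                               for _ in range(10)],
    }

    shared = nn.Linear(16, 32)  # shared encoder module
    heads = nn.ModuleDict({f"{t}|{d}": nn.Linear(32, 3) for (t, d) in tasks})
    optimizer = torch.optim.Adam(
        list(shared.parameters()) + list(heads.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()  # stand-in for the CRF log loss

    for epoch in range(3):
        # During an epoch, visit mini-batches from all tasks and datasets.
        batches = [(key, b) for key, data in tasks.items() for b in data]
        random.shuffle(batches)
        for (task, dataset), (x, y) in batches:
            optimizer.zero_grad()
            logits = heads[f"{task}|{dataset}"](torch.relu(shared(x)))
            loss = loss_fn(logits, y)  # per-mini-batch task loss
            loss.backward()
            optimizer.step()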


21 of 65

Label embeddings (POS)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets
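
One way to produce plots like these (a sketch; the embedding matrix here is random, standing in for the learned label embeddings):

    # Cosine similarity between label embedding rows; the matrix is a random
    # stand-in for the embeddings inspected on this slide.
    import numpy as np

    labels = ["B-person", "I-person", "B-geo-loc", "O"]  # toy tag set
    rng = np.random.default_rng(0)
    label_emb = rng.normal(size=(len(labels), 64))

    unit = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    sim = unit @ unit.T                                  # cosine similarities
    for i, name in enumerate(labels):
        nearest = labels[np.argsort(-sim[i])[1]]         # closest other label
        print(name, "->", nearest)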

22 of 65

Label embeddings (NER)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

23 of 65

Label embeddings (chunking)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

24 of 65

Label embeddings (super-sense tagging)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

25 of 65

Label embeddings (super-sense tagging)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

26 of 65


27 of 65

Sentiment classification results https://github.com/socialmediaie/SocialMediaIE


28 of 65


Uncertainty indicators

Abusive content identification

29 of 65

Label embeddings


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

30 of 65


31 of 65


Multilingual transformer models for hate and abusive speech

https://github.com/socialmediaie/TRAC2020; fine-tuned models at https://huggingface.co/socialmediaie

2nd in 1/6 sub-tasks: ENG A

3rd in 3/6 sub-tasks: HIN A, B, and IBEN B

4th in 1/6 sub-tasks: ENG B

Computationally faster models with cheaper inference cost.
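
To try one of these fine-tuned models (a hedged sketch; the model id below is a placeholder, browse https://huggingface.co/socialmediaie for actual names):

    # The model id is a placeholder, not a real id; substitute one from the Hub.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "socialmediaie/<model-name>"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    inputs = tokenizer("example post", return_tensors="pt")
    probs = model(**inputs).logits.softmax(dim=-1)  # class probabilities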

32 of 65

SocialMediaIE

  • A library for training multi-task multi-dataset models
  • Pre-trained multi-task models for 20 social media classification and tagging tasks
  • Currently uses ELMo as the base embedding layer; a transformers-backed version is planned (feel free to contribute)
  • Returns multi-task outputs as JSON or a pandas DataFrame


33 of 65


Mishra, S., Prasad, S. & Mishra, S. Exploring Multi-Task Multi-Lingual Learning of Transformer Models for Hate Speech and Offensive Speech Identification in Social Media. SN COMPUT. SCI. 2, 72 (2021). https://doi.org/10.1007/s42979-021-00455-5

Code: https://github.com/socialmediaie/MTML_HateSpeech

34 of 65

Incremental learning of text classifiers with human-in-the-loop

  • Given a large unlabeled corpus, can we label it efficiently using fewer human annotations?
  • Can existing models be updated efficiently to work with new data?
  • Proposal:
    • Use active learning for data labeling
    • Use incremental learning algorithms for model updates (see the sketch after this list)
  • Highly applicable to social media data:
    • Streaming data
    • The model should adapt to new data
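
A sketch of the incremental-update idea, assuming scikit-learn (this is not the system from the paper):

    # Incremental updates with partial_fit (scikit-learn >= 1.1 for "log_loss").
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vec = HashingVectorizer(n_features=2**18)  # stateless, safe for streams
    clf = SGDClassifier(loss="log_loss")       # logistic regression via SGD

    classes = ["negative", "neutral", "positive"]
    texts = ["love it", "meh", "hate it"]
    labels = ["positive", "neutral", "negative"]
    clf.partial_fit(vec.transform(texts), labels, classes=classes)

    # Later, update on newly labeled tweets without retraining from scratch.
    clf.partial_fit(vec.transform(["worst app ever"]), ["negative"])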


Mishra, Shubhanshu, Jana Diesner, Jason Byrne, and Elizabeth Surbeck. 2015. “Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization.” In Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15, 323–25. New York, New York, USA: ACM Press. https://doi.org/10.1145/2700171.2791022.

35 of 65

Active Learning

  1. Given a model and unlabeled data
  2. Select samples from the unlabeled data to be annotated, based on a selection criterion
  3. Update the model with the collected labeled examples
  4. Repeat steps 2 and 3 until the desired accuracy is reached or the data is exhausted (a minimal loop is sketched below)
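
A minimal pool-based version of this loop with least-confidence sampling, matching the talk's settings (logistic regression, 100 queries per round) on toy data; an illustration, not the paper's code:

    # Pool-based active learning with least-confidence sampling.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 20))
    y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)  # toy oracle

    labeled = list(range(100))               # seed set
    pool = list(range(100, 2000))            # unlabeled pool
    clf = LogisticRegression(max_iter=1000)

    for _ in range(5):                       # talk: up to 100 rounds
        clf.fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[pool])
        uncertainty = 1 - proba.max(axis=1)  # selection criterion
        picked = np.argsort(-uncertainty)[:100]  # query 100 samples
        newly = [pool[i] for i in picked]
        labeled += newly                     # oracle provides their labels
        pool = [p for p in pool if p not in set(newly)]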


Mishra et al. (2015)

36 of 65


Mishra et al. (2015)

37 of 65


  • Each round queries 100 samples
  • The classifier is logistic regression with unigram and lexicon features
  • Max rounds is 100 (except for Clarin)

Datasets are ordered alphabetically; X and Y axes are not shared.

38 of 65


  • Evaluate only on the data not used for training
  • The top strategy queries efficiently and can help label the full dataset more quickly

Datasets are ordered alphabetically; X and Y axes are not shared.

39 of 65

List of social media IE tools


40 of 65

Other models for multi-task learning

  • Hierarchical labels or multi-label settings
    • Mishra, S., Prasad, S., & Mishra, S. (2020). Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (pp. 120–125). Marseille, France: European Language Resources Association (ELRA). Retrieved from https://www.aclweb.org/anthology/2020.trac-1.19. Code: https://github.com/socialmediaie/TRAC2020
    • Mishra, S., & Mishra, S. (2019). 3Idiots at HASOC 2019: Fine-tuning Transformer Neural Networks for Hate Speech Identification in Indo-European Languages. In FIRE (Working Notes) (pp. 208-213). Retrieved from http://ceur-ws.org/Vol-2517/T3-4.pdf. Code: https://github.com/socialmediaie/HASOC2019


41 of 65

Digital Social Trace Data (DSTD)


42 of 65


Information extraction https://shubhanshu.com/phd_thesis/

“Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.”

– (Sarawagi, 2008)

43 of 65

Information extraction from semi-structured data


However, not all data is unstructured. Many datasets of interest have some inherent structure imposed because of the data generating process.

44 of 65


Digital Social Trace Data https://shubhanshu.com/phd_thesis/

Digital Social Trace Data (DSTD) are digital activity traces generated by individuals as part of social interactions, such as interactions on social media websites like Twitter and Facebook, or in scientific publications.

Inspired by Digital Trace Data (Howison et al., 2011)

45 of 65


46 of 65

DSTD properties and examples


Property                                             | Social Media                                           | Scholarly data
Temporal information associated with each data item | Tweets are ordered by time                             | Papers are ordered by time
Connections between data items                       | Users author tweets; tweets are quoted in other tweets | Authors are connected to papers; papers cite other papers
Optional metadata associated with data items         | Likes, retweets, followers, location                   | Venue, topics, keywords
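
One possible way to encode these three properties in code (illustrative; the class and field names are my own, not from the DSTD software):

    # Illustrative DSTD item (Python 3.9+): temporal info, connections to
    # other items, and optional metadata, matching the properties above.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Item:
        item_id: str
        author: str
        created_at: datetime                            # temporal information
        links: list[str] = field(default_factory=list)  # quotes/replies/citations
        metadata: dict = field(default_factory=dict)    # likes, venue, etc.

    tweet = Item("t1", "user_a", datetime(2021, 2, 25),
                 links=["t0"], metadata={"likes": 12, "retweets": 3})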

47 of 65


Information extraction tasks https://shubhanshu.com/phd_thesis

Corpus level

  • Key-phrase extraction
  • Taxonomy construction
  • Topic modelling

Document level

  Classification
    • Sentiment
    • Hate Speech
    • Sarcasm
    • Topic
    • Spam detection
    • Relation Extraction

Token level

  Tagging
    • Named entity
    • Part of speech

  Disambiguation
    • Word Sense
    • Entity Linking

48 of 65

Improving sentiment classification using user and tweet metadata


Tweet:

  • Text
    • Sentiment
  • # of URLs, Hashtags, Mentions
  • Created at
  • Retweets
  • Replies
  • Is reply or quote?
  • User:
    • Created at
    • # Followers, Friends, Statuses
    • Is verified or has profile URL?

In the figure, the attributes above are split into those we use and those we discard.

Twitter Sentiment Corpora

  • Are our corpora biased to certain meta-data attributes?
  • Can those biases propagate into systems trained on these corpora?
  • How correlated are these meta-data features with the annotated sentiment?
  • Do these correlations hold outside of the annotated data for the same users?
  • Can sentiment classifiers exploit this bias to do well on these datasets?

Sentiment is usually labeled as positive, negative, or neutral.

Mishra, S., & Diesner, J. (2018, July 3). Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora. Proceedings of the 29th on Hypertext and Social Media. HT ’18: 29th ACM Conference on Hypertext and Social Media. https://doi.org/10.1145/3209542.3209562
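
To make the joint text+metadata setup concrete, a hedged scikit-learn sketch (column names and model are illustrative, not the paper's exact system):

    # Joint text + metadata sentiment classifier on a toy DataFrame.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    df = pd.DataFrame({
        "text": ["love this airline", "delayed again", "boarding now"],
        "n_hashtags": [0, 1, 0],
        "n_followers": [120, 4500, 80],
        "is_reply": [0, 1, 0],
        "sentiment": ["positive", "negative", "neutral"],
    })

    features = ColumnTransformer([
        ("text", TfidfVectorizer(), "text"),  # text features
        ("meta", "passthrough", ["n_hashtags", "n_followers", "is_reply"]),
    ])
    joint = Pipeline([("features", features),
                      ("clf", LogisticRegression(max_iter=1000))])
    joint.fit(df.drop(columns="sentiment"), df["sentiment"])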

49 of 65


Types of metadata and what they quantify

Quantification              | User metadata
Activity level              | # Statuses
Social interest of the user | # Friends
Social status               | # Followers
Account age                 | # days from account creation to posted tweet
Profile authenticity        | Presence of a URL on the profile, or whether the profile is verified

Quantification              | Tweet metadata
Topical variety             | # hashtags
Reference to sources        | # URLs
Reference to network        | # user mentions
Part of conversation        | Is reply
Reference to conversation   | Is quote

50 of 65


User metadata v/s Sentiment

51 of 65


Using metadata features can improve sentiment classification

Dataset    | Model | Acc. | P    | R    | F1   | KLD
Airline    | meta  | 63.9 | 61.1 | 36.8 | 32.8 | 0.663
Airline    | text  | 80.0 | 78.3 | 69.0 | 72.4 | 0.026
Airline    | joint | 80.3 | 76.6 | 72.0 | 74.0 | 0.005
Clarin     | meta  | 45.7 | 42.1 | 40.9 | 37.8 | 0.238
Clarin     | text  | 64.1 | 64.5 | 62.2 | 62.9 | 0.012
Clarin     | joint | 64.1 | 64.0 | 63.0 | 63.4 | 0.000
GOP        | meta  | 59.9 | 54.3 | 37.5 | 33.6 | 0.776
GOP        | text  | 66.4 | 63.7 | 51.4 | 53.6 | 0.111
GOP        | joint | 65.6 | 59.9 | 56.5 | 57.8 | 0.006
Healthcare | meta  | 56.7 | 36.8 | 39.4 | 35.1 | 0.717
Healthcare | text  | 64.2 | 71.3 | 49.5 | 51.0 | 0.233
Healthcare | joint | 65.6 | 61.6 | 58.3 | 59.5 | 0.007
Obama      | meta  | 39.3 | 37.0 | 35.1 | 32.0 | 0.282
Obama      | text  | 61.5 | 64.8 | 59.7 | 60.9 | 0.030
Obama      | joint | 62.3 | 63.2 | 61.6 | 62.2 | 0.002
SemEval    | meta  | 47.0 | 31.0 | 36.2 | 33.0 | 0.845
SemEval    | text  | 65.5 | 64.1 | 58.0 | 59.5 | 0.032
SemEval    | joint | 65.6 | 62.7 | 60.5 | 61.4 | 0.001

Boost in F1 is mostly due to better recall. Precision is lower.

MESC might be helping with tweets with high OOV rates, where text classifiers don’t do well.

52 of 65

Visualizing DSTDs using Social Communication Temporal Graphs

  • Social communication on social media, community forums, and Wikipedia is intrinsically networked as well as temporal.
  • Common ways of visualizing social communication are line or scatter plots, where the reader can see the temporally changing aspect of either posts or comments.
  • Both line and scatter plots hide the connected nature of social communication.


53 of 65

Approach

Social Communication Temporal Graphs (SCTG) overcome this issue:

  • They use two vertically stacked scatter plots.
  • Points in the top plot are connected to points in the bottom plot.
  • Both plots share the same time axis.
  • Each point in either plot can be styled using additional attributes, adding at least three more data dimensions per point: vertical height, color, and shape.
  • Using an SCTG, the reader can observe both the connected and the temporal nature of social communication (a minimal sketch follows this list).
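
A minimal matplotlib sketch of this layout (my reading of SCTG, not the authors' tool):

    # Two vertically stacked scatter plots sharing a time axis; lines connect
    # core items (posts) in the top panel to child items (comments) below.
    import matplotlib.pyplot as plt
    from matplotlib.patches import ConnectionPatch

    posts = [(1, "p1"), (4, "p2")]                # (time, post id)
    comments = [(2, "p1"), (3, "p1"), (5, "p2")]  # (time, parent post id)

    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.scatter([t for t, _ in posts], [1] * len(posts))
    bottom.scatter([t for t, _ in comments], [1] * len(comments))

    post_time = {pid: t for t, pid in posts}
    for t, parent in comments:  # component links across the two panels
        fig.add_artist(ConnectionPatch(
            xyA=(post_time[parent], 1), coordsA=top.transData,
            xyB=(t, 1), coordsB=bottom.transData, color="gray"))
    bottom.set_xlabel("time")
    plt.show()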


54 of 65

Components

  • Core communication components: a user in a feed, or a specific post
  • Child components: associated posts by a user, or comments on a post
  • Component links: core communication is linked to its children
  • Activity timeline: quantifies the temporal activity measurement
  • Tooltips: provide additional data about each component
  • Component heights, scaling, and color: visualize additional metadata


55 of 65

Visualizing FB groups


56 of 65


57 of 65

Visualize the temporal network of social media data in your browser


58 of 65

Inspiration from scholarly data using the DSTD framework


59 of 65

Paper = Set of concepts


Title: Geographic assessment of breast cancer screening by towns, zip codes, and census tracts. [PMID: 18019960]

No. of Authors: 7

Year: 2000

https://go.illinois.edu/legolas

MeSH concepts: Middle Aged; Mass Screening; Humans; Massachusetts; Female; Ultrasonography, Mammary; Breast Neoplasms; Geographic Information Systems; Cluster Analysis
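
A small sketch of the "paper = set of concepts" idea, comparing two hypothetical papers by MeSH-set overlap (Jaccard similarity is my choice of measure here, not necessarily the paper's):

    # Papers as sets of MeSH concepts, compared by Jaccard overlap.
    paper_a = {"Mass Screening", "Humans", "Female", "Breast Neoplasms",
               "Geographic Information Systems", "Cluster Analysis"}
    paper_b = {"Humans", "Female", "Breast Neoplasms", "Mammography"}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    print(jaccard(paper_a, paper_b))  # shared conceptual coverage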

60 of 65

Temporal profile of a concept


Phases in a concept's temporal profile: burn-in, accelerated growth, decelerated growth, constant growth.

Example concept: HIV. Age of concept: 17 years, with 142K prior papers.
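
A toy sketch of computing a concept's age and prior-paper count from a hypothetical table of paper-concept annotations:

    # Toy table of paper-concept annotations with publication years.
    import pandas as pd

    annotations = pd.DataFrame({
        "pmid": [1, 2, 3, 4],
        "concept": ["HIV", "HIV", "HIV", "Humans"],
        "year": [1983, 1990, 2000, 2000],
    })

    first_year = annotations.groupby("concept")["year"].min()

    def concept_age(concept, year):
        return year - first_year[concept]  # years since first appearance

    n_prior = ((annotations["concept"] == "HIV")
               & (annotations["year"] < 2000)).sum()
    print(concept_age("HIV", 2000), n_prior)  # -> 17 2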

Mishra, Shubhanshu, and Vetle I. Torvik. 2016. “Quantifying Conceptual Novelty in the Biomedical Literature.” D-Lib Magazine : The Magazine of the Digital Library Forum 22 (9–10). https://doi.org/10.1045/september2016-mishra.

61 of 65

Conceptual novelty

Motivation

  • Identify concept-level novelty of articles and authors
  • Concepts are identified using MeSH terms and pairs of MeSH terms in MEDLINE

Application

  • Temporal profile of concepts in MEDLINE
  • Temporal profile of author novelty
  • Visualizing novelty of paper, author over time: http://abel.ischool.illinois.edu/gimli/


62 of 65

Conceptual expertise

  • Conceptual coverage of a paper
  • Collaborations:
    • Which authors contribute?
    • Does author position give us a clue?
  • Expertise over time and careers


Mishra, Shubhanshu, Brent D. Fegley, Jana Diesner, and Vetle I. Torvik. 2018. "Expertise as an Aspect of Author Contributions." In Workshop on Informetric and Scientometric Research (SIG/MET). Vancouver.

63 of 65

Thank you


64 of 65

References

  • Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1094364_V1 
  • Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for text classification in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1917934_V1
  • Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for sequence prediction in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0934773_V1 
  • Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929


65 of 65

References

  • Mishra, Shubhanshu, & Diesner, Jana (2016). Semi-supervised Named Entity Recognition in noisy-text. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) (pp. 203–212). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclweb.org/anthology/papers/W/W16/W16-3927/  
  • Mishra, Shubhanshu, Diesner, Jana, Byrne, Jason, & Surbeck, Elizabeth (2015). Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization. In Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15 (pp. 323–325). New York, New York, USA: ACM Press. https://doi.org/10.1145/2700171.2791022 
