1 of 65

Understanding Digital Social Trace Data Via Information Extraction

Shubhanshu Mishra*, NLP Researcher

Twitter, Inc.

*Work presented here was done during my PhD at UIUC with multiple collaborators.

Content and views expressed in this tutorial are solely the responsibility of the presenter.

PyData MTL - Feb 25, 2021

https://shubhanshu.com/phd_thesis/

https://github.com/socialmediaie

2 of 65

Agenda

  • Information extraction for Social Media Text: Multi-task, Human-in-the-loop, Multi-lingual
  • Digital Social Trace Data (DSTD)


3 of 65


Information extraction tasks https://shubhanshu.com/phd_thesis

Corpus level

  • Key-phrase extraction
  • Taxonomy construction
  • Topic modelling

Document level

  Classification
    • Sentiment
    • Hate Speech
    • Sarcasm
    • Topic
    • Spam detection
    • Relation Extraction

Token level

  Tagging
    • Named entity
    • Part of speech

  Disambiguation
    • Word Sense
    • Entity Linking
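
As a quick illustration (not from the original slides), the document- and token-level tasks above can be tried with an off-the-shelf spaCy pipeline, assuming the en_core_web_sm model is installed:

    # Illustrative only: named entities and POS tags with spaCy; assumes
    # `python -m spacy download en_core_web_sm` has been run.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Roger Federer played an exhibition match in Montreal.")

    print([(ent.text, ent.label_) for ent in doc.ents])  # named entities
    print([(tok.text, tok.pos_) for tok in doc])         # part-of-speech tags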

4 of 65

Information extraction from Social Media Text


5 of 65

Why is social media data challenging?

  • Most publicly available models are trained on formally written corpora, such as newswire from Reuters or the New York Times; e.g., the CoNLL 2003 NER dataset is derived from Reuters news.
  • Most annotated NLP datasets are derived from these formal corpora.
  • There are far fewer corpora for social media platforms, e.g., Facebook, Twitter, Reddit.


6 of 65

Why is social media data challenging?

Social media text often has an inherent structure that provides context, along with noisy language, e.g. (see the tokenizer sketch after this list):

  • user mentions
  • hashtags
  • comment threads
  • less formally written language
  • many unseen words
  • typos, etc.
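
A small example of handling such text (assuming NLTK's TweetTokenizer, which is one common choice, not necessarily the talk's):

    # NLTK's TweetTokenizer keeps @mentions, #hashtags, URLs, and emoticons
    # intact; reduce_len shrinks elongated words like "sooooo" to "sooo".
    from nltk.tokenize import TweetTokenizer

    tok = TweetTokenizer(reduce_len=True)
    tweet = "@user sooooo excited for #PyDataMTL!! http://example.com :)"
    print(tok.tokenize(tweet))
    # ['@user', 'sooo', 'excited', 'for', '#PyDataMTL', '!', '!',
    #  'http://example.com', ':)']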


7 of 65

NER performance difference


Source: Derczynski, L., Maynard, D., Rizzo, G., van Erp, M., Gorrell, G., Troncy, R., Petrak, J., & Bontcheva, K. (2015). Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2), 32–49. https://doi.org/10.1016/j.ipm.2014.10.006

8 of 65

We need social media specific models to perform well


9 of 65


10 of 65


11 of 65

Applications of information extraction

Index documents by entities


DocID | Entity        | Entity type  | WikiURL
1     | Roger Federer | Person       | URL1
2     | Facebook      | Organization | URL2
3     | Katy Perry    | Music Artist | URL3
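
A toy sketch of such an index in pure Python (the data mirrors the table above; no claim about the actual system):

    # Build an inverted index: entity -> documents mentioning it.
    from collections import defaultdict

    rows = [
        (1, "Roger Federer", "Person", "URL1"),
        (2, "Facebook", "Organization", "URL2"),
        (3, "Katy Perry", "Music Artist", "URL3"),
    ]

    index = defaultdict(list)
    for doc_id, entity, etype, url in rows:
        index[entity].append(doc_id)

    print(index["Katy Perry"])  # -> [3]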

12 of 65

Where is the data?

  • MetaCorpus: A list of curated annotated datasets for various social media tasks and social media platforms. https://github.com/socialmediaie/MetaCorpus
  • MetaCorpus benchmark: a selected set of datasets that can be used for benchmarking multi-task learning or NLP for social media data.


13 of 65

Tagging data


Super sense tagging

Part of speech tagging

Named entity recognition

Chunking

14 of 65

Classification data


Sentiment classification

Abusive content identification

Uncertainty indicator classification

15 of 65

Methods for Extracting Information from Social Media Data

Rules + Models

Multi-task learning

Multilingual learning

Active learning


16 of 65

Rule-based Twitter NER, Mishra & Diesner (2016). https://github.com/napsternxg/TwitterNER


Mishra, Shubhanshu, & Diesner, Jana (2016). Semi-supervised Named Entity Recognition in noisy-text. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) (pp. 203–212). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclweb.org/anthology/papers/W/W16/W16-3927/  

17 of 65

Evaluating Twitter NER (F1-score), Mishra & Diesner (2016).


Type         | TD   | TDTE
10-types     | 46.4 | 47.3
No-types     | 57.3 | 59.0
company      | 42.1 | 46.2
facility     | 37.5 | 34.8
geo-loc      | 70.1 | 71.0
movie        | 0.0  | 0.0
music artist | 7.6  | 5.8
other        | 31.7 | 32.4
person       | 51.3 | 52.2
product      | 10.0 | 9.3
sportsteam   | 31.3 | 32.0
tvshow       | 5.7  | 5.7

System Name                                         | Precision | Recall | F1 Score
Stanford CoreNLP                                    | 0.527     | 0.453  | 0.487
Stanford CoreNLP (with Twitter POS tagger)          | 0.527     | 0.453  | 0.487
TwitterNER                                          | 0.661     | 0.381  | 0.483
OSU NLP                                             | 0.524     | 0.405  | 0.457
Stanford CoreNLP (with caseless models)             | 0.547     | 0.392  | 0.457
Stanford CoreNLP (with truecasing)                  | 0.413     | 0.422  | 0.417
MITIE                                               | 0.340     | 0.457  | 0.390
spaCy                                               | 0.284     | 0.381  | 0.326
Polyglot                                            | 0.273     | 0.327  | 0.298
NLTK                                                | 0.149     | 0.332  | 0.206
TwitterNER (with Hege training data)                | 0.657     | 0.414  | 0.508
TwitterNER (with W-NUT 2017 training data)          | 0.675     | 0.405  | 0.506
TwitterNER (with Finin training data)               | 0.598     | 0.388  | 0.471
TwitterNER (with W-NUT 2017 and Hege training data) | 0.652     | 0.428  | 0.517
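
Entity-level scores like these are commonly computed with the seqeval package today (a hedged sketch, not necessarily the evaluation code used in the paper):

    # seqeval computes entity-level (not token-level) precision/recall/F1
    # from BIO-tagged sequences.
    from seqeval.metrics import f1_score, precision_score, recall_score

    y_true = [["B-person", "I-person", "O", "O", "B-geo-loc"]]
    y_pred = [["B-person", "O", "O", "O", "B-geo-loc"]]

    print(precision_score(y_true, y_pred),  # 0.5: one of two predicted spans is exact
          recall_score(y_true, y_pred),     # 0.5: one of two true spans recovered
          f1_score(y_true, y_pred))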

18 of 65

Multi-task multi-dataset learning, Mishra 2019, HT '19


MTL – Multi-task Stacked (Layered)

MD – Multi-dataset

MTS – Multi-task Shared

S – Single

Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

19 of 65

Evaluating MTL models, Mishra 2019, HT '19


Super sense tagging (micro f1)

Part of speech tagging (overall accuracy)

Named entity recognition (micro f1)

Chunking (micro f1)

Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929

20 of 65

Training, Mishra 2019, HT '19

  • Sample mini-batches from a task/dataset
  • Compute the loss for the mini-batch
  • The individual loss is the log loss of a conditional random field
  • Update the model, except the ELMo module
  • During an epoch, go through all tasks and datasets
  • Train for a maximum number of epochs
  • Use early stopping to end training
  • Models trained on single datasets have prefix S
  • Models trained on all datasets of the same task have prefix MD
  • Models trained on all datasets have prefix MTS for multi-task models with a shared module, and MTL for stacked modules
  • Models with LR=1e-3 and no L2 regularization have suffix "*"
  • Models trained without NEEL2016 have suffix "#"

A minimal sketch of this training loop follows.
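
This sketch assumes PyTorch; the toy data, the linear encoder, and the cross-entropy loss (standing in for the CRF log loss) are placeholders, not the paper's implementation:

    # Sketch of the multi-dataset multi-task sampling loop described above.
    # Two toy "tasks", each a 3-class tagger over random features; a frozen
    # embedder (ELMo) is modeled by simply excluding it from the optimizer.
    import random
    import torch
    from torch import nn

    torch.manual_seed(0)
    tasks = {
        ("pos", "dataset_a"): [(torch.randn(8, 16), torch.randint(0, 3, (8,)))
                               for _ in range(10)],
        ("ner", "dataset_b"): [(torch.randn(8, 16), torch.randint(0, 3, (8,)))
                               for _ in range(10)],
    }

    shared = nn.Linear(16, 32)  # shared encoder module
    heads = nn.ModuleDict({f"{t}|{d}": nn.Linear(32, 3) for (t, d) in tasks})
    optimizer = torch.optim.Adam(
        list(shared.parameters()) + list(heads.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()  # stand-in for the CRF log loss

    for epoch in range(3):
        # During an epoch, visit mini-batches from all tasks and datasets.
        batches = [(key, b) for key, data in tasks.items() for b in data]
        random.shuffle(batches)
        for (task, dataset), (x, y) in batches:
            optimizer.zero_grad()
            logits = heads[f"{task}|{dataset}"](torch.relu(shared(x)))
            loss = loss_fn(logits, y)  # per-mini-batch task loss
            loss.backward()
            optimizer.step()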


21 of 65

Label embeddings (POS)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets
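
One way to produce plots like these (a sketch; the embedding matrix here is random, standing in for the learned label embeddings):

    # Cosine similarity between label embedding rows; the matrix is a random
    # stand-in for the embeddings inspected on this slide.
    import numpy as np

    labels = ["B-person", "I-person", "B-geo-loc", "O"]  # toy tag set
    rng = np.random.default_rng(0)
    label_emb = rng.normal(size=(len(labels), 64))

    unit = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    sim = unit @ unit.T                                  # cosine similarities
    for i, name in enumerate(labels):
        nearest = labels[np.argsort(-sim[i])[1]]         # closest other label
        print(name, "->", nearest)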

22 of 65

Label embeddings (NER)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

23 of 65

Label embeddings (chunking)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

24 of 65

Label embeddings (super-sense tagging)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

25 of 65

Label embeddings (super-sense tagging)


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

26 of 65


27 of 65

Sentiment classification results https://github.com/socialmediaie/SocialMediaIE


28 of 65


Uncertainty indicators

Abusive content identification

29 of 65

Label embeddings


  • The MDMT model learns similarities between labels without this knowledge being explicitly encoded in the model
  • This leads to consistent relationships between similar labels across datasets

30 of 65


31 of 65


Multilingual transformer models for hate and abusive speech

https://github.com/socialmediaie/TRAC2020; fine-tuned models at https://huggingface.co/socialmediaie

2nd in 1/6 sub-tasks: ENG A

3rd in 3/6 sub-tasks: HIN A, B, and IBEN B

4th in 1/6 sub-tasks: ENG B

Computationally faster models with cheaper inference cost.
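
To try one of these fine-tuned models (a hedged sketch; the model id below is a placeholder, browse https://huggingface.co/socialmediaie for actual names):

    # The model id is a placeholder, not a real id; substitute one from the Hub.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "socialmediaie/<model-name>"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    inputs = tokenizer("example post", return_tensors="pt")
    probs = model(**inputs).logits.softmax(dim=-1)  # class probabilities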

32 of 65

SocialMediaIE

  • A library for training multi-task multi-dataset models
  • Pre-trained multi-task models for 20 social media classification and tagging tasks
  • Currently uses ELMo as the base embedding layer; a transformers-backed version is planned (feel free to contribute)
  • Returns multi-task outputs as JSON or a pandas DataFrame


33 of 65


Mishra, S., Prasad, S. & Mishra, S. Exploring Multi-Task Multi-Lingual Learning of Transformer Models for Hate Speech and Offensive Speech Identification in Social Media. SN COMPUT. SCI. 2, 72 (2021). https://doi.org/10.1007/s42979-021-00455-5

Code: https://github.com/socialmediaie/MTML_HateSpeech

34 of 65

Incremental learning of text classifiers with human-in-the-loop

  • Given a large unlabeled corpus, can we label it efficiently using fewer human annotations?
  • Can existing models be updated efficiently to work with new data?
  • Proposal:
    • Use active learning for data labeling
    • Use incremental learning algorithms for model updates (see the sketch after this list)
  • Highly applicable to social media data:
    • Streaming data
    • The model should adapt to new data
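
A sketch of the incremental-update idea, assuming scikit-learn (this is not the system from the paper):

    # Incremental updates with partial_fit (scikit-learn >= 1.1 for "log_loss").
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vec = HashingVectorizer(n_features=2**18)  # stateless, safe for streams
    clf = SGDClassifier(loss="log_loss")       # logistic regression via SGD

    classes = ["negative", "neutral", "positive"]
    texts = ["love it", "meh", "hate it"]
    labels = ["positive", "neutral", "negative"]
    clf.partial_fit(vec.transform(texts), labels, classes=classes)

    # Later, update on newly labeled tweets without retraining from scratch.
    clf.partial_fit(vec.transform(["worst app ever"]), ["negative"])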


Mishra, Shubhanshu, Jana Diesner, Jason Byrne, and Elizabeth Surbeck. 2015. “Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization.” In Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15, 323–25. New York, New York, USA: ACM Press. https://doi.org/10.1145/2700171.2791022.

35 of 65

Active Learning

  1. Given a model and unlabeled data
  2. Select samples from the unlabeled data to be annotated, based on a selection criterion
  3. Update the model with the collected labeled examples
  4. Repeat steps 2 and 3 until the desired accuracy is reached or the data is exhausted (a minimal loop is sketched below)
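
A minimal pool-based version of this loop with least-confidence sampling, matching the talk's settings (logistic regression, 100 queries per round) on toy data; an illustration, not the paper's code:

    # Pool-based active learning with least-confidence sampling.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 20))
    y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)  # toy oracle

    labeled = list(range(100))               # seed set
    pool = list(range(100, 2000))            # unlabeled pool
    clf = LogisticRegression(max_iter=1000)

    for _ in range(5):                       # talk: up to 100 rounds
        clf.fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[pool])
        uncertainty = 1 - proba.max(axis=1)  # selection criterion
        picked = np.argsort(-uncertainty)[:100]  # query 100 samples
        newly = [pool[i] for i in picked]
        labeled += newly                     # oracle provides their labels
        pool = [p for p in pool if p not in set(newly)]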


Mishra et al. (2015)

36 of 65


Mishra et al. (2015)

37 of 65


  • Each round queries 100 samples
  • The classifier is logistic regression with unigram and lexicon features
  • Max rounds is 100 (except for Clarin)

Datasets are ordered alphabetically; X and Y axes are not shared.

38 of 65


  • Evaluate only on the data not used for training
  • The top strategy queries efficiently and can help label the full dataset more quickly

Datasets are ordered alphabetically; X and Y axes are not shared.

39 of 65

List of social media IE tools


40 of 65

Other models for multi-task learning

  • Hierarchical labels or multi-label settings
    • Mishra, S., Prasad, S., & Mishra, S. (2020). Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (pp. 120–125). Marseille, France: European Language Resources Association (ELRA). Retrieved from https://www.aclweb.org/anthology/2020.trac-1.19. Code: https://github.com/socialmediaie/TRAC2020
    • Mishra, S., & Mishra, S. (2019). 3Idiots at HASOC 2019: Fine-tuning Transformer Neural Networks for Hate Speech Identification in Indo-European Languages. In FIRE (Working Notes) (pp. 208-213). Retrieved from http://ceur-ws.org/Vol-2517/T3-4.pdf. Code: https://github.com/socialmediaie/HASOC2019


41 of 65

Digital Social Trace Data (DSTD)


42 of 65


Information extraction https://shubhanshu.com/phd_thesis/

“Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.”

– (Sarawagi, 2008)

43 of 65

Information extraction from semi-structured data


However, not all data is unstructured. Many datasets of interest have some inherent structure imposed because of the data generating process.

44 of 65


Digital Social Trace Data https://shubhanshu.com/phd_thesis/

Digital Social Trace Data (DSTD) are digital activity traces generated by individuals as part of social interactions, such as interactions on social media websites like Twitter and Facebook, or in scientific publications.

Inspired by Digital Trace Data (Howison et al., 2011)

45 of 65


46 of 65

DSTD properties and examples


Property                                             | Social Media                                           | Scholarly data
Temporal information associated with each data item | Tweets are ordered by time                             | Papers are ordered by time
Connections between data items                       | Users author tweets; tweets are quoted in other tweets | Authors are connected to papers; papers cite other papers
Optional metadata associated with data items         | Likes, retweets, followers, location                   | Venue, topics, keywords
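
One possible way to encode these three properties in code (illustrative; the class and field names are my own, not from the DSTD software):

    # Illustrative DSTD item (Python 3.9+): temporal info, connections to
    # other items, and optional metadata, matching the properties above.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Item:
        item_id: str
        author: str
        created_at: datetime                            # temporal information
        links: list[str] = field(default_factory=list)  # quotes/replies/citations
        metadata: dict = field(default_factory=dict)    # likes, venue, etc.

    tweet = Item("t1", "user_a", datetime(2021, 2, 25),
                 links=["t0"], metadata={"likes": 12, "retweets": 3})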

47 of 65


Information extraction tasks https://shubhanshu.com/phd_thesis

Corpus level

  • Key-phrase extraction
  • Taxonomy construction
  • Topic modelling

Document level

  Classification
    • Sentiment
    • Hate Speech
    • Sarcasm
    • Topic
    • Spam detection
    • Relation Extraction

Token level

  Tagging
    • Named entity
    • Part of speech

  Disambiguation
    • Word Sense
    • Entity Linking

48 of 65

Improving sentiment classification using user and tweet metadata


Tweet:

  • Text
    • Sentiment
  • # of URLs, Hashtags, Mentions
  • Created at
  • Retweets
  • Replies
  • Is reply or quote?
  • User:
    • Created at
    • # Followers, Friends, Statuses
    • Is verified or has profile URL?

In the figure, the attributes above are split into those we use and those we discard.

Twitter Sentiment Corpora

  • Are our corpora biased to certain meta-data attributes?
  • Can those biases propagate into systems trained on these corpora?
  • How correlated are these meta-data features with the annotated sentiment?
  • Do these correlations hold outside of the annotated data for the same users?
  • Can sentiment classifiers exploit this bias to do well on these datasets?

Sentiment is usually labeled as positive, negative, or neutral.

Mishra, S., & Diesner, J. (2018, July 3). Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora. Proceedings of the 29th on Hypertext and Social Media. HT ’18: 29th ACM Conference on Hypertext and Social Media. https://doi.org/10.1145/3209542.3209562
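
To make the joint text+metadata setup concrete, a hedged scikit-learn sketch (column names and model are illustrative, not the paper's exact system):

    # Joint text + metadata sentiment classifier on a toy DataFrame.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    df = pd.DataFrame({
        "text": ["love this airline", "delayed again", "boarding now"],
        "n_hashtags": [0, 1, 0],
        "n_followers": [120, 4500, 80],
        "is_reply": [0, 1, 0],
        "sentiment": ["positive", "negative", "neutral"],
    })

    features = ColumnTransformer([
        ("text", TfidfVectorizer(), "text"),  # text features
        ("meta", "passthrough", ["n_hashtags", "n_followers", "is_reply"]),
    ])
    joint = Pipeline([("features", features),
                      ("clf", LogisticRegression(max_iter=1000))])
    joint.fit(df.drop(columns="sentiment"), df["sentiment"])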

49 of 65


Types of metadata and what they quantify

Quantification              | User metadata
Activity level              | # Statuses
Social interest of the user | # Friends
Social status               | # Followers
Account age                 | # days from account creation to posted tweet
Profile authenticity        | Presence of a URL on the profile, or whether the profile is verified

Quantification              | Tweet metadata
Topical variety             | # hashtags
Reference to sources        | # URLs
Reference to network        | # user mentions
Part of conversation        | Is reply
Reference to conversation   | Is quote

50 of 65


User metadata v/s Sentiment

51 of 65


Using metadata features can improve sentiment classification

Dataset    | Model | Acc. | P    | R    | F1   | KLD
Airline    | meta  | 63.9 | 61.1 | 36.8 | 32.8 | 0.663
Airline    | text  | 80.0 | 78.3 | 69.0 | 72.4 | 0.026
Airline    | joint | 80.3 | 76.6 | 72.0 | 74.0 | 0.005
Clarin     | meta  | 45.7 | 42.1 | 40.9 | 37.8 | 0.238
Clarin     | text  | 64.1 | 64.5 | 62.2 | 62.9 | 0.012
Clarin     | joint | 64.1 | 64.0 | 63.0 | 63.4 | 0.000
GOP        | meta  | 59.9 | 54.3 | 37.5 | 33.6 | 0.776
GOP        | text  | 66.4 | 63.7 | 51.4 | 53.6 | 0.111
GOP        | joint | 65.6 | 59.9 | 56.5 | 57.8 | 0.006
Healthcare | meta  | 56.7 | 36.8 | 39.4 | 35.1 | 0.717
Healthcare | text  | 64.2 | 71.3 | 49.5 | 51.0 | 0.233
Healthcare | joint | 65.6 | 61.6 | 58.3 | 59.5 | 0.007
Obama      | meta  | 39.3 | 37.0 | 35.1 | 32.0 | 0.282
Obama      | text  | 61.5 | 64.8 | 59.7 | 60.9 | 0.030
Obama      | joint | 62.3 | 63.2 | 61.6 | 62.2 | 0.002
SemEval    | meta  | 47.0 | 31.0 | 36.2 | 33.0 | 0.845
SemEval    | text  | 65.5 | 64.1 | 58.0 | 59.5 | 0.032
SemEval    | joint | 65.6 | 62.7 | 60.5 | 61.4 | 0.001

Boost in F1 is mostly due to better recall. Precision is lower.

MESC might be helping with tweets with high OOV rates, where text classifiers don’t do well.

52 of 65

Visualizing DSTDs using Social Communication Temporal Graphs

  • Social communication on social media, community forums, and Wikipedia is intrinsically networked as well as temporal.
  • Common ways of visualizing social communication are line or scatter plots, where the reader can see the temporally changing aspect of either posts or comments.
  • Both line and scatter plots hide the connected nature of social communication.


53 of 65

Approach

Social Communication Temporal Graphs (SCTG) overcome this issue:

  • They use two vertically stacked scatter plots.
  • Points in the top plot are connected to points in the bottom plot.
  • Both plots share the same time axis.
  • Each point in either plot can be styled using additional attributes, adding at least three more data dimensions per point: vertical height, color, and shape.
  • Using an SCTG, the reader can observe both the connected and the temporal nature of social communication (a minimal sketch follows this list).
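
A minimal matplotlib sketch of this layout (my reading of SCTG, not the authors' tool):

    # Two vertically stacked scatter plots sharing a time axis; lines connect
    # core items (posts) in the top panel to child items (comments) below.
    import matplotlib.pyplot as plt
    from matplotlib.patches import ConnectionPatch

    posts = [(1, "p1"), (4, "p2")]                # (time, post id)
    comments = [(2, "p1"), (3, "p1"), (5, "p2")]  # (time, parent post id)

    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.scatter([t for t, _ in posts], [1] * len(posts))
    bottom.scatter([t for t, _ in comments], [1] * len(comments))

    post_time = {pid: t for t, pid in posts}
    for t, parent in comments:  # component links across the two panels
        fig.add_artist(ConnectionPatch(
            xyA=(post_time[parent], 1), coordsA=top.transData,
            xyB=(t, 1), coordsB=bottom.transData, color="gray"))
    bottom.set_xlabel("time")
    plt.show()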


54 of 65

Components

  • Core communication components: a user in a feed, or a specific post
  • Child components: associated posts by a user, or comments on a post
  • Component links: core communication is linked to its children
  • Activity timeline: quantifies the temporal activity measurement
  • Tooltips: provide additional data about each component
  • Component heights, scaling, and color: visualize additional metadata


55 of 65

Visualizing FB groups


56 of 65


57 of 65

Visualize the temporal network of social media data in your browser


58 of 65

Inspiration from scholarly data using the DSTD framework


59 of 65

Paper = Set of concepts


Title: Geographic assessment of breast cancer screening by towns, zip codes, and census tracts. [PMID: 18019960]

No. of Authors: 7

Year: 2000

https://go.illinois.edu/legolas

MeSH concepts: Middle Aged; Mass Screening; Humans; Massachusetts; Female; Ultrasonography, Mammary; Breast Neoplasms; Geographic Information Systems; Cluster Analysis
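
A small sketch of the "paper = set of concepts" idea, comparing two hypothetical papers by MeSH-set overlap (Jaccard similarity is my choice of measure here, not necessarily the paper's):

    # Papers as sets of MeSH concepts, compared by Jaccard overlap.
    paper_a = {"Mass Screening", "Humans", "Female", "Breast Neoplasms",
               "Geographic Information Systems", "Cluster Analysis"}
    paper_b = {"Humans", "Female", "Breast Neoplasms", "Mammography"}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    print(jaccard(paper_a, paper_b))  # shared conceptual coverage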

60 of 65

Temporal profile of a concept


Phases in a concept's temporal profile: burn-in, accelerated growth, decelerated growth, constant growth.

Example concept: HIV. Age of concept: 17 years, with 142K prior papers.
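
A toy sketch of computing a concept's age and prior-paper count from a hypothetical table of paper-concept annotations:

    # Toy table of paper-concept annotations with publication years.
    import pandas as pd

    annotations = pd.DataFrame({
        "pmid": [1, 2, 3, 4],
        "concept": ["HIV", "HIV", "HIV", "Humans"],
        "year": [1983, 1990, 2000, 2000],
    })

    first_year = annotations.groupby("concept")["year"].min()

    def concept_age(concept, year):
        return year - first_year[concept]  # years since first appearance

    n_prior = ((annotations["concept"] == "HIV")
               & (annotations["year"] < 2000)).sum()
    print(concept_age("HIV", 2000), n_prior)  # -> 17 2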

Mishra, Shubhanshu, and Vetle I. Torvik. 2016. “Quantifying Conceptual Novelty in the Biomedical Literature.” D-Lib Magazine : The Magazine of the Digital Library Forum 22 (9–10). https://doi.org/10.1045/september2016-mishra.

61 of 65

Conceptual novelty

Motivation

  • Identify concept-level novelty of articles and authors
  • Concepts are identified using MeSH terms and pairs of MeSH terms in MEDLINE

Application

  • Temporal profile of concepts in MEDLINE
  • Temporal profile of author novelty
  • Visualizing novelty of paper, author over time: http://abel.ischool.illinois.edu/gimli/


62 of 65

Conceptual expertise

  • Conceptual coverage of a paper
  • Collaborations:
    • Which authors contribute?
    • Does author position give us a clue?
  • Expertise over time and careers


Mishra, Shubhanshu, Brent D. Fegley, Jana Diesner, and Vetle I. Torvik. 2018. "Expertise as an Aspect of Author Contributions." In Workshop on Informetric and Scientometric Research (SIG/MET). Vancouver.

63 of 65

Thank you


64 of 65

References

  • Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1094364_V1 
  • Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for text classification in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1917934_V1
  • Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for sequence prediction in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0934773_V1 
  • Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929


65 of 65

References

  • Mishra, Shubhanshu, & Diesner, Jana (2016). Semi-supervised Named Entity Recognition in noisy-text. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) (pp. 203–212). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclweb.org/anthology/papers/W/W16/W16-3927/  
  • Mishra, Shubhanshu, Diesner, Jana, Byrne, Jason, & Surbeck, Elizabeth (2015). Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization. In Proceedings of the 26th ACM Conference on Hypertext & Social Media - HT ’15 (pp. 323–325). New York, New York, USA: ACM Press. https://doi.org/10.1145/2700171.2791022 
