1 of 34

Computational Text Analysis

Christopher Barrie

Week 9

2 of 34

Introduction

3 of 34

Supervised learning

  • You’ve tried dictionary-based methods
    • And that’s kind of like classifying…

4 of 34

5 of 34

An example

  • Denominating by totals
    • What if we also had lots of words indicating other types of sentiment?
      • E.g., a typical phrase like “I was furious to be at a lecture at 9AM but I was filled with utter joy and elation upon entering the classroom…”
      • And then we counted up words denoting sentiment…?
      • Ideas?

6 of 34

An example

  • Denominating by totals
    • “I was furious to be at a lecture at 9AM but I was filled with utter joy and elation upon entering the classroom…”
      • + 2 “happy” words
      • + 1 “unhappy” word
      • + 23 total words
    • 2/23 = .087… (share of “happy” words)
    • 1/23 = .043… (share of “unhappy” words; see the sketch below)
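
A minimal sketch of this denominating-by-totals calculation in Python; the word lists here are illustrative stand-ins for a real sentiment dictionary:

    # Toy dictionary-based scoring: count dictionary hits, divide by total words.
    # These word lists are illustrative, not an actual sentiment dictionary.
    happy_words = {"joy", "elation", "happy"}
    unhappy_words = {"furious", "sad", "angry"}

    text = ("I was furious to be at a lecture at 9AM but I was filled "
            "with utter joy and elation upon entering the classroom")
    tokens = text.lower().split()                         # 23 tokens

    n_happy = sum(t in happy_words for t in tokens)       # 2
    n_unhappy = sum(t in unhappy_words for t in tokens)   # 1

    print(n_happy / len(tokens))     # 2/23 ≈ 0.087
    print(n_unhappy / len(tokens))   # 1/23 ≈ 0.043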

7 of 34

Supervised learning

  • You’ve tried dictionary-based methods
    • And that’s kind of like classifying…
  • We have:
    • 1. Some rule according to which we’re classifying (supervised)
    • 2. Some output unit of analysis we’re targeting

8 of 34

Supervised learning

  • You’ve tried word embedding methods
    • And that’s also useful when classifying…

9 of 34

I like doing text analysis with Chris


10 of 34

Word embeddings

  • Context window: how many words around the target word we are counting
  • Co-occurrence: for any two words, the number of times they appear together in a context window
  • Words in red = target words
  • Words in blue = context words

I like doing text analysis with Chris


11 of 34

Word embeddings

  • How does it work?
  • We count up the co-occurrences of words over our pre-specified context window (often ~6 words); see the sketch after the matrix below

I like doing text analysis with Chris

            I  like  doing  text  analysis  with  Chris
I           0     1      1     0         0     0      0
like        0     0      1     1         0     0      0
doing       0     0      0     1         1     0      0
text        0     0      0     0         1     1      0
analysis    0     0      0     0         0     1      1
with        0     0      0     0         0     0      1
Chris       0     0      0     0         0     0      0

Context window = 2 (co-occurrences counted for context words up to two positions to the right of each target word)
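
A minimal sketch of how the matrix above is built, assuming a forward-looking window of two words (a real implementation would scan an entire corpus rather than one sentence):

    from collections import defaultdict

    sentence = "I like doing text analysis with Chris".split()
    window = 2  # context window: how many words to the right of the target we count

    # cooc[target][context] = times `context` occurs within `window` words after `target`
    cooc = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(sentence):
        for context in sentence[i + 1 : i + 1 + window]:
            cooc[target][context] += 1

    # Print the co-occurrence matrix row by row (vocabulary = the 7 unique words)
    for target in sentence:
        print(f"{target:>9}", [cooc[target][c] for c in sentence])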

12 of 34

Word embeddings

  • Now imagine what this would look like for a whole book
    • many dimensions!
    • matrix of dimensions V x V, where V is the size of the corpus vocabulary
    • e.g…

13 of 34

Supervised learning

  • You’ve tried word embedding methods
    • And that’s also useful when classifying…
  • We can use DTMs (document-term matrices) as input to a classification algorithm (see the sketch below)
  • We can also use embeddings as input (not covered here)
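
A minimal sketch of the DTM-as-input route using scikit-learn; the labelled documents below are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny made-up labelled corpus: 1 = positive, 0 = negative
    docs = ["utter joy and elation in the classroom",
            "furious about the 9AM lecture",
            "text analysis fills me with joy",
            "this lecture made me miserable"]
    labels = [1, 0, 1, 0]

    # CountVectorizer builds the document-term matrix (DTM);
    # MultinomialNB is a Naive Bayes classifier trained on those counts
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(docs, labels)

    print(clf.predict(["what joy to enter the classroom"]))  # likely array([1])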

14 of 34

15 of 34

Supervised learning

  • Ways of approaching supervised learning problems:
    • Train your own model from scratch using a standard algorithm (Naive Bayes, Random Forest, etc.)
    • Classify using some pre-packaged engine (e.g., Perspective)
    • Classify by fine-tuning some Transformer-based architecture
    • Classify in zero- or few-shot setting

16 of 34

An example

Trumping Hate on Twitter? Online Hate Speech in the 2016 U.S. Election Campaign and its Aftermath. Siegel et al. 2021. Quarterly Journal of Political Science

17 of 34

An example

Trumping Hate on Twitter? Online Hate Speech in the 2016 U.S. Election Campaign and its Aftermath. Siegel et al. 2021. Quarterly Journal of Political Science

18 of 34

Recent advances

19 of 34

An example

Attention Is All You Need. Vaswani et al. 2017. NeurIPS

20 of 34

An example

Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in U.S. Presidential Campaigns (1952–2020) with Neural Language Models. Bonikowski et al. 2022. Sociological Methods and Research

21 of 34

An example

Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in U.S. Presidential Campaigns (1952–2020) with Neural Language Models. Bonikowski et al. 2022. Sociological Methods and Research

22 of 34

An example

Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in U.S. Presidential Campaigns (1952–2020) with Neural Language Models. Bonikowski et al. 2022. Sociological Methods and Research

23 of 34

Transformer-based models

  • Transformers:
    • Take pre-trained model
    • Fine-tune with a classification head on labelled data (see the sketch below)
    • Higher performance
    • Higher speed
    • Reproducible and accessible: https://huggingface.co/docs/transformers/model_doc/bert
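
A minimal sketch of the fine-tuning route with Hugging Face transformers, assuming a small labelled dataset; the texts, labels, and hyperparameters are placeholders, not the settings used in any of the papers above:

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              TrainingArguments, Trainer)

    # Placeholder labelled data; in practice this is your annotated training set
    data = Dataset.from_dict({
        "text": ["utter joy and elation", "furious about the 9AM lecture"],
        "label": [1, 0],
    })

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # pre-trained BERT + new classification head

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

    data = data.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="bert-finetuned", num_train_epochs=3,
                             per_device_train_batch_size=8)
    Trainer(model=model, args=args, train_dataset=data).train()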

24 of 34

An example

Language Models are Few-Shot Learners. Brown et al. 2020. NeurIPS

25 of 34

Foundation models

  • LLMs:
    • Take pre-trained model
    • Add a prompt or a series of examples (zero- or few-shot; see the sketch below)
    • Higher performance (versus Transformers) on some tasks
    • Question marks over accessibility and reproducibility
    • An example: https://chat.openai.com/share/657b645a-f166-4457-8f16-32a9ecbf1437
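
A minimal zero-shot sketch using the OpenAI Python client; the model name and prompt wording are assumptions, and outputs are not guaranteed to be identical across runs or model versions:

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    text = "I was furious to be at a lecture at 9AM"
    prompt = ("Classify the sentiment of the following text as POSITIVE or NEGATIVE. "
              "Answer with one word only.\n\nText: " + text)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # reduces, but does not eliminate, run-to-run variation
    )
    print(response.choices[0].message.content)  # e.g. "NEGATIVE"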

26 of 34

An example

ChatGPT outperforms crowd workers for text-annotation tasks. Gilardi et al. 2023. PNAS

27 of 34

An example

ChatGPT outperforms crowd workers for text-annotation tasks. Gilardi et al. 2023. PNAS

28 of 34

An example

ChatGPT outperforms crowd workers for text-annotation tasks. Gilardi et al. 2023. PNAS

29 of 34

Some terminology…

30 of 34

Supervised learning

Validation

  • Often relies on human coders
    • Accuracy: % correctly classified
    • Recall: true positives / (true positives + false negatives)
    • Precision: true positives / (true positives + false positives)
    • ROC (receiver operating characteristic) curves: true positive rate (i.e., recall, a.k.a. sensitivity) versus false positive rate (i.e., false alarm rate); see the sketch below
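
A minimal sketch of computing these metrics with scikit-learn, given human-coded labels and model output; the arrays below are made up:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 roc_auc_score, roc_curve)

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]                       # human-coded labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                       # hard predictions
    y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]      # predicted probabilities

    print(accuracy_score(y_true, y_pred))    # % correctly classified
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)

    # ROC curve: true positive rate vs false positive rate across thresholds
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))    # area under the ROC curve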

31 of 34

32 of 34

The course book...

33 of 34

34 of 34

Thanks!