1 of 59

Computational Text Analysis

Christopher Barrie

Week 8

2 of 59

Introduction

3 of 59

Introduction

  • Getting hold of digital data:
    • Web-scraping/screen-scraping
    • APIs

4 of 59

What is web-scraping/screen-scraping

  • Automated collection of content hosted on a screen/webpage
    • Early origins: web crawling
      • Or “web wandering...”

5 of 59

Early examples

World Wide Web Wanderer, Matthew Gray, 1993.

6 of 59

World Wide Web Wanderer Reports

7 of 59

World Wide Web Wanderer Reports

8 of 59

Early examples

JumpStation, Jonathon Fletcher, 1993

9 of 59

And then… on StackOverflow

10 of 59

What is web-scraping/screen-scraping

  • Automated collection of content hosted on a screen/webpage
    • Using programming code
    • Using a GUI
    • Scraping: fetches specified content on a page or set of pages
    • Crawling: indexes URLs and crawls through them to capture content
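
The scraping/crawling distinction can be sketched in a few lines of Python. This is a minimal illustration rather than a production tool: the three-page "site" in `PAGES` is hypothetical and held in memory so that the example runs offline, and the names `scrape` and `crawl` are chosen here for illustration only.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch runs offline.
PAGES = {
    "/index": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<h1>Page A</h1><a href="/b">B</a>',
    "/b": '<h1>Page B</h1><a href="/index">Home</a>',
}


class LinkAndTitleParser(HTMLParser):
    """Collects href values of <a> tags and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = None
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and self.title is None:
            self.title = data.strip()


def scrape(url):
    """Scraping: fetch specified content (here the <h1>) from one page."""
    parser = LinkAndTitleParser()
    parser.feed(PAGES[url])
    return parser.title, parser.links


def crawl(start):
    """Crawling: follow links breadth-first, scraping each page once."""
    seen, queue, titles = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        title, links = scrape(url)
        titles[url] = title
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles
```

Calling `scrape("/a")` returns the heading and links of a single page, whereas `crawl("/index")` discovers and visits every linked page starting from the index.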

Why is it needed?

  • Websites do not contain data in readily usable format
    • They are optimized for screen legibility not data usability

11 of 59

Introduction

  • Getting hold of digital data:
    • Web-scraping/screen-scraping
    • APIs

12 of 59

What are APIs?

  • Acronym stands for “Application Programming Interface”
    • Commercial origins: allowing webpages to embed content hosted on another webpage...

13 of 59

Early examples

eBay API, 2000.

14 of 59

Early examples

Twitter API, 2006.

15 of 59

What are APIs?

  • Allow interaction with web platform and extraction of data in digestible format
  • Were first introduced to allow developers to extend platform use to other domains (e.g., use data to host content on their own platform)
    • So what’s the difference between scraping and using an API?

16 of 59

Scraping versus APIs

  • Scraping:
    • Extracts data from public/visible webpage content
    • Reformats into usable format
  • APIs:
    • Extract data from public/non-public and visible/non-visible webpage content
    • Data is pre-packaged according to the specified query
    • Can be accessed with dedicated libraries

17 of 59

Scraping versus APIs

  • Another way of putting this:
    • Scraping gets content hosted on a webpage according to identifiers, e.g. CSS tags (more on this later)
    • APIs get content from a nominally blank webpage, and ask (request) the platform to fill it in with the content of the request
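
A rough sketch of the contrast, using only the Python standard library. Everything concrete here is an assumption made for illustration: the HTML snippet, the `post` class name, the `api.example.com` endpoint and its `user`/`limit` parameters, and the JSON body are all hypothetical and do not describe any real platform.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical page markup: the content we want carries class="post".
HTML = """
<div class="nav">Home | About</div>
<p class="post">First post</p>
<p class="post">Second post</p>
<p class="ad">Buy now</p>
"""

# Hypothetical JSON body of the kind an API might send back.
API_RESPONSE = (
    '{"data": [{"id": 1, "text": "First post"},'
    ' {"id": 2, "text": "Second post"}]}'
)


class ClassTextParser(HTMLParser):
    """Keeps the text of elements whose class attribute matches."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.texts = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._capture = self.wanted_class in classes

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.texts.append(data.strip())


def scrape_posts(html):
    """Scraping: pick content out of the visible page by its identifier."""
    parser = ClassTextParser("post")
    parser.feed(html)
    return parser.texts


def build_api_request(base_url, **params):
    """API: state what we want as query parameters on a request URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_api_response(body):
    """API: the platform returns structured data, so no HTML parsing."""
    return [item["text"] for item in json.loads(body)["data"]]
```

With scraping, the analyst has to locate the content among navigation and adverts; with an API, the request (for example `build_api_request("https://api.example.com/posts", user="alice", limit=2)`) names the content wanted and the response arrives already structured.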

18 of 59

A hands-on example

  • Scraping Twitter…

😬

19 of 59

A hands-on example

  • Using Twitter API...

🙂

20 of 59

Where is the data?

21 of 59

Web-scraping

  • Scraping:
    • Potential data sources: universe of webpages in existence: >5bn.

22 of 59

APIs

  • APIs:
    • Potential APIs to use: >22k indexed on: https://www.programmableweb.com/apis

23 of 59

How have these tools been used?

24 of 59

Uses of Web scraping:

Women Are Seen More than Heard in Online Newspapers:

Jia et al. 2016. PLoS One.

Source(s): 2,353,652 news articles from over 950 news outlets

Data and code: ??? 😠

Libraries used: ??? 😠

25 of 59

Uses of Web scraping:

Auditing local news presence on Google News.

Fischer et al. 2020. Nature Human Behaviour.

Source(s): Google News (12.9m results)

Libraries used: Python Selenium

Data and code: https://osf.io/hwuxf/?view_only=3fa7499661df487689031e11b8ea20b4 😊

26 of 59

Uses of Web scraping:

Human language reveals a universal positivity bias.

Dodds et al. 2015. PNAS.

Source(s): Google Web Crawl (1T 5-grams)

Libraries used: pre-packaged data

Data and code: http://www.uvm.edu/storylab/share/papers/dodds2014a/data.html 😊

(see also) Response: https://www.pnas.org/content/112/23/E2983

27 of 59

Uses of Web scraping:

“No Fracking Way!”.

Vasi et al. 2015. American Sociological Review.

Source(s): MarcellusProtest.org

Data and code: ??? 😠

Libraries used: ??? 😠

28 of 59

Uses of APIs:

Monitoring global digital gender inequality using the online populations of Facebook and Google.

Kashyap et al. 2020. Demographic Research.

APIs used: Google AdWords; Facebook Marketing

Data and code: https://www.demographic-research.org/volumes/vol43/27/ 😊

29 of 59

Uses of APIs:

Right-Wing YouTube: A Supply and Demand Perspective.

Munger and Phillips. 2020. International Journal of Press/Politics.

APIs used: YouTube API

Libraries used: Python “youtube-data-api”

Data and code: https://github.com/kmunger/YT_descriptive 😊

30 of 59

Uses of APIs:

Parents mention sons more often than daughters on social media.

Sivak and Smirnov. 2019. PNAS.

APIs used: VKontakte API

Libraries used: ??? 😠

Data: https://osf.io/4ncbu/ 😊

Code: ??? 😠

31 of 59

Uses of APIs:

Evidence from internet search data shows information-seeking responses.

Bento et al. 2020. PNAS.

APIs used: Google Health Trends API

Libraries used: Author-written and Python apiclient

Data and code: https://github.com/anabento/GoogleBehaviorCovid 😊

32 of 59

APIs worth researching

Or search “API” in https://cran.r-project.org/web/packages/available_packages_by_name.html

Or search API directory in Programmable Web: https://www.programmableweb.com/apis/directory

Or work through this collaborative online review of APIs, written in R: https://bookdown.org/paul/apis_for_social_scientists/

Or https://www.postman.com

33 of 59

Before we try our hand at all this...

34 of 59

Design considerations...

35 of 59

Unit selection

Sampling

Inference

  • Who or what do we want to study?
  • Over what time frame?
  • From where is the data coming?
  • Are there biases in the data generating process?
  • To what population do findings relate?
  • To what phenomenon do our data speak?

36 of 59

By way of example...

Individuals with depression express more distorted thinking on social media.

Bathina et al. 2021. Nature Human Behaviour.

Finding: depressed people more likely to exhibit distorted thinking

37 of 59

38 of 59

39 of 59

40 of 59

Unit selection

Sampling

Inference

  • (Self-reported) depressed individuals online
  • Last 3,200 tweets (i.e., time period varies)
  • Twitter
  • Biased by age, urban/rural, gender
  • API and time-period bias
  • Depressed people? Depressed people online?
  • Cognitive distortion? Expression of cognitive distortion online?
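
The "time period varies" point can be made concrete with a small simulation. This is a sketch under stated assumptions: the two users, their fixed posting intervals, and the helper names are invented for illustration; only the 3,200-tweet cap comes from the slide.

```python
from datetime import datetime, timedelta


def simulate_timeline(end, n_tweets, hours_between):
    """Return tweet timestamps, newest first, spaced at a fixed interval."""
    return [end - timedelta(hours=hours_between * i) for i in range(n_tweets)]


def window_days(timeline, cap):
    """Whole days spanned by the most recent `cap` tweets (newest first)."""
    recent = timeline[:cap]
    return (recent[0] - recent[-1]).days


END = datetime(2021, 1, 1)   # hypothetical collection date
CAP = 3200                   # the per-user cap mentioned on the slide

# Two hypothetical users: one tweets hourly, the other once a day.
heavy = simulate_timeline(END, 5000, hours_between=1)
light = simulate_timeline(END, 5000, hours_between=24)

print(window_days(heavy, CAP), "days covered for the hourly tweeter")
print(window_days(light, CAP), "days covered for the daily tweeter")
```

The same cap covers a few months for the frequent tweeter and several years for the infrequent one, so any comparison between users mixes different observation periods.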

41 of 59

The legal bit…

42 of 59

Law and Data Scraping

  • Multiple considerations:
    • Is content copyrighted?; does scraping infringe IP rights?
    • Are there “terms of use” or “terms of service” for the platform?
    • Is the use non-commercial?
      • If so: is the website hosted in the UK? A 2014 UK law and a 2019 EU directive give an exemption for non-commercial research
  • Good practice:
    • Contact webpage owners before scraping to forewarn them
    • Optimize scraper to minimize demand
    • Run scraper at night or when webpage receives least traffic
    • See e.g. the ONS policy
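
One way to "minimize demand" is to pause between requests and to skip pages the site asks robots not to visit. The sketch below is illustrative only. Checking `robots.txt` is an assumption added here (the slides do not mention it), the robots file content and the `research-bot` user agent are hypothetical, and `Crawl-delay` is a non-standard directive that Python's `urllib.robotparser` happens to support. The `fetch` function is passed in by the caller, so the sketch itself never touches the network.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_text):
    """Parse robots.txt text into a rule object."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def polite_fetch_all(urls, rules, fetch, user_agent="research-bot",
                     default_delay=1.0, sleep=time.sleep):
    """Fetch allowed URLs one at a time, pausing between requests.

    `fetch` is any callable taking a URL and returning its content.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    # Use the site's requested delay if it states one, else our default.
    delay = rules.crawl_delay(user_agent) or default_delay
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # the site asks robots not to visit this path
        if results:
            sleep(delay)  # wait before every request after the first
        results[url] = fetch(url)
    return results
```

In real use, `fetch` would be whatever HTTP call the scraper makes, and the job could be scheduled for the site's quiet hours as suggested above.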

43 of 59

The Case Law…

  • Many cases have been brought (though not against researchers) for uses of online data in violation of ToS.

AOIR Ethics 3.0 Guidelines: https://aoir.org/ethics/

CREATe Report “Law of Data Scraping”: https://zenodo.org/record/4635759#.Yd7C6sanxf1

44 of 59

45 of 59

A basic decision tree

46 of 59

A basic decision tree

For more see: Matt Salganik, Bit by Bit: Social Research in the Digital Age. 2018. ch.6.

But: ethical questions intervene at each step. Legal ≠ ethical

47 of 59

Law

Ethics

  • Legal constraints placed by platforms on accessing content
  • E.g., not to use scrapers/crawlers
  • Ethical protection of user privacy/violations of contextual integrity
  • Protection of minors/vulnerable groups

48 of 59

“By employing TOS-compliant methods, you are respecting the business prerogatives of the company that created the platform you are studying, but you may or may not be respecting the dignity and privacy of the platform’s users”

49 of 59

“On the one hand, a purely legal argument may be mounted that in some circumstances the benefits to society from breaching the terms of service outweigh the detriments to the platform itself, or to specific users… [o]n the other hand, this is also the legal expression of a much broader moral argument that points to the need for independent, critical, public-interest scrutiny of social media spaces that are now critical to public communication across many societies.”

50 of 59

“A ‘regulation optimistic’ approach: we need a combination of data access options that are widely available to anyone with no restrictions, but that only cover very limited offerings, and more comprehensive options that allow more in-depth research, but with more specific preconditions… Only if one assumes that privacy is merely a convenient excuse to stifle critical research can this problem be ignored.”

51 of 59

“I don’t think researchers should be automatically bound by such terms-of-service agreements. Ideally, if researchers violate terms-of-service agreements, they should explain their decision openly… as suggested by transparency-based accountability. But this openness may expose researchers to added legal risk; in the United States, for example, the Computer Fraud and Abuse Act may make it illegal to violate terms-of-service agreements…”

52 of 59

Private agreement as alternative?

  • Social Science One flagship example of this
  • Advantages
    • Enhanced data access; bespoke for researcher needs; avoids unannounced access changes (APIs not intended for research needs after all…); is better than walking away (?!)
  • Shortcomings
    • Prohibitively demanding for independent researchers; concedes too much (?); means participating in corporate reputation washing (?)

53 of 59

OCR

  • Stands for “Optical Character Recognition”
  • Extracts text from images
    • E.g…:

54 of 59

55 of 59

Speech to text

  • ML engines to convert audio to text (e.g., for lectures and YouTube)
  • Google Cloud Speech-to-Text
    • Has Python support
  • Or… capture pre-generated YouTube captions
    • Using e.g.

56 of 59

57 of 59

The course book...

58 of 59

59 of 59

Thanks!