1 of 59

Computational Text Analysis

Christopher Barrie

Week 8

2 of 59

Introduction

3 of 59

Introduction

Getting hold of digital data:

Web-scraping/screen-scraping

APIs

4 of 59

What is web-scraping/screen-scraping

Automated collection of content hosted on a screen/webpage

Early origins: web crawling

Or “web wandering...”

5 of 59

Early examples

World Wide Web Wanderer, Matthew Gray, 1993.

6 of 59

World Wide Web Wanderer Reports

See: https://www.mit.edu/~mkgray/net/web-growth-summary.html

8 of 59

Early examples

JumpStation, Jonathon Fletcher, 1993

See: https://tedium.co/2019/09/05/jumpstation-search-history/ and

https://www.bbc.com/news/technology-23945326

9 of 59

And then… on StackOverflow

10 of 59

What is web-scraping/screen-scraping

Automated collection of content hosted on a screen/webpage

Using programming code
Using a GUI
Scraping: fetches specified content on a page or set of pages
Crawling: indexes URLs and crawls through them to capture content

Why is it needed?

Websites do not contain data in readily usable format

They are optimized for screen legibility not data usability
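The distinction between scraping and crawling can be sketched in a few lines of Python. This is a minimal illustration rather than a production tool: the three-page `SITE` dictionary stands in for a real website so that the example runs without network access, and the page paths, the choice of `<h1>` as the target content, and the helper names `scrape` and `crawl` are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML (stands in for real HTTP requests).
SITE = {
    "/": '<h1>Home</h1><a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<h1>Page A</h1><a href="/b">B</a>',
    "/b": '<h1>Page B</h1><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the text of <h1> elements and the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


def scrape(url):
    """Scraping: fetch specified content (here, <h1> text) from one known page."""
    parser = PageParser()
    parser.feed(SITE[url])
    return parser.headings, parser.links


def crawl(start):
    """Crawling: follow links breadth-first, indexing each URL and scraping it once."""
    seen, queue, index = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        headings, links = scrape(url)
        index[url] = headings
        for link in links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


print(scrape("/a")[0])  # content from one specified page
print(crawl("/"))       # content from every page reachable from the start URL
```

The scraper needs to be told which page to read; the crawler discovers pages for itself by following links, and calls the scraper on each one.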

11 of 59

Introduction

Getting hold of digital data:

Web-scraping/screen-scraping

APIs

12 of 59

What are APIs?

Acronym stands for “Application Programming Interface”

Commercial origins: allowing webpages to embed content hosted on another webpage...

13 of 59

Early examples

eBay API, 2000.

See: https://apievangelist.com/2012/12/20/history-of-apis/

14 of 59

Early examples

Twitter API, 2006.

See: https://apievangelist.com/2012/12/20/history-of-apis/

15 of 59

What are APIs?

Allow interaction with web platform and extraction of data in digestible format
Were first introduced to allow developers to extend platform use to other domains (e.g., use data to host content on their own platform)

So what’s the difference between scraping and using an API?

16 of 59

Scraping versus APIs

Scraping:

Extracts data from public/visible webpage content
Reformats into usable format

APIs:

Extract data from public/non-public and visible/non-visible webpage content
Return data pre-packaged according to a specified query
Can be accessed with dedicated libraries

17 of 59

Scraping versus APIs

Another way of putting this:

Scraping gets content hosted on a webpage according to identifiers, e.g. CSS selectors (more on this later)
APIs start from a nominally blank webpage and ask (request) the platform to fill it in with the content of the request
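A small sketch may make this contrast concrete. Everything here is made up for illustration: `HTML_PAGE` is a hypothetical page, the `post` class name is an assumed identifier, and `API_RESPONSE` is a hypothetical JSON payload of the kind an API might return for the same content. No real platform or endpoint is being described.

```python
import json
from html.parser import HTMLParser

# Hypothetical page as a browser would receive it: content mixed with layout markup.
HTML_PAGE = """
<html><body>
  <div class="nav">Home | About</div>
  <p class="post">First post</p>
  <p class="post">Second post</p>
  <p class="footer">Contact us</p>
</body></html>
"""

# Hypothetical API response for the same posts: data only, already structured.
API_RESPONSE = '{"posts": [{"id": 1, "text": "First post"}, {"id": 2, "text": "Second post"}]}'


class PostScraper(HTMLParser):
    """Keeps the text of <p> elements whose class attribute is exactly "post"."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        # Rough equivalent of the CSS selector `p.post`.
        self._capture = tag == "p" and dict(attrs).get("class") == "post"

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.posts.append(data.strip())


def scrape_posts(html):
    """Scraping: locate the content inside the page markup by its identifier."""
    scraper = PostScraper()
    scraper.feed(html)
    return scraper.posts


def api_posts(payload):
    """API: the platform has already packaged the content; just read the fields."""
    return [item["text"] for item in json.loads(payload)["posts"]]


print(scrape_posts(HTML_PAGE))
print(api_posts(API_RESPONSE))
```

Both routes end with the same two posts, but the scraper depends on the page's markup staying the same, whereas the API route depends on the platform continuing to offer the data.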

18 of 59

A hands-on example

Scraping Twitter…

😬

19 of 59

A hands-on example

Using Twitter API...

🙂

20 of 59

Where is the data?

21 of 59

Web-scraping

Scraping:

Potential data sources: universe of webpages in existence: >5bn.

22 of 59

APIs

APIs:

Potential APIs to use: >22k indexed on: https://www.programmableweb.com/apis

23 of 59

How have these tools been used?

24 of 59

Uses of Web scraping:

Women Are Seen More than Heard in Online Newspapers:

Jia et al. 2016. PLoS One.

Source(s): 2,353,652 news articles from over 950 news outlets

Data and code: ??? 😠

Libraries used: ??? 😠

25 of 59

Uses of Web scraping:

Auditing local news presence on Google News.

Fischer et al. 2020. Nature Human Behaviour.

Source(s): Google News (12.9m results)

Libraries used: Python Selenium

Data and code: https://osf.io/hwuxf/?view_only=3fa7499661df487689031e11b8ea20b4 😊

26 of 59

Uses of Web scraping:

Human language reveals a universal positivity bias.

Dodds et al. 2015. PNAS.

Source(s): Google Web Crawl (1T 5-grams)

Libraries used: pre-packaged data

Data and code: http://www.uvm.edu/storylab/share/papers/dodds2014a/data.html 😊

(see also) Response: https://www.pnas.org/content/112/23/E2983

27 of 59

Uses of Web scraping:

“No Fracking Way!”.

Vasi et al. 2015. American Sociological Review.

Source(s): MarcellusProtest.org

Data and code: ??? 😠

Libraries used: ??? 😠

28 of 59

Uses of APIs:

Monitoring global digital gender inequality using the online populations of Facebook and Google.

Kashyap et al. 2020. Demographic Research.

APIs used: Google AdWords; Facebook Marketing

Data and code: https://www.demographic-research.org/volumes/vol43/27/ 😊

29 of 59

Uses of APIs:

Right-Wing YouTube: A Supply and Demand Perspective.

Munger and Phillips. 2020. International Journal of Press/Politics.

APIs used: YouTube API

Libraries used: Python “youtube-data-api”

Data and code: https://github.com/kmunger/YT_descriptive 😊

30 of 59

Uses of APIs:

Parents mention sons more often than daughters on social media.

Sivak and Smirnov. 2019. PNAS.

APIs used: VKontakte API

Libraries used: ??? 😠

Data: https://osf.io/4ncbu/ 😊

Code: ??? 😠

31 of 59

Uses of APIs:

Evidence from internet search data shows information-seeking responses to news of local COVID-19 cases.

Bento et al. 2020. PNAS.

APIs used: Google Health Trends API

Libraries used: Author-written and Python apiclient

Data and code: https://github.com/anabento/GoogleBehaviorCovid 😊

32 of 59

APIs worth researching

Twitter API

https://cran.r-project.org/web/packages/rtweet/rtweet.pdf

YouTube API

https://youtube-data-api.readthedocs.io/en/latest/

Spotify API

https://www.rcharlie.com/spotifyr/

Genius API

https://cran.r-project.org/web/packages/geniusr/vignettes/geniusr.html

Guardian Newspaper API

https://cran.r-project.org/web/packages/guardianapi/vignettes/introduction.html

Or search “API” in https://cran.r-project.org/web/packages/available_packages_by_name.html

Or search API directory in Programmable Web: https://www.programmableweb.com/apis/directory

Or work through this collaborative online review of APIs, written in R: https://bookdown.org/paul/apis_for_social_scientists/

Or https://www.postman.com

33 of 59

Before we try our hand at all this...

34 of 59

Design considerations...

35 of 59

Unit selection:

Who or what do we want to study?
Over what time frame?

Sampling:

From where is the data coming?
Are there biases in the data generating process?

Inference:

To what population do findings relate?
To what phenomenon do our data speak?

36 of 59

By way of example...

Individuals with depression express more distorted thinking on social media.

Bathina et al. 2021. Nature Human Behaviour.

Finding: depressed people more likely to exhibit distorted thinking

40 of 59

Unit selection:

(Self-reported) depressed individuals online
Last 3,200 tweets (i.e., time period varies)

Sampling:

Twitter
Biased by age, urban/rural, gender
API and time-period bias

Inference:

Depressed people? Depressed people online?
Cognitive distortion? Expression of cognitive distortion online?
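The time period varies because collection is capped at a fixed number of tweets per account, as the short calculation below illustrates. The 3,200 cap comes from the slide; the two posting rates, the collection date, and the assumption of a constant daily posting rate are invented purely for illustration.

```python
from datetime import date, timedelta

CAP = 3200  # per-account cap on retrievable tweets, as stated above


def window_covered(posts_per_day, collected_on, cap=CAP):
    """Return the earliest date reached when only the latest `cap` posts are available.

    Assumes, purely for illustration, that the account posts at a constant daily rate.
    """
    days_back = cap / posts_per_day
    return collected_on - timedelta(days=days_back)


collected_on = date(2021, 1, 1)  # hypothetical collection date
for label, rate in [("heavy user", 32), ("light user", 2)]:  # hypothetical rates
    start = window_covered(rate, collected_on)
    span = (collected_on - start).days
    print(f"{label}: {rate}/day -> covers {span} days, back to {start}")
```

Under these assumptions the heavy user's sample spans about three months and the light user's more than four years, so the same cap yields very different observation windows across individuals.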

41 of 59

The legal bit…

42 of 59

Law and Data Scraping

Multiple considerations:

Is the content copyrighted? Does scraping infringe IP rights?
Are there “terms of use” or “terms of service” for the platform?
Is the use non-commercial?

If so: is the website hosted in the UK? A 2014 UK law and a 2019 EU directive give an exemption for non-commercial research

Good practice:

Contact webpage owners in advance of scraping to forewarn them
Optimize scraper to minimize demand
Run scraper at night or when webpage receives least traffic
See e.g. the ONS policy
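One way to put "minimize demand" into practice is to respect a site's robots.txt and pause between requests. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `research-scraper` user agent, the example URLs and the stand-in `fetch` function are all hypothetical, and a real scraper would download robots.txt from the site and use a real HTTP client.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "research-scraper"  # hypothetical name; identify yourself honestly


def polite_plan(urls, robots_txt=ROBOTS_TXT, default_delay=1.0):
    """Return the URLs robots.txt permits and the pause to leave between requests."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    allowed = [u for u in urls if rules.can_fetch(USER_AGENT, u)]
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    return allowed, delay


def run(urls, fetch, sleep=time.sleep):
    """Fetch each permitted URL in turn, pausing between requests to limit load."""
    allowed, delay = polite_plan(urls)
    results = []
    for i, url in enumerate(allowed):
        if i:  # no need to wait before the first request
            sleep(delay)
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function and a recorded (not real) pause.
urls = [
    "https://example.org/index.html",
    "https://example.org/private/data.html",
    "https://example.org/about.html",
]
pauses = []
pages = run(urls, fetch=lambda url: f"<html>{url}</html>", sleep=pauses.append)
print(len(pages), "pages fetched; pauses requested:", pauses)
```

Passing `sleep` in as a parameter keeps the demonstration instant; in real use the default `time.sleep` applies the delay. Note that robots.txt expresses the site owner's wishes and is separate from the legal questions above.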

43 of 59

The Case Law…

Many cases have been brought (though not against researchers) for uses of online data in violation of ToS.

First: eBay v. Bidder’s Edge

Latest: HiQ v. LinkedIn, see: https://www.natlawreview.com/article/hiq-files-opposition-brief-supreme-court-linkedin-cfaa-data-scraping-dispute

In process: Van Buren v. United States

First Supreme Court ruling on meaning of “unauthorized access” stipulated in CFAA, see: https://themarkup.org/news/2020/12/03/why-web-scraping-is-vital-to-democracy

AOIR Ethics 3.0 Guidelines: https://aoir.org/ethics/

CREATe Report “Law of Data Scraping”: https://zenodo.org/record/4635759#.Yd7C6sanxf1

45 of 59

A basic decision tree

46 of 59

A basic decision tree

For more see: Matt Salganik, Bit by Bit: Social Research in the Digital Age. 2018. ch.6.

But: ethical questions intervene at each step. Legal ≠ ethical

47 of 59

Law:

Legal constraints placed by platforms on accessing content
E.g., not to use scrapers/crawlers

Ethics:

Ethical protection of user privacy/violations of contextual integrity
Protection of minors/vulnerable groups

48 of 59

“By employing TOS-compliant methods, you are respecting the business prerogatives of the company that created the platform you are studying, but you may or may not be respecting the dignity and privacy of the platform’s users”

49 of 59

“On the one hand, a purely legal argument may be mounted that in some circumstances the benefits to society from breaching the terms of service outweigh the detriments to the platform itself, or to specific users… [o]n the other hand, this is also the legal expression of a much broader moral argument that points to the need for independent, critical, public-interest scrutiny of social media spaces that are now critical to public communication across many societies..”

50 of 59

“A ‘regulation optimistic’ approach: we need a combination of data access options that are widely available to anyone with no restrictions, but that only cover very limited offerings, and more comprehensive options that allow more in-depth research, but with more specific preconditions… Only if one assumes that privacy is merely a convenient excuse to stifle critical research can this problem be ignored.”

51 of 59

“I don’t think researchers should be automatically bound by such terms-of-service agreements. Ideally, if researchers violate terms-of-service agreements, they should explain their decision openly… as suggested by transparency-based accountability. But this openness may expose researchers to added legal risk; in the United States, for example, the Computer Fraud and Abuse Act may make it illegal to violate terms-of-service agreements…”

52 of 59

Private agreement as alternative?

Social Science One is the flagship example of this

See https://socialscience.one/

Not without obstacles, see: https://socialscience.one/blog/update-social-science-one

Advantages

Enhanced data access; bespoke for researcher needs; avoids unannounced access changes (APIs not intended for research needs after all…); is better than walking away (?!)

Shortcomings

Prohibitively demanding for independent researchers; concedes too much (?); means participating in corporate reputation washing (?)

53 of 59

OCR

Stands for “Optical Character Recognition”
Extracts text from images

E.g…:

55 of 59

Speech to text

ML engines to convert audio to text (e.g., for lectures and YouTube)
Google Cloud Speech-to-Text

Has Python support

Or… capture pre-generated YouTube captions

Using e.g.
