Computational Text Analysis
Christopher Barrie
Week 8
Introduction
What is web-scraping/screen-scraping
Early examples
World Wide Web Wanderer, Matthew Gray, 1993.
World Wide Web Wanderer Reports
Early examples
JumpStation, Jonathon Fletcher, 1993
And then… on StackOverflow
What is web-scraping/screen-scraping
Why is it needed?
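As a rough illustration of what scraping involves, the sketch below pulls link targets and link text out of a page's HTML using only the Python standard library. The HTML string, the `LinkExtractor` class, and the `extract_links` helper are all made up for this example; a real scraper would first download the page source and would usually need to handle much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page source; in practice this string would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <h1>Example News</h1>
  <a href="/story/1">First headline</a>
  <a href="/story/2">Second headline</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

The point of the sketch is that the data sit inside markup written for human readers, so the scraper has to know where in the page structure to look.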
Introduction
What are APIs?
Early examples
Ebay API, 2000.
Early examples
Twitter API, 2006.
What are APIs?
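By contrast with scraping, an API typically takes a structured request and returns structured data, often JSON. The sketch below shows the two halves of that exchange without touching the network. The endpoint `https://api.example.org/v1/posts`, the parameter names, and the response body are hypothetical and do not describe any real service.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.org/v1/posts"


def build_request_url(query, max_results=10):
    """Encode the query parameters into a request URL."""
    params = {"q": query, "max_results": max_results}
    return f"{BASE_URL}?{urlencode(params)}"


# Made-up response body of the kind a JSON API might return.
SAMPLE_RESPONSE = (
    '{"data": [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}],'
    ' "next_page": null}'
)


def parse_posts(body):
    """Turn a JSON response body into a list of (id, text) tuples."""
    payload = json.loads(body)
    return [(item["id"], item["text"]) for item in payload["data"]]


if __name__ == "__main__":
    print(build_request_url("web scraping"))
    print(parse_posts(SAMPLE_RESPONSE))
```

Because the response arrives already structured, there is no markup to pick apart; the trade-off is that the provider decides which fields and how many records are exposed.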
Scraping versus APIs
A hands-on example
Where is the data?
Web-scraping
APIs
How have these tools been used?
Uses of Web scraping:
Women Are Seen More than Heard in Online Newspapers.
Jia et al. 2016. PLoS One.
Source(s): 2,353,652 news articles from over 950 news outlets
Data and code: ???
Libraries used: ???
Uses of Web scraping:
Auditing local news presence on Google News.
Fischer et al. 2020. Nature Human Behaviour.
Source(s): Google News (12.9m results)
Libraries used: Python Selenium
Data and code: https://osf.io/hwuxf/?view_only=3fa7499661df487689031e11b8ea20b4
Uses of Web scraping:
Human language reveals a universal positivity bias.
Dodds et al. 2015. PNAS.
Source(s): Google Web Crawl (1T 5-grams)
Libraries used: pre-packaged data
Data and code: http://www.uvm.edu/storylab/share/papers/dodds2014a/data.html
(see also) Response: https://www.pnas.org/content/112/23/E2983
Uses of Web scraping:
“No Fracking Way!”.
Vasi et al. 2015. American Sociological Review.
Source(s): MarcellusProtest.org
Data and code: ???
Libraries used: ???
Uses of APIs:
Monitoring global digital gender inequality using the online populations of Facebook and Google.
Kashyap et al. 2020. Demographic Research.
APIs used: Google AdWords; Facebook Marketing
Data and code: https://www.demographic-research.org/volumes/vol43/27/
Uses of APIs:
Right-Wing YouTube: A Supply and Demand Perspective.
Munger and Phillips. 2020. International Journal of Press/Politics.
APIs used: YouTube API
Libraries used: Python “youtube-data-api”
Data and code: https://github.com/kmunger/YT_descriptive
Uses of APIs:
Parents mention sons more often than daughters on social media.
Sivak and Smirnov. 2019. PNAS.
Uses of APIs:
Evidence from internet search data shows information-seeking responses to news of local COVID-19 cases.
Bento et al. 2020. PNAS.
APIs used: Google Health Trends API
Libraries used: Author-written and Python apiclient
Data and code: https://github.com/anabento/GoogleBehaviorCovid
APIs worth researching
Or search “API” in https://cran.r-project.org/web/packages/available_packages_by_name.html
Or search API directory in Programmable Web: https://www.programmableweb.com/apis/directory
Or work through this collaborative online review of APIs, written in R: https://bookdown.org/paul/apis_for_social_scientists/
Before we try our hand at all this...
Design considerations...
Unit selection
Sampling
Inference
By way of example...
Individuals with depression express more distorted thinking on social media.
Bathina et al. 2021. Nature Human Behaviour.
Finding: depressed people more likely to exhibit distorted thinking
Unit selection
Sampling
Inference
The legal bit…
Law and Data Scraping
The Case Law…
AOIR Ethics 3.0 Guidelines: https://aoir.org/ethics/
CREATe Report “Law of Data Scraping”: https://zenodo.org/record/4635759#.Yd7C6sanxf1
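One practical signal a scraper can check before requesting a page is the site's robots.txt file. This is not something the slides or the reports above prescribe, and honouring robots.txt is neither a legal test nor a substitute for reading the terms of service; the sketch below only shows how the check might be automated with Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, the user-agent name, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/public/page",
                "https://example.org/private/data"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-research-bot", url))
```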
A basic decision tree
For more see: Matt Salganik, Bit by Bit: Social Research in the Digital Age. 2018. ch.6.
But: ethical questions intervene at each step. Legal ≠ ethical
Law
Ethics
“By employing TOS-compliant methods, you are respecting the business prerogatives of the company that created the platform you are studying, but you may or may not be respecting the dignity and privacy of the platform’s users”
“On the one hand, a purely legal argument may be mounted that in some circumstances the benefits to society from breaching the terms of service outweigh the detriments to the platform itself, or to specific users… [o]n the other hand, this is also the legal expression of a much broader moral argument that points to the need for independent, critical, public-interest scrutiny of social media spaces that are now critical to public communication across many societies.”
“A ‘regulation optimistic’ approach: we need a combination of data access options that are widely available to anyone with no restrictions, but that only cover very limited offerings, and more comprehensive options that allow more in-depth research, but with more specific preconditions… Only if one assumes that privacy is merely a convenient excuse to stifle critical research can this problem be ignored.”
“I don’t think researchers should be automatically bound by such terms-of-service agreements. Ideally, if researchers violate terms-of-service agreements, they should explain their decision openly… as suggested by transparency-based accountability. But this openness may expose researchers to added legal risk; in the United States, for example, the Computer Fraud and Abuse Act may make it illegal to violate terms-of-service agreements…”
Private agreement as alternative?
OCR
Speech to text
The course book...
Thanks!