1 of 62

SICSS Atlanta 2022

2 of 62

Announcements

  • Sign up for lightning talks (link pinned in the Slack & in your email)
  • Daily logistics

3 of 62

The Plan for Today

  • An example SICSS project
  • Introduction to digital trace data
  • Omar Asensio and Olga Churkina
  • Lunch & research talk
  • Group work

4 of 62

SICSS Sample Project

Day 2

5 of 62

Trying to find a new apartment

6 of 62

Trying to find a new apartment

7 of 62

Where is the Data From?

  • CrimeReports.com
    • Mines data from over 1,000 participating agencies; however, they report on their website that “Each agency controls their data flow to CrimeReports, including how often they send data, which incidents are included.”

  • Lucido (2011): Trulia’s data is not weighted by population and gives limited crime type information
  • Coronado Realty Group (2016): crime rates are combined by nearby areas that are actually separated by large bodies of water

8 of 62

Trying to find a new apartment

9 of 62

Information Asymmetries in Housing Markets

  • Who wins?
    • Real Estate Agents v. Individuals
    • People with resources or people without

  • Crime affects housing prices (Kat)
  • Consumer knowledge of foreclosure (Maria)

  • What if we could quantify the cost of the data distortion – thereby generating the literal price of misinformation?

10 of 62

11 of 62

Counts

12 of 62

Counts

Rates

13 of 62

14 of 62

Total Sample

  • Purchasing
    • 52.68% of participants would not purchase
    • 47.32% would purchase
  • Average cost
    • $246,362


15 of 62

Counts

  • Purchase price
    • $251,298
  • Buying
    • No: 40%
    • Yes: 60%


16 of 62

Counts

  • Purchase price
    • $251,298
  • Buying
    • No: 40%
    • Yes: 60%

Rates

  • Purchase Price
    • $241,276
  • Buying
    • No: 66%
    • Yes: 34%


17 of 62

Results

  • Everything else about the house was held constant

  • The crime data is all true
    • It’s just projected differently

  • Changing the projection substantially changed willingness to purchase and shifted the purchase price by more than $10,000


18 of 62

What happened after SICSS

  • We applied for funding from SICSS
  • We ran a pretest
    • We then spent the next full year running some more pretests
  • We launched the real thing
  • What started as a comment became a mock-up, and then a study

  • More importantly, it became a collaboration and a friendship between two people interested in the same problem who otherwise never would have worked together

19 of 62

Digital Trace Data

Day 2

20 of 62

Readymades vs. Custommades

  • Taking something that was created for one purpose and turning it into something different (creative re-purposing)
    • Digital Trace data
    • Data Scientists

  • Data created intentionally
    • Social science

Cleo the Clownfish from the Shedd Aquarium

21 of 62

Readymades vs. Custommades

  • Readymade: Twitter data

  • Custommade: survey data

Cleo the Clownfish from the Shedd Aquarium

22 of 62

What is digital trace data?

  • Large, digital datasets that describe human behavior
    • (e.g. social media sites, internet search data, blogs, administrative records, historical archives, audio-visual data, or geospatial data)

23 of 62

Strengths of digital trace data

  • Big
  • Always on
  • Non-reactive
  • Captures social relationships

24 of 62

Weaknesses of digital trace data

  • Inaccessible
  • Non-representative

25 of 62

26 of 62

Weaknesses of digital trace data

  • Inaccessible
  • Non-representative
  • Drifting
  • Algorithmically confounded
  • Unstructured
  • Sensitive
  • Incomplete
  • Elite/publication bias

27 of 62

Application Programming Interfaces

  • With the advent of Web 2.0, the internet became more interdependent
  • There are over 22,000 APIs today (though many of them are not public)
  • Links to public APIs

28 of 62

Application Programming Interfaces

  • When we make an API call, in effect we are stitching together a long URL that tells the website exactly what information we want

29 of 62

30 of 62

31 of 62

Testing some easy APIs where you control the call

32 of 62

Testing some easy APIs where you control the call

  • https://api.agify.io/?name=kat
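
A minimal sketch of making the same call from R, assuming the httr and jsonlite packages (the slide itself only shows the raw URL):

library(httr)      # for sending the HTTP request
library(jsonlite)  # for parsing the JSON response

response <- GET("https://api.agify.io/", query = list(name = "kat"))
result   <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
result$age   # the predicted age for the name "kat"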

33 of 62

Testing some easy APIs where you control the call

34 of 62

35 of 62

36 of 62

R Packages for APIs

  • Example: Reddit.com

  • Documentation: https://rpubs.com/mswofford/redditAPI
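
A hedged sketch of what a wrapper package looks like in practice, assuming a recent version of the RedditExtractoR package (the function names below may differ from those in the tutorial linked above):

# install.packages("RedditExtractoR")
library(RedditExtractoR)

# Find recent threads in a subreddit, then pull posts and comments
threads <- find_thread_urls(subreddit = "rstats", sort_by = "top", period = "month")
content <- get_thread_content(threads$url[1:5])   # first five threads only
head(content$comments)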

37 of 62

38 of 62

Throttling & Rate Limiting

Rate limiting: a client-side response to the maximum capacity of a channel

Throttling: a server-side response that signals to the caller that too many requests are coming from that client or that the server is overloaded
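
A minimal sketch of handling both on the client side (the endpoint and the one-request-per-second limit are illustrative assumptions):

library(httr)

urls <- sprintf("https://api.agify.io/?name=%s", c("kat", "omar", "olga"))

for (u in urls) {
  resp <- GET(u)
  if (status_code(resp) == 429) {   # the server throttled us: too many requests
    Sys.sleep(60)                   # back off, then retry once
    resp <- GET(u)
  }
  print(status_code(resp))
  Sys.sleep(1)                      # client-side rate limiting: roughly one request per second
}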

39 of 62

Screen scraping basics in R

 

40 of 62

Screen-Scraping

  • The process of automatically extracting data from web pages

  • Usually applied to lists of sites/pages too long to extract by hand

  • 4 basic steps (sketched in code below)
    • Load the name of the web page to be scraped
    • Download the website as HTML or XML
    • Find the pieces of information we want on that page
    • Put the information in a convenient format

Adapted from SICSS Day 2
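
A minimal rvest sketch of the four steps, using the Wikipedia page from the walkthrough that follows (the first table found this way may not be the one you want; later slides show how to target an exact table):

library(rvest)

url   <- "https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000"  # 1. name the page
page  <- read_html(url)                  # 2. download the HTML
nodes <- html_elements(page, "table")    # 3. find the pieces of information we want
tbl   <- html_table(nodes[[1]])          # 4. put the information in a convenient format
head(tbl)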

41 of 62

42 of 62

Is it Illegal?

  • In the early days of the internet, scraping was a free-for-all

  • More recently, websites have tried to restrict screen-scraping via their Terms of Service
    • The policy for automated access is often also spelled out in a file called ‘robots.txt’ at the site’s root

  • Difference between personal or research use and commercial use

  • Verdict: Use your best judgement and/or consult a lawyer

43 of 62

Let’s Try

  • https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000

44 of 62

Setting up the R Environment

  • Install the packages that we need
  • Make sure those packages are downloaded into our local R library

install.packages("rvest")

install.packages("selectr")

45 of 62

46 of 62

Tell R we want to use those packages now
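
A minimal version of this step, loading the packages installed on the previous slide:

library(rvest)     # read_html(), html_element(), html_table(), ...
library(selectr)   # translates CSS selectors to XPath (used under the hood)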

47 of 62

Find what we want to scrape

  • Return to the webpage of interest (Wikipedia) and figure out where on the page the information we want actually is
  • But we can’t look at the page the way a person does; we need to look at it the way the computer does

  • Source code
    • In Google Chrome you can use the drop-down menu at the top of your screen and select View -> Developer -> View Source

48 of 62

49 of 62

Read this source code into R

  • Using the handy function read_html()
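
For example, with the Wikipedia URL from the earlier slide:

library(rvest)

who_url  <- "https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000"
who_page <- read_html(who_url)
who_page   # an xml_document object holding the page's raw HTML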

50 of 62

The website is now in R… but we need to parse the HTML file

51 of 62

52 of 62

53 of 62

Right-click on the part of the webpage you want to scrape and choose ‘Inspect’

54 of 62

Right-click the highlighted element inside the developer window and select Copy, then Copy XPath

55 of 62

Now we can use this information to point R in the right direction
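
A sketch of that step; the XPath string below is only a placeholder, so paste in whatever ‘Copy XPath’ gave you for the table:

who_table_node <- html_element(who_page, xpath = '//*[@id="mw-content-text"]/div/table')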

56 of 62

Put the information back into Table form
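
Continuing the sketch, html_table() converts the selected node into a data frame:

who_rankings <- html_table(who_table_node)
head(who_rankings)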

57 of 62

Complications

58 of 62

Parsing with a CSS Selector

  • Some websites are more complicated, and we need a more flexible way of finding the elements we want
  • Here we will use an interactive tool (a Chrome browser extension) called “SelectorGadget”
    • You can find it at selectorgadget.com

59 of 62

Scraping Duke’s Mainpage (duke.edu)

60 of 62

Try clicking around the site with SelectorGadget to identify the CSS selector (or XPath)

61 of 62

Feed the information to R
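
A hedged sketch of the whole Duke example; the ".headline" selector is purely illustrative, so substitute whatever SelectorGadget reports for the elements you clicked:

library(rvest)

duke_page <- read_html("https://www.duke.edu")
headlines <- html_elements(duke_page, css = ".headline")   # selector copied from SelectorGadget
html_text(headlines)                                       # extract the visible text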

62 of 62

Other Complications

  • Scraping many, many pages (or pages within pages)
    • Nest your code in a ‘for’ loop that iterates through all of the available web pages (see the sketch below)

  • Needing to interact with a browser
    • Implement Selenium (e.g., via the RSelenium package)

  • Evading CAPTCHAs
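
A minimal sketch of the for-loop approach mentioned above (the paginated URLs are hypothetical):

library(rvest)

page_urls <- sprintf("https://example.com/listings?page=%d", 1:10)

results <- vector("list", length(page_urls))
for (i in seq_along(page_urls)) {
  page         <- read_html(page_urls[i])
  results[[i]] <- html_table(html_element(page, "table"))
  Sys.sleep(1)   # be polite: pause between requests
}
all_pages <- do.call(rbind, results)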