1 of 62

SICSS Atlanta 2022

2 of 62

Announcements

  • Sign up for lightning talks (link pinned in the Slack & in your email)
  • Daily logistics

3 of 62

The Plan for Today

  • An example SICSS project
  • Introduction to digital trace data
  • Omar Asensio and Olga Churkina
  • Lunch & research talk
  • Group work

4 of 62

SICSS Sample Project

Day 2

5 of 62

Trying to find a new apartment

6 of 62

Trying to find a new apartment

7 of 62

Where is the Data From?

  • CrimeReports.com
    • Mines data from over 1,000 participating agencies; however, they report on their website that “Each agency controls their data flow to CrimeReports, including how often they send data, which incidents are included.”

  • Lucido (2011): Trulia’s data is not weighted by population and gives limited crime type information
  • Coronado Realty Group (2016): crime rates are combined by nearby areas that are actually separated by large bodies of water

8 of 62

Trying to find a new apartment

9 of 62

Information Asymmetries in Housing Markets

  • Who wins?
    • Real Estate Agents v. Individuals
    • People with resources or people without

  • Crime affects housing prices (Kat)
  • Consumer knowledge of foreclosure (Maria)

  • What if we could quantify the cost of the data distortion – thereby generating the literal price of misinformation?

10 of 62

11 of 62

Counts

12 of 62

Counts

Rates

13 of 62

14 of 62

Total Sample

  • Purchasing
    • 52.68% of participants would not purchase
    • 47.32% would purchase
  • Average cost
    • $246,362


15 of 62

Counts

  • Purchase price
    • $251,298
  • Buying
    • No: 40%
    • Yes: 60%


16 of 62

Counts

  • Purchase price
    • $251,298
  • Buying
    • No: 40%
    • Yes: 60%

Rates

  • Purchase Price
    • $241,276
  • Buying
    • No: 66%
    • Yes: 34%


17 of 62

Results

  • Everything else about the house was held constant

  • The crime data is all true
    • It’s just projected differently

  • Changing the projection substantially changed willingness to purchase and shifted the purchase price by more than $10,000


18 of 62

What happened after SICSS

  • We applied for funding from SICSS
  • We ran a pretest
    • We then spent the next full year running some more pretests
  • We launched the real thing
  • What started as a comment became a mock-up, and then a study

  • More importantly, it became a collaboration and a friendship between two people interested in the same problem who otherwise never would have worked together

19 of 62

Digital Trace Data

Day 2

20 of 62

Readymades vs. Custommades

  • Taking something that was created for one purpose and turning it into something different (creative re-purposing)
    • Digital Trace data
    • Data Scientists

  • Data created intentionally
    • Social science

Cleo the Clownfish from the Shedd Aquarium

21 of 62

Readymades vs. Custommades

  • Readymade: Twitter data

  • Custommade: survey data

Cleo the Clownfish from the Shedd Aquarium

22 of 62

What is digital trace data?

  • Large, digital datasets that describe human behavior
    • (e.g. social media sites, internet search data, blogs, administrative records, historical archives, audio-visual data, or geospatial data)

23 of 62

Strengths of digital trace data

  • Big
  • Always on
  • Non-reactive
  • Captures social relationships

24 of 62

Weaknesses of digital trace data

  • Inaccessible
  • Non-representative

25 of 62

26 of 62

Weaknesses of digital trace data

  • Inaccessible
  • Non-representative
  • Drifting
  • Algorithmically confounded
  • Unstructured
  • Sensitive
  • Incomplete
  • Elite/publication bias

27 of 62

Application Programming Interfaces

  • With the advent of Web 2.0, the internet became more interdependent
  • There are over 22,000 APIs today (though many of them are not public)
  • Links to public APIs

28 of 62

Application Programming Interfaces

  • When we make an API call, in effect we are stitching together a long URL that tells the website exactly what information we want

29 of 62

30 of 62

31 of 62

Testing some easy APIs where you control the call

32 of 62

Testing some easy APIs where you control the call

  • https://api.agify.io/?name=kat
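
A minimal sketch of making the same call from R, assuming the httr and jsonlite packages (the slide itself only shows the raw URL):

library(httr)      # for sending the HTTP request
library(jsonlite)  # for parsing the JSON response

response <- GET("https://api.agify.io/", query = list(name = "kat"))
result   <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
result$age   # the predicted age for the name "kat"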

33 of 62

Testing some easy APIs where you control the call

34 of 62

35 of 62

36 of 62

R Packages for APIs

  • Example: Reddit.com

  • Documentation: https://rpubs.com/mswofford/redditAPI
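
A hedged sketch of what a wrapper package looks like in practice, assuming a recent version of the RedditExtractoR package (the function names below may differ from those in the tutorial linked above):

# install.packages("RedditExtractoR")
library(RedditExtractoR)

# Find recent threads in a subreddit, then pull posts and comments
threads <- find_thread_urls(subreddit = "rstats", sort_by = "top", period = "month")
content <- get_thread_content(threads$url[1:5])   # first five threads only
head(content$comments)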

37 of 62

38 of 62

Throttling & Rate Limiting

Rate limiting: a client-side response to the maximum capacity of a channel

Throttling: a server-side response that signals to the caller that too many requests are coming from that client or that the server is overloaded
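
A minimal sketch of handling both on the client side (the endpoint and the one-request-per-second limit are illustrative assumptions):

library(httr)

urls <- sprintf("https://api.agify.io/?name=%s", c("kat", "omar", "olga"))

for (u in urls) {
  resp <- GET(u)
  if (status_code(resp) == 429) {   # the server throttled us: too many requests
    Sys.sleep(60)                   # back off, then retry once
    resp <- GET(u)
  }
  print(status_code(resp))
  Sys.sleep(1)                      # client-side rate limiting: roughly one request per second
}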

39 of 62

Screen scraping basics in R

 

40 of 62

Screen-Scraping

  • The process of automatically extracting data from web pages

  • Usually applied to lists of sites/pages too long to extract by hand

  • 4 basic steps (sketched in code below)
    • Load the name of the web page to be scraped
    • Download the website as HTML or XML
    • Find the pieces of information we want on that page
    • Put the information in a convenient format

Adapted from SICSS Day 2
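
A minimal rvest sketch of the four steps, using the Wikipedia page from the walkthrough that follows (the first table found this way may not be the one you want; later slides show how to target an exact table):

library(rvest)

url   <- "https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000"  # 1. name the page
page  <- read_html(url)                  # 2. download the HTML
nodes <- html_elements(page, "table")    # 3. find the pieces of information we want
tbl   <- html_table(nodes[[1]])          # 4. put the information in a convenient format
head(tbl)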

41 of 62

42 of 62

Is it Illegal?

  • In the early days of the internet, scraping was a free-for-all

  • More recently, websites have tried to restrict screen-scraping via their Terms of Service
    • The policy for automated access is often also spelled out in a file called ‘robots.txt’ at the site’s root

  • Difference between personal or research use and commercial use

  • Verdict: Use your best judgement and/or consult a lawyer

43 of 62

Let’s Try

  • https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000

44 of 62

Setting up the R Environment

  • Install the packages that we need
  • Make sure those packages are downloaded into our local R library

install.packages("rvest")

install.packages("selectr")

45 of 62

46 of 62

Tell R we want to use those packages now
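
A minimal version of this step, loading the packages installed on the previous slide:

library(rvest)     # read_html(), html_element(), html_table(), ...
library(selectr)   # translates CSS selectors to XPath (used under the hood)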

47 of 62

Find what we want to scrape

  • Return to the webpage of interest (Wikipedia) and figure out where on the page the information we want actually is
  • But we can’t look at the page the way a person does; we need to look at it the way the computer does

  • Source code
    • In Google Chrome you can use the drop-down menu at the top of your screen and select View -> Developer -> View Source

48 of 62

49 of 62

Read this source code into R

  • Using the handy function read_html()
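
For example, with the Wikipedia URL from the earlier slide:

library(rvest)

who_url  <- "https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000"
who_page <- read_html(who_url)
who_page   # an xml_document object holding the page's raw HTML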

50 of 62

The website is now in R… but we need to parse the HTML file

51 of 62

52 of 62

53 of 62

Right-click on the part of the webpage you want to scrape and choose ‘Inspect’

54 of 62

Right-click the highlighted element inside the developer window and select Copy, then Copy XPath

55 of 62

Now we can use this information to point R in the right direction
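
A sketch of that step; the XPath string below is only a placeholder, so paste in whatever ‘Copy XPath’ gave you for the table:

who_table_node <- html_element(who_page, xpath = '//*[@id="mw-content-text"]/div/table')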

56 of 62

Put the information back into Table form
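
Continuing the sketch, html_table() converts the selected node into a data frame:

who_rankings <- html_table(who_table_node)
head(who_rankings)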

57 of 62

Complications

58 of 62

Parsing with a CSS Selector

  • Some websites are more complicated, and we need a more flexible way of finding the elements we want
  • Here we will use an interactive tool (a Chrome browser extension) called “SelectorGadget”
    • You can find it at selectorgadget.com

59 of 62

Scraping Duke’s Mainpage (duke.edu)

60 of 62

Try clicking around the site with SelectorGadget to identify the CSS selector (or XPath)

61 of 62

Feed the information to R
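
A hedged sketch of the whole Duke example; the ".headline" selector is purely illustrative, so substitute whatever SelectorGadget reports for the elements you clicked:

library(rvest)

duke_page <- read_html("https://www.duke.edu")
headlines <- html_elements(duke_page, css = ".headline")   # selector copied from SelectorGadget
html_text(headlines)                                       # extract the visible text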

62 of 62

Other Complications

  • Scraping many, many pages (or pages within pages)
    • Nest your code in a ‘for’ loop that iterates through all of the available web pages (see the sketch below)

  • Needing to interact with a browser
    • Implement Selenium (e.g., via the RSelenium package)

  • Evading CAPTCHAs
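
A minimal sketch of the for-loop approach mentioned above (the paginated URLs are hypothetical):

library(rvest)

page_urls <- sprintf("https://example.com/listings?page=%d", 1:10)

results <- vector("list", length(page_urls))
for (i in seq_along(page_urls)) {
  page         <- read_html(page_urls[i])
  results[[i]] <- html_table(html_element(page, "table"))
  Sys.sleep(1)   # be polite: pause between requests
}
all_pages <- do.call(rbind, results)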