1 of 38

Leon Yin ∙ Bloomberg | Ilica Mahajan ∙ The Marshall Project | Jeff Kao ∙ ProPublica

https://bit.ly/nicar2024scrapepres

Large-scale scraping projects

2024-03-08, 10:15am

2 of 38

Roadmap

Leon: how to use a big chunky computer to do a large scraping project

Ilica: concepts for parallelizing big chunky computers for a large scraping project

Jeff: how we put these concepts into practice for a large scraping project

3 of 38

Leon Yin and Aaron Sankin

The Markup

4 of 38

Findings:

4 Internet Service Providers (ISPs) charge the same price for drastically different speeds based on where you live.

After collecting 1.1M+ Internet plans, we found that…

In major cities across the U.S., neighborhoods that were:

  • Historically redlined (22/22 cities = 100%)
  • Lower-income (35/38 cities = 92%)
  • Highest concentration of people of color (21/32 cities = 66%)

… were disproportionately asked to overpay for slow speeds.

5 of 38

“No WAN’s Land” - Major, Teixeira, and Mayer 2020

Used broadband availability tools (BATs) to scrape address-level data that revealed ISPs vastly overstated the availability, speed, and competition of their services to the FCC.

Source: Screen recording of AT&T’s lookup tool.

6 of 38

Trial analysis: AT&T - Green Bay, Wisconsin

7 of 38

8 of 38

Who are the actors?

4 of the nation’s largest providers practice “tier flattening”:

AT&T, Verizon, CenturyLink, EarthLink

They serve 44 states and Washington D.C.

9 of 38

Challenges

  • IP blocking
  • Browsers are slow
  • Where to get addresses?
  • Categorization?

10 of 38

Data Collection: Where to find addresses?

Simple 💪big😤data 👉 collect the largest city in each state. Major cities focus on urban areas.

Found open source addresses:

  • Used OpenAddresses and NYC Open Data to collect 12M addresses from 45 cities.
  • Incomplete data: used the Census Geocoder to attach the incorporated city and census block group to each address.
  • Goal: collect plans for 10% of addresses in each census block group (a stratified sample) to build a representative sample of each city (sketch below).
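A minimal sketch of that stratified sample in pandas, assuming the addresses have already been joined to their block group (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical input: one row per address, with its census block group attached
addresses = pd.read_csv("addresses_with_block_groups.csv")

# Take 10% of addresses within each block group so every block group is represented
sample = (
    addresses
    .groupby("block_group", group_keys=False)
    .sample(frac=0.10, random_state=0)
)
```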

11 of 38

Data Collection: using Undocumented APIs

Found and reverse-engineered the underlying APIs powering each search portal.

Built four scrapers using a similar set of sequential requests:

  • Autocomplete and verify address
  • Choose apartment number
  • Check availability and list plans

Only made possible by using a session to keep track of cookies and params.

Scraped ~100 addresses at once by making each scraper asynchronous.

Circumvented IP blocking using a proxy.

12 of 38

Example of a scraper as a Python function

Disclaimer: this is not a functioning example!
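Since the original code screenshot isn't reproduced here, a stand-in sketch of a scraper written as a plain Python function; the endpoint, parameters, and fields are hypothetical, not any ISP's real API:

```python
import requests

def scrape_address(address: str) -> dict:
    """Look up the broadband plans offered at one address (hypothetical endpoint)."""
    resp = requests.get(
        "https://example-isp.com/api/availability",  # placeholder URL
        params={"address": address},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # save the raw response; parse it later
```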

13 of 38

Example of a scraper as an asynchronous function

Disclaimer: this is not a functioning example!
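As a stand-in for the missing screenshot, the same hypothetical lookup rewritten as an asynchronous function using aiohttp:

```python
import aiohttp

async def scrape_address(session: aiohttp.ClientSession, address: str) -> dict:
    """Async version of the lookup; many of these can be in flight at once."""
    async with session.get(
        "https://example-isp.com/api/availability",  # placeholder URL
        params={"address": address},
    ) as resp:
        resp.raise_for_status()
        return await resp.json()
```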

14 of 38

Run asynchronously using the asyncio and aiohttp libraries.

The await keyword is used to make sure actions finish sequentially.
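A sketch of how such coroutines might be driven with asyncio: a whole batch of addresses runs concurrently, while each coroutine's own steps still happen in order (the endpoint and field names remain hypothetical):

```python
import asyncio
import aiohttp

async def scrape_address(session: aiohttp.ClientSession, address: str) -> dict:
    # Hypothetical endpoint, same as the previous sketch
    async with session.get(
        "https://example-isp.com/api/availability",
        params={"address": address},
    ) as resp:
        return await resp.json()

async def scrape_batch(addresses: list[str]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        # One coroutine per address; gather runs them concurrently
        return await asyncio.gather(*(scrape_address(session, a) for a in addresses))

results = asyncio.run(scrape_batch(["123 Main St, Green Bay, WI 54301"]))
```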

15 of 38

Use session to keep track of cookies and state across requests.

Disclaimer: this is not a functioning example!
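A hedged sketch of the idea, with hypothetical endpoints: one aiohttp ClientSession per address carries cookies and parameters through the sequential steps (verify the address, then list plans):

```python
import aiohttp

async def scrape_plans(address: str) -> dict:
    # The session's cookie jar persists across the requests below
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://example-isp.com/api/verify",   # placeholder URL
            params={"address": address},
        ) as resp:
            verified = await resp.json()
        async with session.get(
            "https://example-isp.com/api/plans",    # placeholder URL
            params={"addressId": verified["id"]},   # hypothetical field
        ) as resp:
            return await resp.json()
```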

16 of 38

Route requests through IP proxy to prevent rate limiting.

A proxy setting looks like this: {"http": "http://example-proxy.com:5321"}
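A minimal sketch of routing traffic through a proxy with the requests library (the proxy URL is a placeholder; with aiohttp you would pass a proxy= argument on each request instead):

```python
import requests

proxies = {
    "http": "http://example-proxy.com:5321",   # placeholder proxy
    "https": "http://example-proxy.com:5321",
}

resp = requests.get(
    "https://example-isp.com/api/availability",  # placeholder URL
    params={"address": "123 Main St"},
    proxies=proxies,
    timeout=30,
)
```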

17 of 38

Bookkeeping tips

Don’t repeat yourself: Structured naming system for outputs.

Make a to-do list: AWS SQS (a queuing service) is useful for keeping tabs on what’s to be done (sketch below).

Save receipts: Keep raw data, don’t parse until later.

Circumvent blocking: Use proxies as a last resort. Different levels of proxies.
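A sketch of the SQS to-do list with boto3, under assumed names (the queue URL and message fields are placeholders):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/addresses-todo"  # placeholder

# Enqueue one unit of work
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"block_group": "550250012001", "n_addresses": 100}),
)

# A worker pulls a job, does it, and only then deletes it from the queue
messages = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1).get("Messages", [])
for msg in messages:
    job = json.loads(msg["Body"])
    # ... scrape the addresses described by `job` ...
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```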

18 of 38

Resources

Finding undocumented APIs - a tutorial that lives online: https://inspectelement.org/apis

The U.S. Place Sampler - a tool by Big Local News and The Markup to build random samples of street addresses in cities, block groups, tracts, and more: https://usps.biglocalnews.org

19 of 38

  • Building and analyzing a novel dataset about the criminal court system
  • Not easy to access, understand, or make use of
  • Want it to be queryable to answer questions

20 of 38

21 of 38

[Diagram: Cases, Defendants, Charges · Airtable · GraphQL]

22 of 38

23 of 38

Scraping

  • All credit to Aaron Williams and David Eads for this part
  • AWS Elastic Container Service (ECS) = break down the problem into bite-sized chunks
  • Large-scale projects are all about breaking things down and thinking in terms of a series of functions, inputs, and outputs
  • Write a scraper that scrapes one page at a time and funnels documents to an S3 bucket
  • Write a script that takes a start and end case number and scrapes those cases (sketch below)
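A sketch of what that script might look like, with a hypothetical court URL and bucket name:

```python
import sys

import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "court-scrape-raw"  # placeholder bucket

def scrape_case(case_number: int) -> None:
    """Fetch one case page and store the raw HTML in S3; parse it later."""
    resp = requests.get(f"https://example-court.gov/case/{case_number}", timeout=30)  # placeholder URL
    resp.raise_for_status()
    s3.put_object(Bucket=BUCKET, Key=f"raw/{case_number}.html", Body=resp.text)

if __name__ == "__main__":
    start, end = int(sys.argv[1]), int(sys.argv[2])
    for case_number in range(start, end + 1):
        scrape_case(case_number)
```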

24 of 38

Scraping

Anti-scraping:

  • You may hit anti-scraping tech if you’re hitting the same domain over and over again
  • It may block your IP
  • But it can’t block the entire AWS IP space
  • So new containers get new IPs

25 of 38

Word of the day:

Idempotent
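Idempotent: running the same job twice leaves you in the same place as running it once. A minimal sketch of one way to get that property, assuming raw pages land in S3 under predictable keys (the bucket name and scrape_case helper are hypothetical, carried over from the earlier sketch):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "court-scrape-raw"  # placeholder bucket

def scrape_case(case_number: int) -> None:
    """Placeholder for the scraping function from the earlier sketch."""
    ...

def already_scraped(case_number: int) -> bool:
    try:
        s3.head_object(Bucket=BUCKET, Key=f"raw/{case_number}.html")
        return True
    except ClientError:
        return False

def scrape_if_needed(case_number: int) -> None:
    # Safe to re-run: cases already in the bucket are skipped
    if not already_scraped(case_number):
        scrape_case(case_number)
```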

26 of 38

Containerize

  • Bite-sized jobs
  • Dockerfiles: a recipe to create containers
  • Containers can be deployed locally or in the cloud
  • Separate steps: scrape to an S3 bucket, then parse into tables
  • AWS ECR to store container images
  • AWS ECS to run the containers in…

27 of 38

DOCKERFILES

a recipe to create containers
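A minimal sketch of such a recipe for a Python scraper (the file names and case-number environment variables are hypothetical):

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scrape_cases.py .

# The start/end case numbers are passed in as environment variables at run time
CMD ["sh", "-c", "python scrape_cases.py $START_CASE $END_CASE"]
```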

28 of 38

Parallelize

  • Run thousands of containers at once
  • For scraping against one domain, don’t run too many in parallel
  • Once scraped files are in an S3 bucket, go wild
  • Tighten iteration loops

29 of 38

30 of 38

Iteration

  • Tighten loops, experiment, re-run, change schemas to suit analysis
  • Tunable run times to parse >100K documents
  • Hasura (GraphQL) + Observable (Analysis + Data Memos)
  • Up-to-date fact checking

31 of 38

Story: Series examining Google’s ad business

GOALS:

  1. Given a list of disinfo websites, figure out which ones are advertising with Google.
  2. Determine a list of all websites on the internet advertising using Google (deanonymize sellers.json)

TASKS:

  • Design a system that can determine whether a site is advertising using Google’s ads platform.
  • Do this quickly (i.e., in parallel) for any arbitrarily long list of websites
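One simple way to check whether a site sells ads through Google's platform (not necessarily the exact method used in the series) is to fetch the site's ads.txt file and look for Google's ad exchange domain. A sketch:

```python
import requests

def uses_google_ads(domain: str) -> bool:
    """Rough check: does the site's ads.txt list Google as an ad system?"""
    try:
        resp = requests.get(f"https://{domain}/ads.txt", timeout=30)
    except requests.RequestException:
        return False
    if resp.status_code != 200:
        return False
    return any(line.strip().lower().startswith("google.com,")
               for line in resp.text.splitlines())
```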

32 of 38

SCRAPING (multiple websites in parallel)

[Diagram: ~30 containers (Docker, ECS/ECR/Fargate) scrape in parallel and write to S3, feeding a database of all sites]

33 of 38

MOVING YOUR WORK INTO THE CLOUD

  1. dockerize your scraping script
    • Set up Dockerfile (environment and code)
    • Create a docker image
  2. upload the docker image on ECR (Elastic Container Registry)
    • Essentially a repository of docker images
  3. define your container in ECS (Elastic Container Service)
    • Basically the type of computer you are running in the cloud
  4. kick off N instances
    • Via the AWS console or with a script (sketch below)

[Diagram: Docker image -> container]
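A sketch of step 4 as a script, using boto3 to kick off tasks on ECS; the cluster, task definition, container name, and subnet are placeholders you would swap for your own:

```python
import boto3

ecs = boto3.client("ecs")

N = 30  # number of parallel tasks to launch

for i in range(N):
    ecs.run_task(
        cluster="scraper-cluster",        # placeholder
        taskDefinition="scraper-task",    # placeholder
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [{
                "name": "scraper",  # placeholder container name
                "environment": [{"name": "CHUNK_INDEX", "value": str(i)}],
            }]
        },
    )
```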

34 of 38

PRACTICAL TIPS

  • BUILDING AND TESTING YOUR SYSTEM
    • Build it step by step:
      • Local computer => Local docker image => One cloud instance => Multiple cloud instances
    • Separate scraping and analysis
      • Collect data in as raw a state as possible
    • Back-of-the-envelope math (sketch after this list)
      • Manage run time, money, and time spent building
  • MANAGING COSTS/TIME
    • Look at the cost structure of your tools; design your system to save money
      • Storage and compute
        • Compute: Fargate vs EC2 - Fargate is about 20% more expensive, but it shuts down when it finishes the task
        • Storage: S3 (cost per transaction: directory listing vs retrieving a file; ongoing storage costs)
    • Turn on cloud resources only when you’re using them!


35 of 38

General Tips

Scheduling: Use cron (locally) or configure cloud instances to kick off at set times or intervals

Be discrete: Break tasks up into the smallest unit of work. Use queuing to communicate between tasks.

Keep logs: �Record what was done to help troubleshoot.

Optimize: Use queues and logs to find bottlenecks. Determine the infrastructure and number of instances needed.

Don’t always need to be fancy or in the cloud!

Parallelization: bit.ly/nicar2024_scrape

36 of 38

Thank you! Questions?

Leon Yin ∙ Bloomberg | Ilica Mahajan ∙ The Marshall Project | Jeff Kao ∙ ProPublica

37 of 38

New Story about OpenAI GPT Bias in hiring ->

Leon Yin, Davey Alba, Leonardo Nicoletti

Bloomberg News | March 7, 2024

38 of 38

Auditing Algorithms and AI for Bias

Leonardo Nicoletti, Victoria Turk, Leon Yin, Meredith Broussard

NICAR Baltimore - Saturday, March 9, 2:15 p.m. in Harborside C