1 of 38

Leon Yin ∙ Bloomberg | Ilica Mahajan ∙ The Marshall Project | Jeff Kao ∙ ProPublica

https://bit.ly/nicar2024scrapepres

Large-scale scraping projects

2024-03-08, 10:15am

2 of 38

Roadmap

Leon: how to use a big chunky computer to do a large scraping project

Ilica: concepts for parallelizing big chunky computers for a large scraping project

Jeff: how we put these concepts into practice for a large scraping project

3 of 38

Leon Yin and Aaron Sankin

The Markup

4 of 38

Findings:

4 Internet Service Providers (ISPs) charge the same price for drastically different speeds based on where you live.

After collecting 1.1M+ Internet plans, we found that…

In major cities across the U.S., neighborhoods that were:

  • Historically redlined (22/22 cities = 100%)
  • Lower-income (35/38 cities = 92%)
  • Highest concentration of people of color (21/32 cities = 66%)

… were disproportionately asked to overpay for slow speeds.

5 of 38

“No WAN’s Land” - Major, Teixeira, and Mayer 2020

Used broadband availability tools (BATs) to scrape address-level data that revealed ISPs vastly overstated the availability, speed, and competition of their services to the FCC.

Source: Screen recording of AT&T’s lookup tool.

6 of 38

Trial analysis: AT&T - Green Bay, Wisconsin

7 of 38

8 of 38

Who are the actors?

4 of the nation’s largest providers practice “tier flattening”:

AT&T, Verizon, CenturyLink, EarthLink

They serve 44 states and Washington D.C.

9 of 38

Challenges

  • IP blocking
  • Browsers are slow
  • Where to get addresses?
  • Categorization?

10 of 38

Data Collection: Where to find addresses?

Simple 💪big😤data 👉 collect the largest city in each state. Major cities focus on urban areas.

Found open source addresses:

  • Used OpenAddresses and NYC Open Data to collect 12M addresses from 45 cities.
  • Incomplete data: used the Census Geocoder to attach the incorporated city and census block group to each address.
  • Goal: collect plans for 10% of addresses in each census block group (a stratified sample) to build a representative sample of each city (sketch below).
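A minimal sketch of that stratified sample in pandas, assuming the addresses have already been joined to their block group (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical input: one row per address, with its census block group attached
addresses = pd.read_csv("addresses_with_block_groups.csv")

# Take 10% of addresses within each block group so every block group is represented
sample = (
    addresses
    .groupby("block_group", group_keys=False)
    .sample(frac=0.10, random_state=0)
)
```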

11 of 38

Data Collection: using Undocumented APIs

Found and reverse-engineered the underlying APIs powering each search portal.

Built four scrapers using a similar set of sequential requests:

  • Autocomplete and verify address
  • Choose apartment number
  • Check availability and list plans

Only made possible by using a session to keep track of cookies and params.

Scraped ~100 addresses at once by making each scraper asynchronous.

Circumvented IP blocking using a proxy.

12 of 38

Example of a scraper as a Python function

Disclaimer: this is not a functioning example!
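Since the original code screenshot isn't reproduced here, a stand-in sketch of a scraper written as a plain Python function; the endpoint, parameters, and fields are hypothetical, not any ISP's real API:

```python
import requests

def scrape_address(address: str) -> dict:
    """Look up the broadband plans offered at one address (hypothetical endpoint)."""
    resp = requests.get(
        "https://example-isp.com/api/availability",  # placeholder URL
        params={"address": address},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # save the raw response; parse it later
```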

13 of 38

Example of a scraper as an asynchronous function

Disclaimer: this is not a functioning example!
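As a stand-in for the missing screenshot, the same hypothetical lookup rewritten as an asynchronous function using aiohttp:

```python
import aiohttp

async def scrape_address(session: aiohttp.ClientSession, address: str) -> dict:
    """Async version of the lookup; many of these can be in flight at once."""
    async with session.get(
        "https://example-isp.com/api/availability",  # placeholder URL
        params={"address": address},
    ) as resp:
        resp.raise_for_status()
        return await resp.json()
```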

14 of 38

Run asynchronously using the asyncio and aiohttp libraries.

The await keyword is used to make sure actions finish sequentially.
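A sketch of how such coroutines might be driven with asyncio: a whole batch of addresses runs concurrently, while each coroutine's own steps still happen in order (the endpoint and field names remain hypothetical):

```python
import asyncio
import aiohttp

async def scrape_address(session: aiohttp.ClientSession, address: str) -> dict:
    # Hypothetical endpoint, same as the previous sketch
    async with session.get(
        "https://example-isp.com/api/availability",
        params={"address": address},
    ) as resp:
        return await resp.json()

async def scrape_batch(addresses: list[str]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        # One coroutine per address; gather runs them concurrently
        return await asyncio.gather(*(scrape_address(session, a) for a in addresses))

results = asyncio.run(scrape_batch(["123 Main St, Green Bay, WI 54301"]))
```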

15 of 38

Use session to keep track of cookies and state across requests.

Disclaimer: this is not a functioning example!
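A hedged sketch of the idea, with hypothetical endpoints: one aiohttp ClientSession per address carries cookies and parameters through the sequential steps (verify the address, then list plans):

```python
import aiohttp

async def scrape_plans(address: str) -> dict:
    # The session's cookie jar persists across the requests below
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://example-isp.com/api/verify",   # placeholder URL
            params={"address": address},
        ) as resp:
            verified = await resp.json()
        async with session.get(
            "https://example-isp.com/api/plans",    # placeholder URL
            params={"addressId": verified["id"]},   # hypothetical field
        ) as resp:
            return await resp.json()
```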

16 of 38

Route requests through IP proxy to prevent rate limiting.

A proxy setting looks like this: {"http": "http://example-proxy.com:5321"}
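A minimal sketch of routing traffic through a proxy with the requests library (the proxy URL is a placeholder; with aiohttp you would pass a proxy= argument on each request instead):

```python
import requests

proxies = {
    "http": "http://example-proxy.com:5321",   # placeholder proxy
    "https": "http://example-proxy.com:5321",
}

resp = requests.get(
    "https://example-isp.com/api/availability",  # placeholder URL
    params={"address": "123 Main St"},
    proxies=proxies,
    timeout=30,
)
```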

17 of 38

Bookkeeping tips

Don’t repeat yourself: Structured naming system for outputs.

Make a to-do list: AWS SQS (a queuing service) is useful for keeping tabs on what’s to be done (sketch below).

Save receipts: Keep raw data, don’t parse until later.

Circumvent blocking: Use proxies as a last resort. Different levels of proxies.
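A sketch of the SQS to-do list with boto3, under assumed names (the queue URL and message fields are placeholders):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/addresses-todo"  # placeholder

# Enqueue one unit of work
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"block_group": "550250012001", "n_addresses": 100}),
)

# A worker pulls a job, does it, and only then deletes it from the queue
messages = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1).get("Messages", [])
for msg in messages:
    job = json.loads(msg["Body"])
    # ... scrape the addresses described by `job` ...
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```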

18 of 38

Resources

Finding undocumented APIs - a tutorial that lives online: https://inspectelement.org/apis

The U.S. Place Sampler - a tool by Big Local News and The Markup to build random samples of street addresses in cities, block groups, tracts, and more: https://usps.biglocalnews.org

19 of 38

  • Building and analyzing a novel dataset about the criminal court system
  • Not easy to access, understand, or make use of
  • Want it to be queryable to answer questions

20 of 38

21 of 38

[Diagram: Cases, Defendants, Charges · Airtable · GraphQL]

22 of 38

23 of 38

Scraping

  • All credit to Aaron Williams and David Eads for this part
  • AWS Elastic Container Service (ECS) = break down the problem into bite-sized chunks
  • Large-scale projects are all about breaking things down and thinking in terms of a series of functions, inputs, and outputs
  • Write a scraper that scrapes one page at a time and funnels documents to an S3 bucket
  • Write a script that takes a start and end case number and scrapes those cases (sketch below)
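A sketch of what that script might look like, with a hypothetical court URL and bucket name:

```python
import sys

import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "court-scrape-raw"  # placeholder bucket

def scrape_case(case_number: int) -> None:
    """Fetch one case page and store the raw HTML in S3; parse it later."""
    resp = requests.get(f"https://example-court.gov/case/{case_number}", timeout=30)  # placeholder URL
    resp.raise_for_status()
    s3.put_object(Bucket=BUCKET, Key=f"raw/{case_number}.html", Body=resp.text)

if __name__ == "__main__":
    start, end = int(sys.argv[1]), int(sys.argv[2])
    for case_number in range(start, end + 1):
        scrape_case(case_number)
```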

24 of 38

Scraping

Anti-scraping:

  • You may hit anti-scraping tech if you’re hitting the same domain over and over again
  • It may block your IP
  • But it can’t block the entire AWS IP space
  • So new containers get new IPs

25 of 38

Word of the day:

Idempotent
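Idempotent: running the same job twice leaves you in the same place as running it once. A minimal sketch of one way to get that property, assuming raw pages land in S3 under predictable keys (the bucket name and scrape_case helper are hypothetical, carried over from the earlier sketch):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "court-scrape-raw"  # placeholder bucket

def scrape_case(case_number: int) -> None:
    """Placeholder for the scraping function from the earlier sketch."""
    ...

def already_scraped(case_number: int) -> bool:
    try:
        s3.head_object(Bucket=BUCKET, Key=f"raw/{case_number}.html")
        return True
    except ClientError:
        return False

def scrape_if_needed(case_number: int) -> None:
    # Safe to re-run: cases already in the bucket are skipped
    if not already_scraped(case_number):
        scrape_case(case_number)
```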

26 of 38

Containerize

  • Bite-sized jobs
  • Dockerfiles: a recipe to create containers
  • Containers can be deployed locally or in the cloud
  • Separate steps: scrape to an S3 bucket, then parse into tables
  • AWS ECR to store container images
  • AWS ECS to run the containers in…

27 of 38

DOCKERFILES

a recipe to create containers
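A minimal sketch of such a recipe for a Python scraper (the file names and case-number environment variables are hypothetical):

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scrape_cases.py .

# The start/end case numbers are passed in as environment variables at run time
CMD ["sh", "-c", "python scrape_cases.py $START_CASE $END_CASE"]
```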

28 of 38

Parallelize

  • Run thousands of containers at once
  • For scraping against one domain, don’t run too many in parallel
  • Once scraped files are in an S3 bucket, go wild
  • Tighten iteration loops

29 of 38

30 of 38

Iteration

  • Tighten loops, experiment, re-run, change schemas to suit analysis
  • Tunable run times to parse >100K documents
  • Hasura (GraphQL) + Observable (Analysis + Data Memos)
  • Up-to-date fact checking

31 of 38

Story: Series examining Google’s ad business

GOALS:

  1. Given a list of disinfo websites, figure out which ones are advertising with Google.
  2. Determine a list of all websites on the internet advertising using Google (deanonymize sellers.json)

TASKS:

  • Design a system that can determine whether a site is advertising using Google’s ads platform.
  • Do this quickly (i.e., in parallel) for any arbitrarily long list of websites
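One simple way to check whether a site sells ads through Google's platform (not necessarily the exact method used in the series) is to fetch the site's ads.txt file and look for Google's ad exchange domain. A sketch:

```python
import requests

def uses_google_ads(domain: str) -> bool:
    """Rough check: does the site's ads.txt list Google as an ad system?"""
    try:
        resp = requests.get(f"https://{domain}/ads.txt", timeout=30)
    except requests.RequestException:
        return False
    if resp.status_code != 200:
        return False
    return any(line.strip().lower().startswith("google.com,")
               for line in resp.text.splitlines())
```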

32 of 38

SCRAPING (multiple websites in parallel)

[Diagram: ~30 containers (Docker, ECS/ECR/Fargate) scrape in parallel and write to S3, feeding a database of all sites]

33 of 38

MOVING YOUR WORK INTO THE CLOUD

  1. dockerize your scraping script
    • Set up Dockerfile (environment and code)
    • Create a docker image
  2. upload the docker image on ECR (Elastic Container Registry)
    • Essentially a repository of docker images
  3. define your container in ECS (Elastic Container Service)
    • Basically the type of computer you are running in the cloud
  4. kick off N instances
    • Via the AWS console or with a script (sketch below)

[Diagram: Docker image -> container]
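A sketch of step 4 as a script, using boto3 to kick off tasks on ECS; the cluster, task definition, container name, and subnet are placeholders you would swap for your own:

```python
import boto3

ecs = boto3.client("ecs")

N = 30  # number of parallel tasks to launch

for i in range(N):
    ecs.run_task(
        cluster="scraper-cluster",        # placeholder
        taskDefinition="scraper-task",    # placeholder
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [{
                "name": "scraper",  # placeholder container name
                "environment": [{"name": "CHUNK_INDEX", "value": str(i)}],
            }]
        },
    )
```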

34 of 38

PRACTICAL TIPS

  • BUILDING AND TESTING YOUR SYSTEM
    • Build it step by step:
      • Local computer => Local docker image => One cloud instance => Multiple cloud instances
    • Separate scraping and analysis
      • Collect data in as raw a state as possible
    • Back-of-the-envelope math (sketch after this list)
      • Manage run time, money, and time spent building
  • MANAGING COSTS/TIME
    • Look at the cost structure of your tools; design your system to save money
      • Storage and compute
        • Compute: Fargate vs EC2 - Fargate is about 20% more expensive, but it shuts down when it finishes the task
        • Storage: S3 (cost per transaction: directory listing vs retrieving a file; ongoing storage costs)
    • Turn on cloud resources only when you’re using them!


35 of 38

General Tips

Scheduling: Use cron (locally) or configure cloud instances to kick off at set times or intervals

Be discrete: Break tasks up into the smallest unit of work. Use queuing to communicate between tasks.

Keep logs: �Record what was done to help troubleshoot.

Optimize: Use queues and logs to find bottlenecks. Determine the infrastructure and number of instances needed.

Don’t always need to be fancy or in the cloud!

Parallelization: bit.ly/nicar2024_scrape

36 of 38

Thank you! Questions?

Leon Yin ∙ Bloomberg | Ilica Mahajan ∙ The Marshall Project | Jeff Kao ∙ ProPublica

37 of 38

New Story about OpenAI GPT Bias in hiring ->

Leon Yin, Davey Alba, Leonardo Nicoletti

Bloomberg News | March 7, 2024

38 of 38

Auditing Algorithms and AI for Bias

Leonardo Nicoletti, Victoria Turk, Leon Yin, Meredith Broussard

NICAR Baltimore - Saturday, March 9, 2:15 p.m. in Harborside C