Leon Yin ∙ Bloomberg | Ilica Mahajan ∙ The Marshall Project | Jeff Kao ∙ ProPublica
Large-scale scraping projects
2024-03-08, 10:15am
Roadmap
Leon: how to use a big chunky computer to do a large scraping project
Ilica: concepts for parallelizing big chunky computers for a large scraping project
Jeff: how we put these concepts into practice for a large scraping project
Still Loading
Leon Yin and Aaron Sankin
The Markup
Findings:
4 Internet Service Providers (ISPs) charge the same price for drastically different speeds based on where you live.
After collecting more than 1.1 million internet plans, we found that…
In major cities across the U.S., neighborhoods that were:
… were disproportionately asked to overpay for slow speeds.
“No WAN’s Land” - Major, Teixeira, and Mayer 2020
Used broadband availability tools (BATs) to scrape address-level data that revealed ISPs vastly overstated the availability, speed, and competition of their services to the FCC.
Source: Screen recording of AT&T’s lookup tool.
Trial analysis: AT&T - Green Bay, Wisconsin
Who are the actors?
4 of the nation’s largest providers practice “tier flattening”:
AT&T, Verizon, CenturyLink, EarthLink
They serve 44 states and Washington D.C.
Challenges
Data Collection: Where to find addresses?
Simple 💪big😤data 👉 collect the largest city in each state.
Major cities focus on urban areas.
Found open source addresses:
Data Collection: using Undocumented APIs
Found and reverse-engineered the underlying APIs powering each search portal.
Built four scrapers using a similar set of sequential requests:
Only made possible by using a session to keep track of cookies and params.
Scraped ~100x addresses at once by making each scraper asynchronous.
Circumvented IP blocking using a proxy.
Example of a scraper as a Python function
Disclaimer: this is not a functioning example!
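The code from this slide did not survive in the text, so here is a minimal sketch of what a scraper written as a plain function might look like: a fixed sequence of requests that share one session. The host, the endpoint paths, the cookie name, and the `FakeSession` stub are all hypothetical; they exist only so the sketch runs offline. In real use you would pass in something like `requests.Session()` instead.

```python
class FakeResponse:
    """Stand-in for an HTTP response so the sketch runs offline."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class FakeSession:
    """Hypothetical stub that mimics the get/post interface of a session."""

    def __init__(self):
        self.cookies = {}

    def get(self, url, params=None):
        # A real lookup page would set cookies here; we fake one.
        self.cookies["visitor_id"] = "abc123"
        return FakeResponse({"ok": True})

    def post(self, url, json=None):
        # Pretend the API refuses requests that arrive without the cookie.
        if "visitor_id" not in self.cookies:
            return FakeResponse({"error": "no session"})
        plans = [{"speed_mbps": 100, "price": 55}]  # made-up data
        return FakeResponse({"address": json["address"], "plans": plans})


def scrape_address(session, address):
    """Look up one address with a fixed sequence of requests."""
    base = "https://isp.example.com"  # hypothetical host
    # Step 1: load the lookup page so the session picks up cookies.
    session.get(f"{base}/availability")
    # Step 2: query the (hypothetical) undocumented API for this address.
    resp = session.post(f"{base}/api/plans", json={"address": address})
    return resp.json()


result = scrape_address(FakeSession(), "123 Main St, Green Bay, WI")
print(result["plans"])
```

The second request only succeeds because the first one left a cookie on the session, which is the reason the requests have to run in order.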
Example of a scraper as an asynchronous function
Disclaimer: this is not a functioning example!
Run asynchronously using the asyncio and aiohttp libraries.
The await keyword is used to make sure actions finish sequentially.
Use session to keep track of cookies and state across requests.
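As a rough illustration of the notes above, the sketch below runs many lookups concurrently with `asyncio` while capping how many are in flight at once. It is an assumption-laden stand-in: `fetch_plans` only sleeps instead of calling a real site, and the cap of 100 is borrowed from the "~100x" figure mentioned earlier. A real version would await `aiohttp` session calls where the sleep is.

```python
import asyncio


async def fetch_plans(address, semaphore):
    """Simulated lookup; a real version would await HTTP calls here."""
    async with semaphore:  # cap how many lookups are in flight at once
        await asyncio.sleep(0.01)  # stand-in for network latency
        return {"address": address, "plans": []}


async def scrape_all(addresses, max_concurrent=100):
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [fetch_plans(a, semaphore) for a in addresses]
    # gather runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*tasks)


addresses = [f"{n} Main St" for n in range(500)]  # made-up addresses
results = asyncio.run(scrape_all(addresses))
print(len(results))
```

Within one lookup, each `await` finishes before the next line runs. Across lookups, the event loop switches to another address whenever one is waiting on the network.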
Disclaimer: this is not a functioning example!
Route requests through IP proxy to prevent rate limiting.
A proxy looks like this: {"http": "http://example-proxy.com:5321"}
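To show where a proxy mapping like this plugs in, here is a small sketch using only the standard library. The proxy address is the placeholder from the slide, and the `https` entry is an addition for illustration. No request is actually sent.

```python
import urllib.request

# Placeholder proxy endpoint; a real one usually embeds credentials.
proxies = {
    "http": "http://example-proxy.com:5321",
    "https": "http://example-proxy.com:5321",  # added for illustration
}

# Build an opener that sends every request through the proxy.
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# opener.open("https://isp.example.com/availability") would go via the proxy.
print(sorted(proxies))
```

With the `requests` library, the same dictionary can be passed as the `proxies=` argument. With `aiohttp`, a single proxy URL is passed as `proxy=` on each request.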
Bookkeeping tips
Don’t repeat yourself: Structured naming system for outputs.
Make a todo list: AWS SQS (a queuing service) is useful for keeping tabs on what’s to be done.
Save receipts: Keep raw data, don’t parse until later.
Circumvent blocking: Use proxies as a last resort. Different levels of proxies.
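Two of these tips, a structured naming system and keeping raw data, can be sketched together. The directory layout (ISP, then state, then a hash of the address) and the function names are hypothetical choices, not the scheme the project used.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def output_path(root, isp, state, address):
    """Deterministic file name: the same address always maps to the same path."""
    digest = hashlib.sha1(address.encode("utf-8")).hexdigest()[:12]
    return Path(root) / isp / state / f"{digest}.json"


def save_raw(path, raw_text):
    """Write the unparsed response; return False if it was already collected."""
    if path.exists():
        return False  # already done, so the request can be skipped
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(raw_text, encoding="utf-8")
    return True


root = tempfile.mkdtemp()
path = output_path(root, "att", "WI", "123 Main St, Green Bay, WI")
raw = json.dumps({"plans": [{"speed_mbps": 100}]})  # made-up response body

first = save_raw(path, raw)
second = save_raw(path, raw)
print(first, second)
```

Because the path is derived from the input, checking whether the file exists doubles as a record of what has been scraped, and parsing can happen later from the saved files.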
Resources
Finding undocumented APIs - a tutorial that lives online
https://inspectelement.org/apis
The U.S. Place Sampler - a tool by Big Local News and The Markup to build random samples of street addresses in cities, block groups, tracts, and more.
https://usps.biglocalnews.org
Cases
Defendants
Charges
Airtable
GraphQL
Scraping
Scraping
Anti
Word of the day:
Idempotent
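An idempotent task gives the same end state whether it runs once or several times, which matters when a scraper crashes and is restarted. A minimal sketch, assuming a hypothetical `cases` table keyed by a case ID and stored in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (case_id TEXT PRIMARY KEY, charge TEXT)")


def upsert_case(conn, case_id, charge):
    """Idempotent write: re-running with the same input leaves one row."""
    conn.execute(
        "INSERT OR REPLACE INTO cases (case_id, charge) VALUES (?, ?)",
        (case_id, charge),
    )
    conn.commit()


# Simulate a scraper that was restarted and wrote the same record twice.
upsert_case(conn, "CR-001", "theft")
upsert_case(conn, "CR-001", "theft")

row_count = conn.execute("SELECT COUNT(*) FROM cases").fetchone()[0]
print(row_count)
```

A plain `INSERT` without the primary key would leave two rows here, so a rerun would quietly duplicate data.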
Containerize
DOCKER
FILES
a recipe to create containers
Parallelize
Iteration
Story: Series examining Google’s ad business
GOALS:
TASKS:
SCRAPING (multiple websites in parallel)
S3
database of all sites
docker/ecs/ecr/fargate
x 30
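One simple way to split a list of sites across parallel workers is to give each worker every Nth item. The sketch below assumes "x 30" means 30 copies of the same container, and that each copy learns its own index from somewhere such as an environment variable. Both are assumptions, and the site names are made up.

```python
def shard(items, num_workers, worker_index):
    """Return the slice of items that one worker is responsible for."""
    return items[worker_index::num_workers]


sites = [f"site-{n}.example.com" for n in range(100)]
num_workers = 30

shards = [shard(sites, num_workers, i) for i in range(num_workers)]

# Every site lands in exactly one shard.
assigned = sorted(site for part in shards for site in part)
print(assigned == sorted(sites))
```

Static sharding like this needs no coordination between workers. A shared queue is the alternative when some sites take much longer than others.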
MOVING YOUR WORK INTO THE CLOUD
Docker Image
Container
PRACTICAL TIPS
General Tips
Scheduling: Use cron (locally) or configure cloud instances to kick off at set times or intervals.
Be discrete: Break tasks up into the smallest unit of work. Use queuing to communicate between tasks.
Keep logs: Record what was done to help troubleshoot.
Optimize: Use queues and logs to find bottlenecks. Determine the infrastructure and number of instances needed.
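The queue and logging tips can be tried locally before moving anything to the cloud. This sketch uses the standard library's `queue.Queue` as a stand-in for a hosted queue such as SQS, four threads as stand-ins for separate instances, and a short sleep in place of the actual scrape. All URLs are made up.

```python
import logging
import queue
import threading
import time

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s %(threadName)s %(message)s"
)
log = logging.getLogger("scraper")

todo = queue.Queue()  # local stand-in for a hosted queue
done = queue.Queue()  # results handed to the next task


def worker():
    while True:
        url = todo.get()
        if url is None:  # sentinel: no more work
            todo.task_done()
            break
        start = time.perf_counter()
        time.sleep(0.01)  # stand-in for the actual scrape
        elapsed = time.perf_counter() - start
        log.info("scraped %s in %.3fs", url, elapsed)
        done.put((url, elapsed))
        todo.task_done()


threads = [threading.Thread(target=worker, name=f"worker-{i}") for i in range(4)]
for t in threads:
    t.start()
for n in range(20):
    todo.put(f"https://site-{n}.example.com")
for _ in threads:
    todo.put(None)  # one sentinel per worker
todo.join()
for t in threads:
    t.join()

print(done.qsize())
```

The per-item timings in the log show which step is slow, and changing the number of workers gives a rough sense of how many instances the job needs.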
Don’t always need to be fancy or in the cloud!
Parallelization: bit.ly/nicar2024_scrape
Thank you! Questions?
New Story about OpenAI GPT Bias in hiring ->
Leon Yin, Davey Alba, Leonardo Nicoletti
Bloomberg News | March 7, 2024
Auditing Algorithms and AI for Bias
Leonardo Nicoletti, Victoria Turk, Leon Yin, Meredith Broussard
NICAR Baltimore - Saturday, March 9, 2:15 p.m. in Harborside C