SUSTAINABLE
SCRAPERS
PyData DC - Oct. 9, 2016
NPR Visuals / @nprviz / http://blog.apps.npr.org
David Eads / @eads / davideads@gmail.com
TRIGGER WARNING
Mass shootings, guns, probably some salty language.
THINK LIKE A DATA JOURNALIST
“Much has been made of the fact that Orlando shooter Omar Mateen, who purchased both of the weapons he used in his deadly rampage legally, was twice investigated by the FBI for suspected terrorist ties and statements, but didn't end up on the terrorist watchlist. Even if he had, Mateen could still have legally purchased his guns; terrorism suspects aren't prohibited from owning firearms. Now, some typically on opposite sides of the gun control debate are pushing to change that.”
“But even if Mateen were on the watchlist, and even if known terrorists were barred from purchasing weapons from licensed dealers, unregulated private sales — made easier through online marketplaces like Armslist — still could have enabled him to buy his weapons.”
- Tasneem Raja
PLANNING TO SCRAPE
ASK YOURSELF … DO I NEED A SCRAPER?
Scrapers are awesome if you use them right. It’s worth knowing what they’re good for and what it takes to do them right.
IF YOU NEED A SCRAPER, YOU HAVE A DATA PROBLEM.
Scraping the output of what is likely a structured data source is “sub-optimal” as the computer nerds say.
#NEVERSCRAPE UNLESS YOU...
Can you scrape? Be a console and curl detective
ArmsList due diligence
Looks like they know how to run a server
Nothing prohibiting copying here
(and our lawyer agreed)
Scrapers are good at...
Scrapers require resources
SCRAPER ARCHITECTURE
Data models, tests, atomic requests, parallelization, cloud infrastructure.
Whoa, whoa, whoa wait. What about Scrapy, Mechanize, etc?
Sorry, we’re framework nihilists at NPR Viz.
Scrapy is awesome. Use it and enjoy it. Today, we’re talking about the technical choices and building blocks that make tools like Scrapy so powerful, and how you can build your own for exactly the problem you’re trying to solve.
The workflow
1. Model classes
Encapsulate parsing with data model classes
DATA MODELS
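A minimal sketch of what "encapsulate parsing with data model classes" can look like. This is not the code from the talk: the `Listing` class, the `title` / `price` / `state` field names, and the sample markup are all hypothetical, and the standard library's `html.parser` stands in for whatever parsing library you prefer. The point is that every selector and cleanup rule lives in one class, so the scraper script only ever sees clean attributes.

```python
from html.parser import HTMLParser


class _ClassTextParser(HTMLParser):
    """Collect text, keyed by the class of the innermost open element."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # class attribute (or None) of each open element

    def handle_starttag(self, tag, attrs):
        self._open.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        name = self._open[-1] if self._open else None
        text = data.strip()
        if name and text:
            self.fields[name] = self.fields.get(name, "") + text


class Listing:
    """Hypothetical data model: one scraped page in, clean fields out."""

    def __init__(self, html):
        parser = _ClassTextParser()
        parser.feed(html)
        self._fields = parser.fields

    @property
    def title(self):
        return self._fields.get("title")

    @property
    def state(self):
        return self._fields.get("state")

    @property
    def price(self):
        # Normalize "$1,250" to 1250.0; return None if missing or unparseable.
        raw = self._fields.get("price")
        if raw is None:
            return None
        try:
            return float(raw.replace("$", "").replace(",", ""))
        except ValueError:
            return None

    def to_row(self):
        """Return the fields in a fixed order, ready for csv.writer."""
        return [self.title, self.price, self.state]


# Assumed markup, for illustration only.
SAMPLE_HTML = """
<div class="listing">
  <h1 class="title">Example listing title</h1>
  <span class="price">$1,250</span>
  <span class="state">Virginia</span>
</div>
"""

listing = Listing(SAMPLE_HTML)
print(listing.to_row())
```

Because the class takes a plain HTML string, it can be unit tested against saved pages without touching the network.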
2. Scraper script
SCRAPER SCRIPT
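One way to read "atomic requests" is that each run of the scraper script handles exactly one URL and emits exactly one row, so a failure costs one page rather than the whole job. The sketch below assumes that reading. The names `fetch`, `parse`, `scrape_one` and `main` are hypothetical, and `parse` is a placeholder for a real data model class.

```python
import csv
import io
import sys
import urllib.request


def fetch(url, timeout=30):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html):
    """Placeholder for a data model class; returns the fields for one row."""
    return [len(html)]


def scrape_one(url, out=sys.stdout, fetcher=fetch, parser=parse):
    """Fetch one URL, parse it, and write exactly one CSV row.

    Returns True on success. On any failure, logs to stderr, writes
    nothing to `out`, and returns False so the caller can retry later.
    """
    try:
        row = parser(fetcher(url))
    except Exception as error:
        print(f"failed: {url}: {error}", file=sys.stderr)
        return False
    csv.writer(out).writerow([url] + list(row))
    return True


def main(argv):
    """Entry point for `python scrape_listing.py URL` (script name assumed)."""
    if len(argv) != 2:
        print("usage: scrape_listing.py URL", file=sys.stderr)
        return 2
    return 0 if scrape_one(argv[1]) else 1


# Offline demonstration: a fake fetcher replaces the network call.
demo = io.StringIO()
scrape_one("http://example.com/1", out=demo, fetcher=lambda url: "<html></html>")
print(demo.getvalue(), end="")
```

To run it as a command-line script, call `sys.exit(main(sys.argv))` at the bottom of the file. Passing the fetcher and parser as arguments keeps the script testable without hitting the site.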
3. Index scraper
We’ll skip the index scraper; it’s exactly the same idea as the listing scraper, but instead of URLs, it uses state names.
4. Controller script
Leveraging GNU Parallel
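GNU Parallel fans a list of inputs out to many copies of a single-purpose command. The sketch below is not GNU Parallel and not the controller from the talk. It is a Python stand-in that shows the same fan-out idea with the standard library's `concurrent.futures`, under the assumption that the worker takes one URL and returns a truthy value on success. The names `run_all` and `jobs` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def _safe(worker):
    """Wrap a worker so one bad URL cannot abort the whole batch."""
    def run(url):
        try:
            return bool(worker(url))
        except Exception:
            return False
    return run


def run_all(urls, worker, jobs=4):
    """Apply `worker` to every URL, with at most `jobs` running at once.

    Returns (succeeded, failed) so the failed URLs can be fed back in
    for another pass.
    """
    urls = list(urls)
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map yields results in the same order as the inputs.
        for url, ok in zip(urls, pool.map(_safe(worker), urls)):
            (succeeded if ok else failed).append(url)
    return succeeded, failed


# Demonstration with a fake worker; a real one would fetch and parse a page.
done, retry = run_all(
    ["page-1", "page-2-bad", "page-3"],
    worker=lambda url: not url.endswith("bad"),
    jobs=2,
)
print(done, retry)
```

Keep `jobs` small: the number of concurrent workers is also the number of simultaneous requests hitting someone else's server.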
Scraping infrastructure
Proceed with care
Putting late capitalist technical infrastructure monopolies to work
Finishing up