
SUSTAINABLE

SCRAPERS

PyData DC - October 9, 2016

http://shoutkey.com/from

NPR Visuals / @nprviz / http://blog.apps.npr.org

David Eads / @eads / davideads@gmail.com


TRIGGER WARNING

Mass shootings, guns, probably some salty language.


THINK LIKE A DATA JOURNALIST


“Much has been made of the fact that Orlando shooter Omar Mateen, who purchased both of the weapons he used in his deadly rampage legally, was twice investigated by the FBI for suspected terrorist ties and statements, but didn't end up on the terrorist watchlist. Even if he had, Mateen could still have legally purchased his guns; terrorism suspects aren't prohibited from owning firearms. Now, some typically on opposite sides of the gun control debate are pushing to change that.”


“But even if Mateen were on the watchlist, and even if known terrorists were barred from purchasing weapons from licensed dealers, unregulated private sales — made easier through online marketplaces like Armslist — still could have enabled him to buy his weapons.”

- Tasneem Raja


PLANNING TO SCRAPE


ASK YOURSELF … DO I NEED A SCRAPER?

Scrapers are awesome if you use them right. It’s worth knowing what they’re good for and what it takes to do them right.


IF YOU NEED A SCRAPER, YOU HAVE A DATA PROBLEM.

Scraping the output of what is likely a structured data source is “sub-optimal” as the computer nerds say.


#NEVERSCRAPE UNLESS YOU...

  • have no other way of liberating the data
  • budget appropriately
  • consider the ethical ramifications
  • read terms of service and do your legal research
  • talk to a lawyer (if you possibly can)


Can you scrape? Be a console and curl detective

  • What’s the URL structure?
  • What’s the page structure?
  • What kind of server is on the other end?
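
For instance, a quick reconnaissance pass can answer the last question from a few response headers. Here is a minimal sketch using the requests library; the URL is illustrative, not necessarily a real ArmsList path:

```python
# Reconnaissance sketch: check the status code and response headers
# before committing to a full scraper. The URL is illustrative only.
import requests

resp = requests.get("https://www.armslist.com/classifieds/search", timeout=10)

print(resp.status_code)                   # does the URL pattern resolve?
print(resp.headers.get("Server"))         # what kind of server is on the other end?
print(resp.headers.get("Cache-Control"))  # any caching hints?
print(resp.headers.get("Retry-After"))    # any explicit rate limiting?
```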


ArmsList due diligence

  • No better data source
  • Predictable URL structure
  • Decently formatted pages
  • Gleefully libertarian terms of service
  • OK’d with legal department
  • Servers fast, seemingly robust, no bad response header smells


Looks like they know how to run a server


Nothing prohibiting copying here

(and our lawyer agreed)


Scrapers are good at...

  • Understanding an information system / data model
  • Pressuring institutions to release data
  • Liberating data into structured formats
  • Occasionally even getting useful data


Scrapers require resources

  • Development: This is usually the easy part!
  • Testing: Scraping is all edge cases
  • Infrastructure: Non-trivial costs
  • Maintenance: Needed if you run the scraper periodically


SCRAPER ARCHITECTURE

Data models, tests, atomic requests, parallelization, cloud infrastructure.


Whoa, whoa, whoa wait. What about Scrapy, Mechanize, etc?


Sorry, we’re framework nihilists at NPR Viz.

Scrapy is awesome. Use it and enjoy it. Today, we’re talking about the technical choices and building blocks that make tools like Scrapy so powerful, and how you can build your own for exactly the problem you’re trying to solve.


The workflow

  • Controller script:
    • Harvests listing URLs from the site index with the index scraper
    • Takes the list of URLs and launches one listing scraper per URL
  • Index scraper:
    • Scrapes the ArmsList index for listing URLs
  • Listing scraper:
    • Takes a URL and makes an HTTP request
    • Parses the response with a model class instance
    • Serializes the model instance data to the console


1. Model classes



Encapsulate parsing with data model classes

  • Input: HTML/text to parse
  • Class interface should look like your output
  • Hide calculations (e.g. age from date of birth) and complexity (combining or splitting fields) behind the interface

DATA MODELS
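
The slide here was a screenshot of the project’s model class. As a stand-in, here is a minimal sketch of the pattern, assuming BeautifulSoup; the CSS selectors and field names are hypothetical, not ArmsList’s actual markup:

```python
# Sketch of the data-model pattern: the class takes raw HTML, hides the
# parsing, and exposes attributes shaped like the output row.
# Selectors and field names are hypothetical.
from bs4 import BeautifulSoup


class Listing(object):
    def __init__(self, html):
        self._soup = BeautifulSoup(html, "html.parser")

    @property
    def title(self):
        node = self._soup.select_one("h1.title")
        return node.get_text(strip=True) if node else None

    @property
    def price(self):
        node = self._soup.select_one(".price")
        return node.get_text(strip=True) if node else None

    @property
    def listed_date(self):
        node = self._soup.select_one(".listed-date")
        return node.get_text(strip=True) if node else None

    def as_row(self):
        # The interface looks like the output: one flat dict per listing.
        return {
            "title": self.title,
            "price": self.price,
            "listed_date": self.listed_date,
        }
```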


2. Scraper script



SCRAPER SCRIPT
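
The slide was a screenshot of the listing scraper; here is a hedged sketch of the same shape, assuming the requests library and the hypothetical Listing model above. Each invocation handles exactly one URL and writes one CSV row to stdout, which is what makes the GNU Parallel step later so simple:

```python
# Sketch of a listing scraper: take a URL on the command line, fetch it,
# parse it with the model class, and write one CSV row to stdout.
import csv
import sys

import requests

from models import Listing  # hypothetical module holding the model class


def scrape(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return Listing(resp.text).as_row()


if __name__ == "__main__":
    row = scrape(sys.argv[1])
    writer = csv.DictWriter(sys.stdout, fieldnames=sorted(row))
    writer.writerow(row)
```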


3. Index scraper



We’ll skip the index scraper; it’s exactly the same idea as the listing scraper, but instead of URLs, it uses state names.


4. Controller script



Leveraging GNU Parallel

  • Parallelization in Python is a bit of a pain -- it works, but it’s easy enough to screw up.
  • GNU Parallel magically parallelizes any shell command
  • Handles stdin/stdout gracefully
  • Fits naturally into simple shell processing workflows
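
A sketch of how the controller can tie it together; the script names are hypothetical, and in practice this is often just the one-line shell pipeline shown in the comment:

```python
# Controller sketch: harvest listing URLs with the index scraper, then let
# GNU Parallel run one listing scraper per URL. Shell equivalent, roughly:
#   python scrape_index.py | parallel --jobs 20 python scrape_listing.py {} > listings.csv
import subprocess

# Listing URLs, one per line, from the (hypothetical) index scraper.
urls = subprocess.run(
    ["python", "scrape_index.py"],
    capture_output=True, text=True, check=True,
).stdout

# Fan the URLs out to GNU Parallel; each job writes one CSV row to stdout.
with open("listings.csv", "w") as out:
    subprocess.run(
        ["parallel", "--jobs", "20", "python", "scrape_listing.py", "{}"],
        input=urls, stdout=out, text=True, check=True,
    )
```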


Scraping infrastructure

  • Cloud services like Amazon EC2 are very helpful
  • We used an Amazon c3.8xlarge
  • Scraped 80k URLs in about 16 minutes (about $0.50 in computing costs per scrape)
  • Don’t accidentally create a denial of service attack!


Proceed with care

  • It’s hard to know what level of traffic will knock out a server
  • Increase the number of parallel requests incrementally; back off if you notice ANY slowdown or hiccups
  • Be patient: Better to add a 1-second wait and have your scraper run for hours than the alternative
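
One way to bake that in is a small throttled fetch helper; this is a sketch with arbitrary thresholds, not what NPR actually used:

```python
# Polite-fetch sketch: always wait between requests, and wait longer if
# responses start slowing down. Thresholds are arbitrary.
import time

import requests


def polite_get(url, wait=1.0, slow_threshold=2.0):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Back off harder when the server looks stressed.
    if resp.elapsed.total_seconds() > slow_threshold:
        time.sleep(wait * 5)
    else:
        time.sleep(wait)
    return resp
```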


Putting late capitalist technical infrastructure monopolies to work

  • Always good to put your scraper near the site you’re scraping
  • Using a MaxMind GeoIP lookup to choose our EC2 region, we realized the ArmsList.com servers are in Amazon’s US East data center
  • By running our scraper in Amazon’s US East data center, we got lightning fast network performance
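
The lookup itself is only a few lines; here is a sketch assuming MaxMind’s geoip2 package and a locally downloaded GeoLite2-City.mmdb database:

```python
# Sketch: resolve the target's IP and geolocate it to pick a nearby EC2 region.
import socket

import geoip2.database

ip = socket.gethostbyname("www.armslist.com")
reader = geoip2.database.Reader("GeoLite2-City.mmdb")
record = reader.city(ip)
print(ip, record.country.iso_code, record.subdivisions.most_specific.name, record.city.name)
reader.close()
```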


Finishing up

  • Publish the CSV with the story
  • Document the code
  • Todos for NPR Viz
    • Write tests (at least now we can)
    • Use incremental listing IDs instead of repeating the full initial scrape
    • Other kinds of analysis