
SUSTAINABLE

SCRAPERS

PyData DC - October 9, 2016

http://shoutkey.com/from

NPR Visuals / @nprviz / http://blog.apps.npr.org

David Eads / @eads / davideads@gmail.com


TRIGGER WARNING

Mass shootings, guns, probably some salty language.


THINK LIKE A DATA JOURNALIST


“Much has been made of the fact that Orlando shooter Omar Mateen, who purchased both of the weapons he used in his deadly rampage legally, was twice investigated by the FBI for suspected terrorist ties and statements, but didn't end up on the terrorist watchlist. Even if he had, Mateen could still have legally purchased his guns; terrorism suspects aren't prohibited from owning firearms. Now, some typically on opposite sides of the gun control debate are pushing to change that.”


“But even if Mateen were on the watchlist, and even if known terrorists were barred from purchasing weapons from licensed dealers, unregulated private sales — made easier through online marketplaces like Armslist — still could have enabled him to buy his weapons.”

- Tasneem Raja


PLANNING TO SCRAPE


ASK YOURSELF … DO I NEED A SCRAPER?

Scrapers are awesome if you use them right. It’s worth knowing what they’re good for and what it takes to do them right.


IF YOU NEED A SCRAPER, YOU HAVE A DATA PROBLEM.

Scraping the output of what is likely a structured data source is “sub-optimal” as the computer nerds say.


#NEVERSCRAPE UNLESS YOU...

  • have no other way of liberating the data
  • budget appropriately
  • consider the ethical ramifications
  • read terms of service and do your legal research
  • talk to a lawyer (if you possibly can)


Can you scrape? Be a console and curl detective

  • What’s the URL structure?
  • What’s the page structure?
  • What kind of server is on the other end?
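
For instance, a quick reconnaissance pass can answer the last question from a few response headers. Here is a minimal sketch using the requests library; the URL is illustrative, not necessarily a real ArmsList path:

```python
# Reconnaissance sketch: check the status code and response headers
# before committing to a full scraper. The URL is illustrative only.
import requests

resp = requests.get("https://www.armslist.com/classifieds/search", timeout=10)

print(resp.status_code)                   # does the URL pattern resolve?
print(resp.headers.get("Server"))         # what kind of server is on the other end?
print(resp.headers.get("Cache-Control"))  # any caching hints?
print(resp.headers.get("Retry-After"))    # any explicit rate limiting?
```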


ArmsList due diligence

  • No better data source
  • Predictable URL structure
  • Decently formatted pages
  • Gleefully libertarian terms of service
  • OK’d with legal department
  • Servers fast, seemingly robust, no bad response header smells


Looks like they know how to run a server


Nothing prohibiting copying here

(and our lawyer agreed)


Scrapers are good at...

  • Understanding an information system / data model
  • Pressuring institutions to release data
  • Liberating data into structured formats
  • Occasionally even getting useful data


Scrapers require resources

  • Development: This is usually the easy part!
  • Testing: Scraping is all edge cases
  • Infrastructure: Non-trivial costs
  • Maintenance: Needed if you run the scraper periodically


SCRAPER ARCHITECTURE

Data models, tests, atomic requests, parallelization, cloud infrastructure.


Whoa, whoa, whoa wait. What about Scrapy, Mechanize, etc?


Sorry, we’re framework nihilists at NPR Viz.

Scrapy is awesome. Use it and enjoy it. Today, we’re talking about the technical choices and building blocks that make tools like Scrapy so powerful, and how you can build your own for exactly the problem you’re trying to solve.


The workflow

  • Controller script:
    • Harvests listing URLs from the site index with the index scraper
    • Takes the list of URLs and launches one listing scraper per URL
  • Index scraper:
    • Scrapes the ArmsList index for listing URLs
  • Listing scraper:
    • Takes a URL and makes an HTTP request
    • Parses the response with a model class instance
    • Serializes the model instance data to the console


1. Model classes



Encapsulate parsing with data model classes

  • Input: HTML/text to parse
  • Class interface should look like your output
  • Hide calculations (e.g. age from date of birth) and complexity (combining or splitting fields) behind the interface

DATA MODELS
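
The slide here was a screenshot of the project’s model class. As a stand-in, here is a minimal sketch of the pattern, assuming BeautifulSoup; the CSS selectors and field names are hypothetical, not ArmsList’s actual markup:

```python
# Sketch of the data-model pattern: the class takes raw HTML, hides the
# parsing, and exposes attributes shaped like the output row.
# Selectors and field names are hypothetical.
from bs4 import BeautifulSoup


class Listing(object):
    def __init__(self, html):
        self._soup = BeautifulSoup(html, "html.parser")

    @property
    def title(self):
        node = self._soup.select_one("h1.title")
        return node.get_text(strip=True) if node else None

    @property
    def price(self):
        node = self._soup.select_one(".price")
        return node.get_text(strip=True) if node else None

    @property
    def listed_date(self):
        node = self._soup.select_one(".listed-date")
        return node.get_text(strip=True) if node else None

    def as_row(self):
        # The interface looks like the output: one flat dict per listing.
        return {
            "title": self.title,
            "price": self.price,
            "listed_date": self.listed_date,
        }
```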


2. Scraper script



SCRAPER SCRIPT
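
The slide was a screenshot of the listing scraper; here is a hedged sketch of the same shape, assuming the requests library and the hypothetical Listing model above. Each invocation handles exactly one URL and writes one CSV row to stdout, which is what makes the GNU Parallel step later so simple:

```python
# Sketch of a listing scraper: take a URL on the command line, fetch it,
# parse it with the model class, and write one CSV row to stdout.
import csv
import sys

import requests

from models import Listing  # hypothetical module holding the model class


def scrape(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return Listing(resp.text).as_row()


if __name__ == "__main__":
    row = scrape(sys.argv[1])
    writer = csv.DictWriter(sys.stdout, fieldnames=sorted(row))
    writer.writerow(row)
```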


3. Index scraper



We’ll skip the index scraper; it’s exactly the same idea as the listing scraper, but instead of URLs, it uses state names.


4. Controller script



Leveraging GNU Parallel

  • Parallelization in Python is a bit of a pain -- it works, but it’s easy enough to screw up.
  • GNU Parallel magically parallelizes any shell command
  • Handles stdin/stdout gracefully
  • Fits naturally into simple shell processing workflows
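
A sketch of how the controller can tie it together; the script names are hypothetical, and in practice this is often just the one-line shell pipeline shown in the comment:

```python
# Controller sketch: harvest listing URLs with the index scraper, then let
# GNU Parallel run one listing scraper per URL. Shell equivalent, roughly:
#   python scrape_index.py | parallel --jobs 20 python scrape_listing.py {} > listings.csv
import subprocess

# Listing URLs, one per line, from the (hypothetical) index scraper.
urls = subprocess.run(
    ["python", "scrape_index.py"],
    capture_output=True, text=True, check=True,
).stdout

# Fan the URLs out to GNU Parallel; each job writes one CSV row to stdout.
with open("listings.csv", "w") as out:
    subprocess.run(
        ["parallel", "--jobs", "20", "python", "scrape_listing.py", "{}"],
        input=urls, stdout=out, text=True, check=True,
    )
```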


Scraping infrastructure

  • Cloud services like Amazon EC2 are very helpful
  • We used an Amazon c3.8xlarge
  • Scraped 80k URLs in about 16 minutes (about $0.50 in computing costs per scrape)
  • Don’t accidentally create a denial of service attack!


Proceed with care

  • It’s hard to know what level of traffic will knock out a server
  • Increase the number of parallel requests incrementally; back off if you notice ANY slowdown or hiccups
  • Be patient: Better to add a 1-second wait and have your scraper run for hours than the alternative
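
One way to bake that in is a small throttled fetch helper; this is a sketch with arbitrary thresholds, not what NPR actually used:

```python
# Polite-fetch sketch: always wait between requests, and wait longer if
# responses start slowing down. Thresholds are arbitrary.
import time

import requests


def polite_get(url, wait=1.0, slow_threshold=2.0):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Back off harder when the server looks stressed.
    if resp.elapsed.total_seconds() > slow_threshold:
        time.sleep(wait * 5)
    else:
        time.sleep(wait)
    return resp
```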


Putting late capitalist technical infrastructure monopolies to work

  • Always good to put your scraper near the site you’re scraping
  • Using a MaxMind GeoIP lookup to choose our EC2 region, we realized the ArmsList.com servers are in Amazon’s US East data center
  • By running our scraper in Amazon’s US East data center, we got lightning fast network performance
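
The lookup itself is only a few lines; here is a sketch assuming MaxMind’s geoip2 package and a locally downloaded GeoLite2-City.mmdb database:

```python
# Sketch: resolve the target's IP and geolocate it to pick a nearby EC2 region.
import socket

import geoip2.database

ip = socket.gethostbyname("www.armslist.com")
reader = geoip2.database.Reader("GeoLite2-City.mmdb")
record = reader.city(ip)
print(ip, record.country.iso_code, record.subdivisions.most_specific.name, record.city.name)
reader.close()
```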


Finishing up

  • Publish the CSV with the story
  • Document the code
  • Todos for NPR Viz
    • Write tests (at least now we can)
    • Use incremental listing IDs instead of repeating the full initial scrape
    • Other kinds of analysis