1 of 31

Demystifying Web Scraping

IRE 2012

Ted Han & Sean Sposito

2 of 31

Welcome to the Internet

publishing platform for the masses

3 of 31

Everything is published to the Internet

Well, not everything, but a lot.

4 of 31

WHAT YOU CAN SCRAPE

  • Government Databases
  • Crowdsourced content (IMDB, Wikipedia, Reddit)
  • Blogs
  • Newspaper Websites
  • Non-profit disclosures

5 of 31

All this data is available if you speak HTTP+HTML

(well, or if you have tools that speak HTTP+HTML)

6 of 31

Tools that Speak HTTP+HTML

  • Web Servers
  • Web Browsers and their web inspectors
  • Software tools such as Google Docs, Outwit Hub, or browser plugins
  • Programming libraries like jQuery, Mechanize, Scrapy, and many more

7 of 31

Or, a user's guide to web scraping and crawling tools.

8 of 31

Cultural Context

What you should know about the world of

9 of 31

Hypertext

Like text, you know, but hyper

10 of 31

HTTP

Hyper Text Transfer Protocol

(i.e. how web browsers and servers communicate)

11 of 31

Clients make requests

Requests are made in the form of a URL + some additional information
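
As a rough illustration of "a URL + some additional information", here is a minimal sketch using Python's standard urllib.request module. The request is only built, never sent, and the User-Agent value is a made-up example of the extra information a client can attach.

```python
from urllib.request import Request

# Build (but do not send) a request for the IRE conferences page.
# The header value is a hypothetical example, not something the site requires.
request = Request(
    "http://ire.org/conferences/",
    headers={"User-Agent": "ire-scraping-demo"},
)

print(request.get_method())    # the HTTP verb the client would use
print(request.host)            # which server would be asked
print(request.selector)        # which resource on that server
print(request.header_items())  # the additional information sent along
```

Passing the same object to urllib.request.urlopen would actually perform the request; that step is left out here so the sketch runs without a network connection.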

12 of 31

URLs specify what is being requested

13 of 31

http://ire.org/conferences/

scheme / protocol: http

Domain: ire.org

Path: /conferences/
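
The same breakdown can be done by a program. This is a small sketch using Python's standard urllib.parse module, which names the domain part "netloc".

```python
from urllib.parse import urlsplit

# Split the slide's example URL into its named parts.
parts = urlsplit("http://ire.org/conferences/")

print(parts.scheme)  # scheme / protocol
print(parts.netloc)  # domain
print(parts.path)    # path
```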

14 of 31

https://www.google.com/search?q=investigative+reporters+%26+editors&hl=en

Protocol: https

Domain: www.google.com

Path: /search

Query: q=investigative+reporters+%26+editors&hl=en
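
The query is a set of name=value pairs in encoded form. As a sketch, Python's standard urllib.parse module can decode it: "+" stands for a space and "%26" for an ampersand.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.google.com/search?q=investigative+reporters+%26+editors&hl=en"
parts = urlsplit(url)

# parse_qs turns the query string into a dict of name -> list of values,
# decoding "+" and percent-escapes along the way.
params = parse_qs(parts.query)

print(parts.path)
print(params["q"][0])
print(params["hl"][0])
```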

15 of 31

https://twitter.com/#!/knowtheory

Protocol: https

Domain: twitter.com

Path: /

Fragment Identifier: #!/knowtheory
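
A short sketch with Python's standard urllib.parse module shows where the fragment ends up. Worth knowing for scraping: browsers keep the fragment to themselves and do not send it to the server, so a plain request for this URL fetches the page at the path "/" and leaves the rest to scripts running in the page.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://twitter.com/#!/knowtheory")

print(parts.netloc)    # domain
print(parts.path)      # the path is just the site root
print(parts.fragment)  # everything after the "#"
```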

16 of 31

HTML

Hyper Text Markup Language

(The language websites are expressed in)

17 of 31

It's a markup language

Conceived of as a tool to annotate text

18 of 31

It's nested / tree-like

And we can point to branches in the tree

with tools like CSS selectors, or XPath
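
To make "pointing to branches" concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which understands a small subset of XPath. The page below is invented for the example and is deliberately well-formed; real web pages are often messier and usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page made up for this example.
page = """
<html>
  <body>
    <h1>Events</h1>
    <ul class="events">
      <li><a href="/events/1">First event</a></li>
      <li><a href="/events/2">Second event</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(page)

# Walk down the tree: every <a> inside an <li> inside the "events" list.
links = root.findall(".//ul[@class='events']/li/a")
for link in links:
    print(link.text, link.get("href"))
```

The CSS selector for the same branch would be ul.events li a; only the notation differs.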

19 of 31

DOCS OVERVIEW

Google Docs is a suite of easy and powerful applications for the office, and this tip sheet will explain how to import data from sources on the Internet into a spreadsheet.

While Google Docs cannot currently compete with more powerful programming tools such as Python or Ruby on Rails, users of all skill levels are attracted to the service for the ease of creating and launching full-featured applications.

  • Beginning data workers will find plenty of tutorials within the documentation for adding basic functionality to their spreadsheets and mashing data up in new ways.
  • Experienced users of spreadsheet software such as Microsoft Excel will appreciate the fluid crossover of services and the familiar workings of various spreadsheet functions.
  • Advanced users versed in programming or scripting languages can harness the most powerful features of Google Docs and Apps Script for RESTful interaction with websites and creating rapid prototypes of software applications.

20 of 31

Three Functions

ImportFEED

ImportHTML

ImportXML

21 of 31

22 of 31

23 of 31

24 of 31

FAKE XML

A typical book might be represented in XML like so:

<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <author>
      <firstName>Francis</firstName>
      <lastName>Fitzgerald</lastName>
    </author>
    <yearPublished>1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <author>
      <firstName>Mark</firstName>
      <lastName>Twain</lastName>
      <firstName type="real">Samuel</firstName>
      <lastName type="real">Clemens</lastName>
    </author>
    <yearPublished where="England">1884</yearPublished>
    <yearPublished where="United States">1885</yearPublished>
  </book>
</xml>

25 of 31

THE BREAK-DOWN

To break this down:

  • <book> is an element; all elements must have a beginning and an ending, and elements can be “nested” within other elements. The nested elements are sometimes referred to as existing within a “tree”, with each nested element representing a tree branch.
  • type="fiction" is an attribute; this describes something about the element. There can be several attributes within an element, and any name can be given depending on the Document Type Definition or other user-defined rules.
  • text is contained between the element tags; this is represented by the name of the book (The Great Gatsby), the author (Fitzgerald) and year published (1925).
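
The three ideas above (element, attribute, text) map directly onto what a parser hands back. This is a minimal sketch using Python's standard xml.etree.ElementTree module on the first book from the Fake XML.

```python
import xml.etree.ElementTree as ET

snippet = """
<book type="fiction">
  <title>The Great Gatsby</title>
  <author>
    <firstName>Francis</firstName>
    <lastName>Fitzgerald</lastName>
  </author>
  <yearPublished>1925</yearPublished>
</book>
"""

book = ET.fromstring(snippet)

print(book.tag)     # the element's name
print(book.attrib)  # its attributes, as a dict

# Nested elements are the branches hanging off this one.
for child in book:
    print(child.tag)

# Text is whatever sits between an element's opening and closing tags.
print(book.find("title").text)
print(book.find("author/lastName").text)
```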

26 of 31

XPATH

Docs uses a language called XPath, which addresses parts of an XML document.

Now, if we use our Fake XML above and want to retrieve all of the information contained within the book element, we would write the following XPath:

//book

That will pull in all of the details: title, author, year of publication; but what if we just want to query the titles for the books?

//book/title

This will retrieve only the specific elements labeled as “title”: The Great Gatsby, Huckleberry Finn

Maybe there are elements containing information we don’t want, so we look for an attribute one element has but another does not. Our XML contains several years identifying publication, but we want to know the publication dates for books published in England:

//book/yearPublished[@where="England"]
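
The three queries can be tried outside of Docs as well. Here is a sketch using Python's standard xml.etree.ElementTree module against the Fake XML. One assumption to note: ElementTree supports only a subset of XPath and expects a leading "." when searching from an element, so //book is written as .//book.

```python
import xml.etree.ElementTree as ET

fake_xml = """
<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <author>
      <firstName>Francis</firstName>
      <lastName>Fitzgerald</lastName>
    </author>
    <yearPublished>1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <author>
      <firstName>Mark</firstName>
      <lastName>Twain</lastName>
      <firstName type="real">Samuel</firstName>
      <lastName type="real">Clemens</lastName>
    </author>
    <yearPublished where="England">1884</yearPublished>
    <yearPublished where="United States">1885</yearPublished>
  </book>
</xml>
"""

root = ET.fromstring(fake_xml)

# //book : every book element, with everything nested inside it
books = root.findall(".//book")

# //book/title : only the titles
titles = [t.text for t in root.findall(".//book/title")]

# //book/yearPublished[@where="England"] : filter on an attribute value
england = [
    y.text
    for y in root.findall(".//book/yearPublished[@where='England']")
]

print(len(books))
print(titles)
print(england)
```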

27 of 31

Google Docs

In Docs we would use the importXML function to bring in the data:

=importxml(FAKEXMLURL,"//book/yearPublished[@where='England']")

  • // – this means select all elements of the type that follows, wherever they appear in the document
  • //book – this means select all book elements
  • /yearPublished – this means select all yearPublished elements directly inside the preceding element
  • [@where=''] – this means only select those elements that meet the criteria given – Example: look for yearPublished where="England"
  • //book/yearPublished[@where='England'] – this means only select elements that look like: <yearPublished where="England"> inside a <book>

source: Distilled

28 of 31

29 of 31

DOCS RESOURCES

  • A quick link to the underlying script functions and tutorials that power Google Docs
  • A primer for learning about and referencing the XML-based query tool: XPath.
  • How to make sense of JSON
    • http://json.org/
  • Firebug is a very popular browser add-on for exploring web pages in depth and seeing how information is organized within a web page. It makes creating XPath queries easier.
    • http://getfirebug.com
  • A tutorial for importing using ImportXML and XPath
  • Sending notifications when changes occur to Google Docs spreadsheets
  • App Sumo: Google Docs Unleashed ($25 to learn awesome stuff)

30 of 31

MORE LINKS

31 of 31

Even more links and resources