1 of 31

Demystifying Web Scraping

IRE 2012

Ted Han & Sean Sposito

2 of 31

Welcome to the Internet

publishing platform for the masses

3 of 31

Everything is published to the Internet

Well, not everything, but a lot.

4 of 31

WHAT YOU CAN SCRAPE

Government Databases
Crowd sourced content (IMDB, Wikipedia, Reddit)
Blogs
Newspaper Websites

Non-profit disclosures

5 of 31

All this data is available if you speak HTTP+HTML

(well, or if you have tools that speak HTTP+HTML)

6 of 31

Tools that Speak HTTP+HTML

Web Servers
Web Browsers and their web inspectors
Software tools such as Google Docs, Outwit Hub, or browser plugins
Programming libraries like jQuery, Mechanize, Scrapy, and many more.

7 of 31

Translating the Internet

Or, a user's guide to web scraping and crawling tools.

8 of 31

Cultural Context

What you should know about the world of

9 of 31

Hypertext

Like text, you know, but hyper

10 of 31

HTTP

Hyper Text Transfer Protocol

(e.g. how web browsers and servers communicate)

11 of 31

Clients make requests

Requests are made in the form of a URL + some additional information
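As a rough illustration of "a URL + some additional information", here is a minimal Python sketch that builds a request without sending it. The header values are illustrative assumptions, not something the slides prescribe.

```python
from urllib.request import Request

# Build (but do not send) a request for the IRE conferences page.
# The header values are illustrative assumptions, not required ones.
request = Request(
    "http://ire.org/conferences/",
    headers={
        "User-Agent": "example-scraper/0.1",
        "Accept": "text/html",
    },
)

print(request.get_method())  # the HTTP verb; GET when no body is attached
print(request.full_url)      # the URL being requested
for name, value in request.header_items():
    # The "additional information": headers that travel with the URL.
    print(f"{name}: {value}")
```

The URL says what is wanted; the headers tell the server who is asking and which formats the client would like back.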

12 of 31

URLs specify what is being requested

13 of 31

http://ire.org/conferences/

scheme / protocol

Domain

Path

14 of 31

https://www.google.com/search?q=investigative+reporters+%26+editors&hl=en

Domain

Protocol

Path

Query

15 of 31

https://twitter.com/#!/knowtheory

Domain

Protocol

Path

Fragment Identifier
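The same anatomy can be checked in code. This is a small sketch using Python's standard urllib.parse module to split the three example URLs into the labeled parts.

```python
from urllib.parse import urlsplit, parse_qs

urls = [
    "http://ire.org/conferences/",
    "https://www.google.com/search?q=investigative+reporters+%26+editors&hl=en",
    "https://twitter.com/#!/knowtheory",
]

for url in urls:
    parts = urlsplit(url)
    print(url)
    print("  scheme / protocol:", parts.scheme)
    print("  domain:", parts.netloc)
    print("  path:", parts.path)
    # parse_qs decodes "+" and "%26" back into a space and "&".
    print("  query:", parse_qs(parts.query))
    print("  fragment identifier:", parts.fragment)
```

Note that in the Twitter URL the path is just "/"; everything after the "#" is the fragment identifier, which the browser keeps to itself rather than sending to the server.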

16 of 31

HTML

Hyper Text Markup Language

(The language websites are expressed in)

17 of 31

It's a markup language

Conceived of as a tool to annotate text

18 of 31

It's nested / tree-like

And we can point to branches in the tree

with tools like CSS selectors, or XPath
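A minimal sketch of the tree idea, using Python's built-in ElementTree. The page fragment is made up for illustration, and ElementTree only handles well-formed markup and a limited subset of XPath, so real-world pages usually call for a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page fragment (real-world HTML is often messier).
page = """
<html>
  <body>
    <div id="events">
      <ul>
        <li class="conference">IRE 2012</li>
        <li class="workshop">Web scraping</li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)


def show(element, depth=0):
    """Print the tree, one tag per line, indented by nesting depth."""
    print("  " * depth + element.tag)
    for child in element:
        show(child, depth + 1)


show(root)

# Point at one branch: every <li> with class="conference", anywhere in the tree.
# A roughly equivalent CSS selector would be: li.conference
conference_items = root.findall(".//li[@class='conference']")
print([item.text for item in conference_items])
```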

19 of 31

DOCS OVERVIEW

Google Docs is a suite of easy and powerful applications for the office, and this tip sheet will explain how to import data from sources on the Internet into a spreadsheet.

While Google Docs cannot currently compete with full programming tools such as Python or Ruby on Rails, users of all skill levels are attracted to the service for the ease of creating and launching full-featured applications.

Beginning data workers will find plenty of tutorials within the documentation for adding basic functionality to their spreadsheets and mashing data up in new ways.
Experienced users of spreadsheet software such as Microsoft Excel will appreciate the fluid crossover of services and the familiar workings of various spreadsheet functions.
Advanced users versed in programming or scripting languages can harness the most powerful features of Google Docs and Apps Script for RESTful interaction with websites and for rapid prototyping of software applications.

20 of 31

Three Functions

ImportFEED

ImportHTML

ImportXML

21 of 31

=importFEED

22 of 31

=importHTML

23 of 31

=importXML

24 of 31

FAKE XML

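A typical book might be represented in XML like so:

<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <author>
      <firstName>Francis</firstName>
      <lastName>Fitzgerald</lastName>
    </author>
    <yearPublished where="United States">1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <author>
      <lastName>Twain</lastName>
      <firstName type="real">Samuel</firstName>
      <lastName type="real">Clemens</lastName>
    </author>
    <yearPublished where="England">1884</yearPublished>
  </book>
</xml>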

25 of 31

THE BREAK-DOWN

To break this down:

<book> is an element; every element must have a beginning and an ending, and elements can be “nested” within other elements. The nested elements are sometimes described as forming a “tree”, with each nested element representing a branch.
type="fiction" is an attribute; it describes something about the element. An element can carry several attributes, and they can be given any name allowed by the Document Type Definition or other user-defined rules.
text is contained between the element tags; here it is the name of the book (The Great Gatsby), the author (Fitzgerald) and the year published (1925).
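The three pieces named above (element, attribute, text) can be inspected directly. This is a sketch using Python's ElementTree on a single book; the markup is a cleaned-up assumption based on the description in the text, since only the type attribute, the names and the year 1925 are given there.

```python
import xml.etree.ElementTree as ET

# One book from the Fake XML; the type attribute and yearPublished element
# follow the description in the text and are otherwise assumptions.
fake_xml = """
<book type="fiction">
  <title>The Great Gatsby</title>
  <author>
    <firstName>Francis</firstName>
    <lastName>Fitzgerald</lastName>
  </author>
  <yearPublished>1925</yearPublished>
</book>
"""

book = ET.fromstring(fake_xml)

print(book.tag)                           # the element's name
print(book.attrib)                        # its attributes, as a dictionary
print([child.tag for child in book])      # nested elements (the branches)
print(book.find("title").text)            # text between the <title> tags
print(book.find("author/lastName").text)  # text two levels down the tree
print(book.find("yearPublished").text)
```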

26 of 31

XPATH

Docs uses a language called XPath, which addresses parts of an XML document.

Now, if we use our Fake XML above and want to retrieve all of the information contained within the book element, we would write the following XPath:

//book

That will pull in all of the details: title, author, year of publication; but what if we just want to query the titles for the books?

//book/title

This will retrieve only the specific elements labeled as “title”: The Great Gatsby, Huckleberry Finn

Maybe there are elements containing information we don’t want, so we look for an attribute one element has but another does not. Our XML contains several years identifying publication, but we want to know the publication dates for books published in England:

//book/yearPublished[@where="England"]
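The three queries can be tried outside a spreadsheet as well. Below is a sketch with Python's ElementTree, which supports only a subset of XPath and wants relative paths, hence the leading "." on each query. The where values and the year 1884 are assumptions added for illustration; the text only mentions 1925 and England.

```python
import xml.etree.ElementTree as ET

# Trimmed-down Fake XML. The where values and the year 1884 are assumptions.
fake_xml = """
<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <yearPublished where="United States">1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <yearPublished where="England">1884</yearPublished>
  </book>
</xml>
"""

root = ET.fromstring(fake_xml)

# //book : every book element, with everything nested inside it
books = root.findall(".//book")
print(len(books))

# //book/title : only the title elements
titles = [title.text for title in root.findall(".//book/title")]
print(titles)

# //book/yearPublished[@where='England'] : filter on an attribute value
english_years = [
    year.text
    for year in root.findall(".//book/yearPublished[@where='England']")
]
print(english_years)
```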

27 of 31

Google Docs

In Docs we would use the importXML function to bring in the data:

=importxml(FAKEXMLURL,"//book/yearPublished[@where='England']")

// – this means select all elements of the type that follows, anywhere in the document
//book – this means select all book elements
/yearPublished – this means select all yearPublished elements inside them
[@where=''] – this means only select those elements that meet the criteria given. Example: look for yearPublished elements whose where attribute is 'England'
//book/yearPublished[@where='England'] – this means only select elements that look like: <book><yearPublished where="England">

source: Distilled
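For comparison, here is a rough Python stand-in for the spreadsheet formula. The import_xml helper is hypothetical (it is not how Google Docs implements importXML), the XML contents are partly assumed, and a data: URL takes the place of FAKEXMLURL so the sketch runs without a network connection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote
from urllib.request import urlopen


def import_xml(url, xpath):
    """Hypothetical stand-in for =importxml(url, xpath): fetch, parse, query."""
    with urlopen(url) as response:
        root = ET.fromstring(response.read())
    # ElementTree only accepts relative paths, so "//book" becomes ".//book".
    if xpath.startswith("/"):
        xpath = "." + xpath
    return [element.text for element in root.findall(xpath)]


# Trimmed-down Fake XML; the where values and the year 1884 are assumptions.
fake_xml = """
<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <yearPublished where="United States">1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <yearPublished where="England">1884</yearPublished>
  </book>
</xml>
""".strip()

# A data: URL embeds the document itself, standing in for FAKEXMLURL.
fake_xml_url = "data:application/xml," + quote(fake_xml)

print(import_xml(fake_xml_url, "//book/yearPublished[@where='England']"))
print(import_xml(fake_xml_url, "//book/title"))
```

Swapping fake_xml_url for a real http:// address would make the helper fetch a live document, as long as that document is well-formed XML.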

28 of 31

THE WEATHER EXAMPLE

29 of 31

DOCS RESOURCES

A quick link to the underlying script functions and tutorials that power Google Docs

http://code.google.com/googleapps/appsscript/articles/appengine.html

A primer for learning about and referencing the XML-based query tool XPath.

http://www.w3schools.com/xpath/

How to make sense of JSON

http://json.org/

Firebug is a very popular browser add-on for exploring web pages in depth and seeing how information is organized within a web page. It makes creating XPath queries easier.

http://getfirebug.com

A tutorial for importing using ImportXML and XPath

http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/

Sending notifications when changes occur in Google Docs spreadsheets

http://www.labnol.org/internet/monitor-web-pages-changes-with-google-docs/4536/

App Sumo: Google Docs Unleashed ($25 to learn awesome stuff)

http://www.appsumo.com/googledocsunleashed/

30 of 31

31 of 31

Even more links and resources

FDA Import Refusal scraper
FDA Import Refusal Scraper in ScraperWiki
The Google Chrome Scraper plugin
Another XPath tutorial