Demystifying Web Scraping
IRE 2012
Ted Han & Sean Sposito
Welcome to the Internet
publishing platform for the masses
Everything is published to the Internet
Well, not everything, but a lot.
WHAT YOU CAN SCRAPE
Non-profit disclosures
All this data is available if you speak HTTP+HTML
(well, or if you have tools that speak HTTP+HTML)
Tools that Speak HTTP+HTML
Or, a user's guide to web scraping and crawling tools.
Cultural Context
What you should know about the world of
Hypertext
Like text, you know, but hyper
HTTP
Hyper Text Transfer Protocol
(i.e. how web browsers and servers communicate)
Clients make requests
Requests are made in the form of a URL + some additional information
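To make "a URL + some additional information" concrete, here is a minimal sketch using Python's standard library. It only builds a request object and never sends it. The User-Agent value is a made-up client name, chosen purely for illustration.

```python
from urllib.request import Request

# Build (but do not send) a request. Nothing goes over the network here.
request = Request(
    "http://ire.org/conferences/",
    # Hypothetical client name; headers are the "additional information".
    headers={"User-Agent": "example-scraper/0.1"},
)

print(request.get_method())    # the HTTP verb the client would use
print(request.full_url)        # the URL being requested
print(request.host)            # the server the request would go to
print(request.header_items())  # extra information sent along with the URL
```

Sending the request would be a separate step (for example with urllib.request.urlopen), left out here so the sketch runs without a network connection.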
URLs specify what is being requested
http://ire.org/conferences/
Scheme / protocol: http
Domain: ire.org
Path: /conferences/
https://www.google.com/search?q=investigative+reporters+%26+editors&hl=en
Protocol: https
Domain: www.google.com
Path: /search
Query: q=investigative+reporters+%26+editors&hl=en
https://twitter.com/#!/knowtheory
Protocol: https
Domain: twitter.com
Path: /
Fragment Identifier: #!/knowtheory
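The same breakdown can be done programmatically. This is a small sketch using Python's standard urllib.parse module on the three URLs above. The variable names are arbitrary.

```python
from urllib.parse import urlsplit, parse_qs

urls = [
    "http://ire.org/conferences/",
    "https://www.google.com/search?q=investigative+reporters+%26+editors&hl=en",
    "https://twitter.com/#!/knowtheory",
]

# Split each URL into protocol (scheme), domain, path, query and fragment.
for url in urls:
    parts = urlsplit(url)
    print("protocol:", parts.scheme)
    print("domain:  ", parts.netloc)
    print("path:    ", parts.path)
    print("query:   ", parts.query)
    print("fragment:", parts.fragment)
    print()

# The query string is itself a set of key=value pairs; parse_qs decodes
# "+" into spaces and "%26" into "&".
params = parse_qs(urlsplit(urls[1]).query)
print(params["q"])
print(params["hl"])
```

Note that in the Twitter URL everything after the # is the fragment, so the path the server sees is just "/".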
HTML
Hyper Text Markup Language
(The language websites are expressed in)
It's a markup language
Conceived of as a tool to annotate text
It's nested / tree-like
And we can point to branches in the tree
with tools like CSS selectors, or XPath
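As a rough illustration of "pointing to a branch", the sketch below parses a tiny, made-up page and selects the links inside one list. The page content, class name and link paths are all invented for this example. It uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath and only accepts well-formed markup, so treat it as a demonstration of the idea rather than a recommended scraping tool; real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page. Indentation shows the nesting (the "tree").
page = """
<html>
  <body>
    <h1>Events</h1>
    <ul class="events">
      <li><a href="/events/first/">First event</a></li>
      <li><a href="/events/second/">Second event</a></li>
    </ul>
    <ul class="other">
      <li><a href="/about/">About</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(page)

# Walk down one branch: any <ul class="events">, then its <li>, then <a>.
links = root.findall(".//ul[@class='events']/li/a")

for link in links:
    print(link.text, link.get("href"))
```

The rough CSS selector equivalent of that path would be ul.events > li > a.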
DOCS OVERVIEW
Google Docs is a suite of easy and powerful applications for the office, and this tip sheet will explain how to import data from sources on the Internet into a spreadsheet.
While Google Docs cannot currently compete with more powerful scripting languages such as Python or Ruby, users of all skill levels are attracted to the service for the ease of creating and launching full-featured applications.
Three Functions
ImportFEED
ImportHTML
ImportXML
FAKE XML
A typical book might be represented in XML like so:
<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <author>
      <firstName>Francis</firstName>
      <lastName>Fitzgerald</lastName>
    </author>
    <yearPublished>1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <author>
      <firstName>Mark</firstName>
      <lastName>Twain</lastName>
      <firstName type="real">Samuel</firstName>
      <lastName type="real">Clemens</lastName>
    </author>
    <yearPublished where="England">1884</yearPublished>
    <yearPublished where="United States">1885</yearPublished>
  </book>
</xml>
THE BREAK-DOWN
To break this down:
XPATH
Docs uses a language called XPath, which addresses parts of an XML document.
Now, if we take our Fake XML above and want to retrieve all of the information contained within each book element, we would write the following XPath:
//book
That will pull in all of the details: title, author, year of publication; but what if we just want to query the titles for the books?
//book/title
This will retrieve only the elements named “title”: The Great Gatsby and Huckleberry Finn.
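For readers who want to try these two queries outside of a spreadsheet, here is a minimal sketch in Python using the built-in xml.etree.ElementTree module. It uses a trimmed copy of the Fake XML (authors omitted for brevity). One difference to be aware of: ElementTree supports only a subset of XPath and expects paths relative to the element you search from, so //book is written as .//book.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Fake XML (author elements left out to keep this short).
fake_xml = """<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <yearPublished>1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <yearPublished where="England">1884</yearPublished>
    <yearPublished where="United States">1885</yearPublished>
  </book>
</xml>"""

root = ET.fromstring(fake_xml)

# //book : every book element, with everything nested inside it.
books = root.findall(".//book")
print(len(books))

# //book/title : only the title elements that sit directly under a book.
titles = [title.text for title in root.findall(".//book/title")]
print(titles)
```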
Maybe there are elements containing information we don’t want, so we look for an attribute that one element has but another does not. Our XML lists more than one publication year for some books, but suppose we only want the years in which books were published in England:
//book/yearPublished[@where="England"]
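The attribute filter can be tested the same way in Python. This is a sketch under the same assumptions as before: a trimmed copy of the Fake XML, the built-in xml.etree.ElementTree module (which supports this kind of [@attribute='value'] test), and a leading dot on the path because ElementTree searches relative to an element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Fake XML (author elements left out to keep this short).
fake_xml = """<xml>
  <book type="fiction">
    <title>The Great Gatsby</title>
    <yearPublished>1925</yearPublished>
  </book>
  <book type="fiction">
    <title>Huckleberry Finn</title>
    <yearPublished where="England">1884</yearPublished>
    <yearPublished where="United States">1885</yearPublished>
  </book>
</xml>"""

root = ET.fromstring(fake_xml)

# Keep only yearPublished elements whose "where" attribute equals "England".
england_years = [
    year.text
    for year in root.findall(".//book/yearPublished[@where='England']")
]
print(england_years)
```

The Great Gatsby's year is skipped because its yearPublished element has no where attribute at all.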
Google Docs
In Docs we would use the importXML function to bring in the data:
=importxml(FAKEXMLURL,"//book/yearPublished[@where='England']")
source: Distilled
DOCS RESOURCES
MORE LINKS
The Simple Way to Scrape an HTML Table: Google Docs (Via EagerEyes.Org)
Scraping Data from a List of Webpages Using Google Docs (Via OnlineJournalismBlog.Org)
A Much Better Slide Show and Tutorial on Google Docs Web Scraping (Via SEER Interactive)
AND:
The ImportXML Guide for Google Docs
XPATH GENERATOR FOR GOOGLE CHROME!!!!
Even more links and resources