1 of 16

Web Scraping with Google Docs

Acton H. Gorton, University of Illinois

Sean Sposito, American Banker

Twitter:

@tin_acton

@seansposito

2 of 16

Background Information

Some Basics

  • Best if familiarity with HTML/XML
  • A way to look at HTML documents
  • Not afraid to get into a little bit of code

3 of 16

A bit about HTML/XML

HTML and XML follow the same rules for building web friendly documents:

Some basic HTML markup:

<html>

<head></head>

<body></body>

</html>

4 of 16

Looking at HTML documents

We need to understand the HTML structure.

Easy to do with Google Chrome or Firefox.

5 of 16

The Secret Sauce: XPath

View Raw HTML

  • Use Chrome's built in inspector
  • Use Firebug for Firefox

Find an XPath with a Tool

6 of 16

Google Chrome's Inspector

7 of 16

Questions so far?

We've talked about:

  • HTML/XML
  • Viewing HTML
  • Finding XPath

Next, I'll explain what we can do.

8 of 16

Demonstration

  • Choose a website:
    • City of Champaign, Bid Results
  • Locate XPath
    • Right (ctrl) click on target
    • Choose "Inspect Element"
    • Use "XPath" Extension for Chrome
    • Retrieve results from Console�
  • Use Google Docs to load results

9 of 16

Let's Walk Through Google Docs

  • Now its called, "Google Drive"� (whatever)
  • Free, cloud based�
  • Very similar to Microsoft Excel�
  • Uses "Functions" and "Scripts"

10 of 16

Create a Spreadsheet

  • Go to: drive.google.com
  • Click "Create" in top/left�
  • Choose "Spreadsheet"�
  • Look at "Help" section
    • Check out the "Function List"

11 of 16

What Function to Use?

Remember XML and XPath?

We're going to use the XPath we copied earlier with the "ImportXML" function to load the results into our spreadsheet.

12 of 16

ImportXML

=importXml("http://ci.champaign.il.us/departments/police/police-reports-and-statistics/";"//div[contains(concat(' ', @class, ' '), ' entry-content ')]//ul//li//a/@href ")

Let's break this down into its components:

Component

What it does

=

Tells Google Spreadsheet that a function is coming

importXml

The name of the function

( ) Parenthesis

What we want the function to do

" " Quotation Marks

Contains specific information for the function

URL

Where to find the information: http://archive.ci.champaign...

; Semicolon

Separates the specific information

Xpath

What to look for in the HTML from the given URL

13 of 16

Questions so far?

We've talked about:

  • Using XPath
  • Creating a Google Spreadsheet
  • Spreadsheet Functions

Next, we'll see some examples

14 of 16

Examples

Champaign City Website

  • https://docs.google.com/spreadsheet/ccc?key=0ArTSE1h9T9A5dERRY24xclk5S2ljWHVHYWVTcVBXYkE#gid=2

Create Notifications

  • Tools > Notification Rules > Set the Criteria and Save
  • Create a script to run the ImportXml function every hour
    • Tools > Script Editor
    • If first time, choose blank canvas
    • Give a name
    • Write a script to run the function - example on next slide

15 of 16

Checking a function script

This is a very crude script to activate a sheet and a specific cell within a spreadsheet:

function runFunction() {

var ss = SpreadsheetApp.getActiveSpreadsheet();

ss.setActiveSheet(ss.getSheets()[1]);

ss.B3;

}

// [1] is the sheet position, in order of 0,1,2,3,...

// B3 is the cell

script location

16 of 16

Helpful Links & Resources

Xpath Tutorial

http://www.w3schools.com/xpath/xpath_syntax.asp

Google Spreadsheet Functions

https://support.google.com/drive/bin/static.py?hl=en&topic=25273&page=table.cs