1 of 45

I Need Data for My Story.

Help.

http://bit.ly/ona-data-help

2 of 45

Three common scenarios

I don’t know if the data exists

Data exists but what format do I want it in?

Data exists but it’s playing hard-to-get

3 of 45

I don’t know if the data exists

Make a phone call. Send an email. Make a friend.

The data evangelist at an agency is one of the best sources for learning whether data exists and how to access it.

Find the form to submit or search.

Submission forms don't work without a database. A database needs data.

Search.

Advanced search. Site search. File search. Site Map.

4 of 45

I don’t know if the data exists

Has someone else been stuck like you?

5 of 45

I don’t know if the data exists

Has someone else already found your data?

6 of 45

I don’t know if the data exists

Has someone else already bookmarked your data?

bit.ly/ona-data-sources

7 of 45

Data exists but what format do I want it in?

Know your spreadsheets.

.csv: Plain text files that store information in a simple format. Each line is a record, and each field within a record is separated by a comma. You can open a .csv in a text editor or in Excel/OpenOffice, or import it into a database like MySQL or Postgres.

.xls/.xlsx: When we think of spreadsheets, most of the time we're thinking of the files that Excel produces.
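A minimal sketch of the "each line is a record, each field is separated by a comma" idea, using Python's standard csv module. The sample data and column names are made up for illustration.

```python
import csv
import io

# Hypothetical sample data. A real file would be opened with
# open("file.csv", newline="") instead of io.StringIO.
SAMPLE_CSV = "name,city,amount\nAlice,Portland,120\nBob,Austin,75\n"


def read_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(SAMPLE_CSV)
for record in records:
    # Every value comes back as a string, including the numbers.
    print(record["name"], record["city"], record["amount"])
```

Note that every field is read as text, so numeric columns need converting before you do math on them.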

8 of 45

Data exists but what format do I want it in?

9 of 45

Data exists but what format do I want it in?

PDFs aren't bad, just cumbersome.

Native PDFs: Usually saved out of an application like Word or Excel as a PDF. Working with these is fairly straightforward because the text is stored in the file, so you can usually select or extract it without Optical Character Recognition.

Scanned PDFs: If the data or document you requested comes from a paper record, it will likely be scanned and saved as a PDF. That PDF is an image of the page and needs Optical Character Recognition before the text can be extracted. Knowing this ahead of time will save you time and grief.

10 of 45

Data exists but what format do I want it in?

Get out of my PDF and into my spreadsheet.

Tabula: Free and open source thanks to the Mozilla Open News fellows.

ScraperWiki: Now with a PDF table extractor.

DocumentCloud: Not so much for spreadsheet-type data, but something you should be using. Free and open source thanks to IRE.

CometDocs: Online document converter. IRE members get a free premium account.

PDF to Excel: From a company called Wondershare. Free trial, otherwise $20. Does a good job.

11 of 45

Data exists but what format do I want it in?

Chasin' JSON for news applications.

JSON: Text-based data format widely used in web development. It stores data as a series of keys, each paired with a value.
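A minimal sketch of what "keys that contain values" looks like in practice, using Python's standard json module. The sample record is hypothetical.

```python
import json

# Hypothetical JSON text, shaped like something a news app might consume.
SAMPLE_JSON = '{"name": "Alice", "city": "Portland", "tags": ["news", "data"]}'

# JSON text becomes a Python dict: look values up by key.
record = json.loads(SAMPLE_JSON)
print(record["city"])
print(record["tags"][0])

# And a dict can be written back out as JSON text.
round_trip = json.dumps(record, indent=2)
print(round_trip)
```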

12 of 45

Data exists but what format do I want it in?

Get your data tidy and clean.

OpenRefine: A masterful application used to clean, filter, merge and export datasets.

Mr. Data Converter: Converts Excel or .csv files into HTML or JSON.

Data Wrangler: Interactive tool for data cleaning and transformation.

csvkit: An open-source suite of command-line tools, written in Python, for analyzing and sorting .csv files.
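For a sense of what a clean-and-convert step involves, here is a rough sketch that trims stray whitespace from a .csv and converts it to JSON using only the Python standard library. It is not how any of the tools above work internally, and the sample data is made up.

```python
import csv
import io
import json

# Hypothetical messy input: stray spaces around headers and values.
SAMPLE_CSV = "name, amount \n alice ,120\nBob,75\n"


def csv_to_json(text):
    """Strip whitespace from every header and cell, then emit JSON."""
    reader = csv.reader(io.StringIO(text))
    header = [column.strip() for column in next(reader)]
    rows = [
        dict(zip(header, (cell.strip() for cell in row)))
        for row in reader
    ]
    return json.dumps(rows)


converted = csv_to_json(SAMPLE_CSV)
print(converted)
```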

13 of 45

Data exists but it’s playing hard-to-get.

Becoming a web detective

15 of 45

Data exists but it’s playing hard-to-get.

Let’s try it!

16 of 45

Data exists but it’s playing hard-to-get.

The web detective’s toolkit:

Right-click, view source

Chrome/Firefox web consoles

Wireshark: http://www.wireshark.org/

Fiddler: http://www.fiddler2.com/

17 of 45

Data exists but it’s playing hard-to-get.

APIs

Websites: beautiful to humans, ugly to computers

18 of 45

Data exists but it’s playing hard-to-get.

APIs

API results: ugly to humans, beautiful to computers

19 of 45

Data exists but it’s playing hard-to-get.

APIs are a limited guest pass into someone else’s database.

APIs are good for:

Getting data in bulk

Getting data that’s otherwise not public

Getting live data

20 of 45

Data exists but it’s playing hard-to-get.

APIs - the general idea

Send a request, get a response.

Make your request more specific with parameters.

Read the documentation or specification for details.

21 of 45

Data exists but it’s playing hard-to-get.

APIs - GET vs. POST

GET requests can be done from your browser.

https://api.twitter.com/1.1/search/tweets.json?q=burritos&lang=en&count=100

Parameters:

q=burritos

lang=en

count=100
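A small sketch of how those parameters become part of a GET URL, using Python's standard urllib.parse. It only builds and inspects the URL and does not send a request (the real API would also expect authentication).

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.twitter.com/1.1/search/tweets.json"

# The same three parameters shown on the slide.
params = {"q": "burritos", "lang": "en", "count": 100}

# urlencode joins the pairs as key=value&key=value and escapes special characters.
url = BASE_URL + "?" + urlencode(params)
print(url)

# Going the other way: pull the parameters back out of a URL.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```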

22 of 45

Data exists but it’s playing hard-to-get.

APIs - GET vs. POST

POST requests send parameters in the body of the request instead of the URL, so you won't see them in the address bar. They require a little bit of code or other tools.

https://api.twitter.com/1.1/search/tweets.json

Parameters are not part of the URL.
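A sketch of the "little bit of code" a POST request takes, using Python's standard urllib. The endpoint below is a hypothetical placeholder, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
ENDPOINT = "https://example.com/api/search"

params = {"q": "burritos", "lang": "en"}

# The parameters are encoded into the request body, not the URL.
body = urlencode(params).encode("utf-8")

# Supplying data makes urllib treat this as a POST request.
request = Request(ENDPOINT, data=body)

print(request.get_method())
print(request.full_url)
print(request.data)

# To actually send it, you would call urllib.request.urlopen(request).
```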

23 of 45

Data exists but it’s playing hard-to-get.

Caveats:

They can be changed or shut down at any time.

Sometimes they cost money.

Sometimes there are rate limits.

Sometimes you need a key.

They usually speak JSON or XML.

24 of 45

Data exists but it’s playing hard-to-get.

Who has an API?

New York Times, Yelp, Twitter, Flickr, Foursquare, Instagram, LinkedIn, Vimeo, Tumblr, Facebook, Google+, YouTube, and many more

http://www.programmableweb.com/apis/directory/

25 of 45

Data exists but it’s playing hard-to-get.

If an API is a guest pass, a scraper is a crowbar.

26 of 45

Data exists but it’s playing hard-to-get.

Scraping - prerequisites

A scripting language of choice (Python, Ruby, PHP)

Familiarity with HTML and CSS

Regular expressions, if you’re unlucky

^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$
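A quick sketch of that pattern at work with Python's re module. The candidate strings are made up. Note that the pattern checks only the shape of a date, not whether the date is real.

```python
import re

# The pattern from the slide: 1-2 digits, slash, 1-2 digits, slash, 4 digits.
DATE_PATTERN = re.compile(r"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$")

# Hypothetical strings a scraper might run into.
candidates = ["9/26/2013", "09/26/2013", "2013-09-26", "99/99/9999", "9/26/13"]

matches = [text for text in candidates if DATE_PATTERN.match(text)]
print(matches)
```

"99/99/9999" gets through because it has the right shape, which is one reason to spot-check results.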

27 of 45

Data exists but it’s playing hard-to-get.

Scraping - the diligent idiot

1. Find the table

2. Find every row in the table

3. Find the third cell in each row

4. Print the contents of that cell

28 of 45

Data exists but it’s playing hard-to-get.

Scraping - Python flavor

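# page is assumed to be a parsed BeautifulSoup document
table = page.find("table")

for row in table.find_all("tr"):
    # look inside the current row, not the whole table
    third_cell = row.find_all("td")[2]
    print(third_cell.text)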
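For a version that runs without installing anything, here is a sketch of the same four steps using only html.parser from the Python standard library. The sample HTML is made up, and this simple parser assumes one flat table with no nested tables.

```python
from html.parser import HTMLParser

# Hypothetical page content, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td>Alice</td><td>Portland</td><td>120</td></tr>
  <tr><td>Bob</td><td>Austin</td><td>75</td></tr>
</table>
"""


class ThirdCellParser(HTMLParser):
    """Collect the text of the third <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1
        self.third_cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = -1          # new row: restart the cell count
        elif tag == "td":
            self.cell_index += 1
            self.in_cell = True
            if self.cell_index == 2:      # third cell has index 2
                self.third_cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 2:
            self.third_cells[-1] += data


parser = ThirdCellParser()
parser.feed(SAMPLE_HTML)
third_cells = parser.third_cells
for cell in third_cells:
    print(cell)
```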

29 of 45

Data exists but it’s playing hard-to-get.

Scraping - PHP flavor

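// assumes $page is a Simple HTML DOM object
$table = $page->find("table", 0); // first table on the page

foreach ($table->find("tr") as $row) {
    $third_cell = $row->find("td", 2); // third cell (index 2) in this row
    echo $third_cell->innertext;
}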

30 of 45

Data exists but it’s playing hard-to-get.

Scraping - Ruby flavor

table = page.search("table")

table.search("tr").each do |row|
  third_cell = row.search("td")[2]
  print third_cell.inner_html
end

31 of 45

Data exists but it’s playing hard-to-get.

Scraping - false positives vs. false negatives

^[A-Z][a-z]+\s[A-Z][a-z]+$

Intended to match a first name and a last name.

False positives: South Dakota, Lockheed Martin, Christmas Eve, Boston Globe

False negatives: Jean-Claude Van Damme, Ian McKellen, Bono, Sammy Davis, Jr.
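The examples on this slide can be checked directly against the pattern with Python's re module:

```python
import re

# The name pattern from the slide: Capitalized word, space, Capitalized word.
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+\s[A-Z][a-z]+$")

# Not people, but shaped like a first name and a last name.
not_names = ["South Dakota", "Lockheed Martin", "Christmas Eve", "Boston Globe"]

# Real people whose names don't fit the shape.
real_names = ["Jean-Claude Van Damme", "Ian McKellen", "Bono", "Sammy Davis, Jr."]

false_positives = [text for text in not_names if NAME_PATTERN.match(text)]
false_negatives = [text for text in real_names if not NAME_PATTERN.match(text)]

print("Matched but shouldn't have:", false_positives)
print("Should have matched but didn't:", false_negatives)
```

Loosening the pattern to catch the false negatives tends to let in more false positives, and vice versa.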

32 of 45

Data exists but it’s playing hard-to-get.

Always Be Spot-Checking

33 of 45

Data exists but it’s playing hard-to-get.

Easier to scrape

Single page

Public

Static

Lists and tables

Mobile-optimized

Harder to scrape

Multiple pages

Login required

JavaScript-heavy

Flash

Complex text patterns

34 of 45

Data exists but it’s playing hard-to-get.

ScraperWiki

35 of 45

Data exists but it’s playing hard-to-get.

ScraperWiki

Large library of existing scrapers.

No server required.

Download results in a table.

Usually requires some coding.

Sometimes costs money.

36 of 45

Data exists but it’s playing hard-to-get.

Web tables into Excel

37 of 45

Data exists but not all in one place.

Crowdsourcing

Ask the audience

Twitter hivemind

Google forms

Mechanical Turk/Crowdflower ($$$)

38 of 45

Data exists but not all in one place.

Free The Files (ProPublica)

39 of 45

Data exists but not all in one place.

You can play, too!

https://github.com/propublica/transcribable

40 of 45

Data exists but not all in one place.

Beware of biases and errors introduced by crowdsourcing.

43 of 45

Data exists but not all in one place.

Don’t be afraid to bake data from scratch.

(Data always tastes better when it’s homemade)

45 of 45

Questions?

Chris Keller Noah Veltman

@ChrisLKeller @veltman

http://bit.ly/ona-data-help