I Need Data for My Story.
Help.
http://bit.ly/ona-data-help
Three common scenarios
I don’t know if the data exists
Data exists but what format do I want it in?
Data exists but it’s playing hard-to-get
I don’t know if the data exists
Make a phone call. Send an email. Make a friend.
An agency's data evangelist is often the best person to ask whether data exists and how to access it.
Find the form to submit or search.
Submission forms don't work without a database. A database needs data.
Search.
Advanced search. Site search. File search. Site Map.
Has someone else been stuck like you?
Has someone else already found your data?
Data exists but what format do I want it in?
Know your spreadsheets.
.csv: Plain text files that store tabular data. Each line is a record, and each field within a record is separated by a comma. You can open a .csv in a text editor or in Excel/OpenOffice, or load it into a database like MySQL or Postgres.
.xls/.xlsx: When we think of spreadsheets, most of the time we're thinking of the files that Excel produces.
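To make the .csv description concrete, here is a minimal Python sketch that reads comma-separated text with the standard library's csv module. The sample rows and column names are made up for illustration.

```python
import csv
import io

# Made-up sample: each line is a record, each field is separated by a comma.
# A field that itself contains a comma is wrapped in double quotes.
SAMPLE_CSV = """name,department,salary
Jane Doe,Parks,50000
"Smith, John",Police,62000
"""


def read_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(SAMPLE_CSV)
for record in records:
    print(record["name"], record["salary"])
```

Every value comes back as a string, so numbers such as the salary need converting before you do math on them.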
PDFs aren't bad, just cumbersome.
Native PDFs: Usually saved out of an application like Word or Excel as a PDF. Working with these is fairly straightforward because the text is stored in the file and can usually be extracted directly, without Optical Character Recognition (OCR).
Scanned PDFs: If the data or document you requested comes from a paper record, it will likely be scanned in and saved as a PDF. A scan is a picture of the page, so it needs OCR before the text can be extracted. Knowing this ahead of time will save you time and grief.
Get out of my PDF and into my spreadsheet.
Tabula: Free and open source thanks to the Mozilla Open News fellows.
ScraperWiki: Now with a PDF table extractor.
DocumentCloud: Not so much for spreadsheet-type data, but something you should be using. Free and open source thanks to IRE.
CometDocs: Online document converter. IRE members get a free premium account.
PDF to Excel: From a company called Wondershare. Free trial, otherwise $20 and does a good job.
Chasin' JSON for news applications.
JSON: Data format used to store data for web development. It stores data as a series of keys that contain values.
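As a small illustration of keys that contain values, the sketch below parses a JSON string with Python's standard json module. The keys and values are invented for the example.

```python
import json

# Made-up example: keys ("name", "beats", "stories") each contain a value.
SAMPLE_JSON = '{"name": "Jane Doe", "beats": ["parks", "police"], "stories": 12}'

data = json.loads(SAMPLE_JSON)  # JSON text -> Python dict
print(data["name"])
print(data["beats"][0])

# Going the other way: Python dict -> JSON text
print(json.dumps(data, indent=2))
```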
Get your data tidy and clean.
OpenRefine: A masterful application used to clean, filter, merge and export datasets.
Mr. Data Converter: Converts Excel or .csv files into HTML or JSON.
Data Wrangler: Interactive tool for data cleaning and transformation.
csvkit: An open-source tool, written in Python, for analyzing and sorting .csv files.
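For a sense of what cleaning and converting looks like in code, here is a rough Python sketch that trims stray whitespace from a .csv and writes it out as JSON. It uses only the standard library and is not how any of the tools above are implemented; the function name and sample data are made up.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text to a JSON array of objects, trimming stray whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        {key.strip(): value.strip() for key, value in row.items()}
        for row in reader
    ]
    return json.dumps(rows, indent=2)


# Made-up sample with messy spacing around the header and values
messy = "name, agency\n Jane Doe ,Parks\nJohn Smith,  Police \n"
print(csv_to_json(messy))
```

The sketch assumes every row has the same number of fields as the header; real-world files often don't, which is where the dedicated tools earn their keep.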
Data exists but it’s playing hard-to-get.
Becoming a web detective
Let’s try it!
The web detective’s toolkit:
Right-click, view source
Chrome/Firefox web consoles
Wireshark: http://www.wireshark.org/
Fiddler: http://www.fiddler2.com/
APIs
Websites: beautiful to humans, ugly to computers
API results: ugly to humans, beautiful to computers
APIs are a limited guest pass into someone else’s database.
APIs are good for:
Getting data in bulk
Getting data that’s otherwise not public
Getting live data
APIs - the general idea
Send a request, get a response.
Make your request more specific with parameters.
Read the documentation or specification for details.
APIs - GET vs. POST
GET requests can be done from your browser.
https://api.twitter.com/1.1/search/tweets.json?q=burritos&lang=en&count=100
Parameters:
q=burritos
lang=en
count=100
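The URL above can be assembled from its parameters in code. The Python sketch below only builds and inspects the URL with the standard library; it does not send the request, and the real endpoint may require authentication. The helper name build_get_url is made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_get_url(base_url, params):
    """Append parameters to the base URL as a query string."""
    return base_url + "?" + urlencode(params)


url = build_get_url(BASE_URL, {"q": "burritos", "lang": "en", "count": 100})
print(url)

# Reading the parameters back out of a URL you found in the wild
print(parse_qs(urlsplit(url).query))
```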
APIs - GET vs. POST
POST requests send parameters in the request body, where you can't see them in the URL. They require a little bit of code or other tools.
https://api.twitter.com/1.1/search/tweets.json
Parameters are not part of the URL.
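Here is a minimal Python sketch of that "little bit of code", using the standard library. It builds a POST request without sending it, so you can see that the URL stays clean and the parameters travel in the body. The URL is the one from the slide, used purely for illustration; whether that endpoint accepts POST is an assumption, and build_post_request is a made-up helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, params):
    """Build a POST request whose parameters are carried in the request body."""
    body = urlencode(params).encode("utf-8")
    return Request(url, data=body, method="POST")


request = build_post_request(
    "https://api.twitter.com/1.1/search/tweets.json",
    {"q": "burritos", "lang": "en", "count": 100},
)
print(request.get_method())  # the HTTP verb
print(request.full_url)      # the URL, with no parameters attached
print(request.data)          # the parameters, encoded as the body
```

To actually send it you would pass the request to urllib.request.urlopen, typically after adding whatever key or authentication the API's documentation asks for.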
Caveats:
APIs can be changed or shut down at any time.
Sometimes they cost money.
Sometimes there are rate limits.
Sometimes you need a key.
They usually speak JSON or XML.
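One common way to live with rate limits is to pause between requests. The Python sketch below shows the idea under simple assumptions: fetch_page stands in for whatever function calls your API, and the one-second default delay is arbitrary. Check the API's documentation for its actual limits.

```python
import time


def fetch_all(fetch_page, page_count, delay_seconds=1.0):
    """Call fetch_page(n) for each page, pausing between calls."""
    results = []
    for page_number in range(1, page_count + 1):
        results.extend(fetch_page(page_number))
        if page_number < page_count:
            time.sleep(delay_seconds)  # wait before asking for the next page
    return results


def fake_fetch_page(page_number):
    """Hypothetical stand-in for a real API call; returns two fake records."""
    return [f"record {page_number}-{i}" for i in range(2)]


print(fetch_all(fake_fetch_page, page_count=3, delay_seconds=0.1))
```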
Who has an API?
New York Times, Yelp, Twitter, Flickr, Foursquare, Instagram, LinkedIn, Vimeo, Tumblr, Facebook, Google+, YouTube, and many more
http://www.programmableweb.com/apis/directory/
If an API is a guest pass, a scraper is a crowbar.
Scraping - prerequisites
A scripting language of choice (Python, Ruby, PHP)
Familiarity with HTML and CSS
Regular expressions, if you’re unlucky
^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$
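The pattern above describes a date written as one or two digits, a slash, one or two digits, a slash, and four digits. A quick Python sketch with made-up test strings shows what it accepts. Note that it checks the shape only, not whether the date is real.

```python
import re

DATE_PATTERN = re.compile(r"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$")

# Made-up strings to try against the pattern
candidates = ["9/5/2013", "09/05/2013", "2013-09-05", "99/99/9999", "Sept. 5, 2013"]

for text in candidates:
    print(text, bool(DATE_PATTERN.match(text)))
```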
Scraping - the diligent idiot
1. Find the table
2. Find every row in the table
3. Find the third cell in each row
4. Print the contents of that cell
Scraping - Python flavor
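# Assumes BeautifulSoup (bs4) and that `html` holds the page source as a string
from bs4 import BeautifulSoup
page = BeautifulSoup(html, "html.parser")
table = page.find("table")
for row in table.find_all("tr"):
    third_cell = row.find_all("td")[2]  # search within the row, not the whole table
    print(third_cell.text)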
Scraping - PHP flavor
// Assumes simple_html_dom, where find() with an index returns a single element
$table = $page->find("table", 0);
foreach ($table->find("tr") as $row) {
    $third_cell = $row->find("td")[2];
    echo $third_cell->innertext;
}
Scraping - Ruby flavor
table = page.search("table")
table.search("tr").each do |row|
  third_cell = row.search("td")[2]
  puts third_cell.inner_html  # puts prints each cell on its own line
end
Scraping - false positives vs. false negatives
^[A-Z][a-z]+\s[A-Z][a-z]+$
Will match a first name and last name.
False positives      False negatives
South Dakota         Jean-Claude Van Damme
Lockheed Martin      Ian McKellen
Christmas Eve        Bono
Boston Globe         Sammy Davis, Jr.
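You can check both columns against the pattern yourself. This Python sketch runs the regular expression over the examples from the slide and sorts out which non-names it wrongly accepts and which real names it wrongly rejects.

```python
import re

NAME_PATTERN = re.compile(r"^[A-Z][a-z]+\s[A-Z][a-z]+$")

not_names = ["South Dakota", "Lockheed Martin", "Christmas Eve", "Boston Globe"]
real_names = ["Jean-Claude Van Damme", "Ian McKellen", "Bono", "Sammy Davis, Jr."]

# False positives: things that are not names but match anyway
false_positives = [s for s in not_names if NAME_PATTERN.match(s)]
# False negatives: real names the pattern fails to match
false_negatives = [s for s in real_names if not NAME_PATTERN.match(s)]

print("False positives:", false_positives)
print("False negatives:", false_negatives)
```

Loosening the pattern to catch more real names tends to let in more non-names, and tightening it does the reverse, so decide which kind of mistake is cheaper for your story.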
Always Be Spot-Checking
Easier to scrape
Single page
Public
Static
Lists and tables
Mobile-optimized
Harder to scrape
Multiple pages
Login required
JavaScript-heavy
Flash
Complex text patterns
ScraperWiki
Large library of existing scrapers.
No server required.
Download results in a table.
Usually requires some coding.
Sometimes costs money.
Web tables into Excel
Data exists but not all in one place.
Crowdsourcing
Ask the audience
Twitter hivemind
Google Forms
Mechanical Turk/CrowdFlower ($$$)
Free The Files (ProPublica)
Beware of biases and errors introduced by crowdsourcing.
Don’t be afraid to bake data from scratch.
(Data always tastes better when it’s homemade)
Questions?
Chris Keller: @ChrisLKeller
Noah Veltman: @veltman
http://bit.ly/ona-data-help