1 of 24

Web scraping with cheerio

Morgan Conrad flyingspaniel.com

morganconrad@yahoo.com www.nextq.info

2 of 24

Web Scraping

  • In an ideal world, all websites have a REST API.
    • Consistent, supported
    • But APIs go away (Google) or get limited (Twitter)
    • Lots of “small operators” don’t have them
  • Scraping is pulling less-structured data from the HTML
    • hopefully turning it into better structured data!
  • Some legal issues. Be polite - cache!
    • Facts can’t be copyrighted, expression of them can
    • Consider contacting the source for permission

3 of 24

Overview: Cheerio helps with steps 5-7 (shown in red on the slide)

  1. Figure out the URL to fetch the data
    1. What query terms are required? Format?
    2. Pagination can be an issue
  2. Stare in terror at the poorly formed HTML
    • Figure out the structure and format of the data (e.g. dates!).
    • Search source text or use Developer Tools.
  3. Write code
  4. Fetch the URL (use http or request)
  5. Fix any bad HTML
  6. Load / parse HTML
  7. Walk the DOM or use CSS Selectors

4 of 24

Brief intro to Cheerio

“Tiny, fast, and elegant implementation of core jQuery designed specifically for the server”

A competitor to JSDOM.

Interesting structure and implementation, but that's a topic for another talk...

v0.17.0

dependencies: CSSselect, entities, htmlparser2, dom-serializer, lodash

dependents: 1000! (including gulp-cheerio)

5 of 24

Traversing with Selectors

doc(selector, [context], [root]) // from the top

or co.find(selector) // “O-O” way, from a selection

co.children(selector) // only direct children

var table = doc('.fooClass[bgcolor="#71828A"]');

var trs = doc('tr[bgcolor="#FFFFFF"]', table);

or trs = table.find('tr[bgcolor="#FFFFFF"]');

6 of 24

“Walking the DOM”

.root()

.parent([selector]), .parents([selector])

.next([selector]), .prev([selector])

.slice(), .siblings(), ...more…

Sometimes these are simplest.

IMO, the reason to use jQuery-style selectors is to avoid these.

see: http://radar.oreilly.com/2013/05/css-selectors-as-superpowers.html

7 of 24

Traversing Functionally - cool!

.each( function(index, element))

.map( function(index, element))

.filter( selector),

.filter( function(index))

also

.eq(index) returns a one-element selection; index may be < 0 (counts from the end)

.first(), .last(), ...

8 of 24

Reading / optionally Writing Values

.attr(name [,newValue]) (writes if newValue is given)

.data(name [,newValue])

.val([newValue]) for input, select, textarea

.html([newValue]) good for rendering/debugging

.text([newValue])

.css([various options]) only inline styles (the style attribute in the HTML, not external .css files)

Also methods to change the DOM (.append, .clone)

9 of 24

Scrape AKC agility trials

Go to akc.org

-> Dog Shows & Trials

-> Event Calendar

(http://www.apps.akc.org/apps/events/search/index.cfm)

Click “Event Search” Tab

10 of 24

Competition Type = Agility Trials

Time Range = All Future Events

(or Current Calendar Year - a.k.a. pagination)

Select a few states

Click search

  • AKC is nice: a new window pops up with the URL in its title
    • Sometimes the request is internal; monitor the network traffic in Developer Tools instead
  • http://www.apps.akc.org/apps/events/search/blocks/dsp_event_list.cfm?active_tab_row=2&active_tab_col=4&fixed_id=12&club_name=&date_range=CURRYR&event_grouping=AG&save_as_default=Y&saved_states=&select_all=&states=CA
  • If you wish, experiment with dropping apparently unnecessary terms from the query.

11 of 24

12 of 24

Select table and get rows

Find table within global context.

Click around with Developer Tools or

Search Page Source for “Samoyed”.

Double check - there are two tables with class=qs_table!

Luckily, one has a distinctive bgcolor.

Data is in the table rows with a white background.

13 of 24

Back to Cheerio

var Cheerio = require('cheerio');

var Request = require('request');

Request.get(url, function (err, resp, body) {

var doc = Cheerio.load(body);

var t2 = doc('.qs_table[bgcolor="#71828A"]');

var trs = t2.find('tr[bgcolor="#FFFFFF"]');

// now what?

});

14 of 24

Iterate using .each() function

var shows = [];

trs.each(function (i, tablerow) {

var show = parseShow(i, tablerow);

if (show) // some rows are garbage...

shows.push(show);

});

15 of 24

AKC.parseShow(index, tableRow)

// find the tds for the row; the td at index 3 holds the date

var tds = Cheerio(tableRow).children('td');

var dateTD = tds[3]; // a raw parser node, not a cheerio selection

// so use a utility that walks .children[] and .data directly

var dateS = MyUtils.text(dateTD).trim();

show.dates = [new Date(dateS)];

16 of 24

Or stay within Cheerio

  • more robust if they add a <b>

var tds = Cheerio(tableRow).children('td');

var td3 = tds.eq(3);

var dateStr = td3.text().trim();

show.dates = [new Date(dateStr)];

17 of 24

Get more info from columns 0 and 1

  • Code omitted, column 0 is especially messy, I wrote utilities to work with the low level stuff
  • If date is < today, or there is an error, return null, else return a Show which gets added to shows[ ]
  • Dates are messy to parse; different sites use different conventions, esp. for multi-day events. AKC was easy because all its events are one-day.
  • shows[ ] eventually goes into MongoDB. Only refresh from the AKC web site every couple of days.

18 of 24

Scrape UKI Agility International

19 of 24

Table has id ctl00_cpMain_tblShows

Odd numbered rows have data

Even rows are the headers and dividing lines

20 of 24

Find the table we want, iterate

tbl = doc('table[id="ctl00_cpMain_tblShows"]');

tbl.find('tr').each(function (i, tr) {

if (i % 2 === 1) { // only odd rows

var show = parseShow(i, tr);

if (show) shows.push(show);

}

});

21 of 24

UKI.parseShow(index, tableRow)

var show = new Show(UKI.orgKey);

var tds = Cheerio(tableRow).children('td');

show.dates = parseDates(tds.eq(0).text());

show.club = tds.eq(1).text().trim();

show.location = tds.eq(2).text().trim();

var link = tds.eq(5).find('a');

show.urls = [UKI_ROOT + link.attr('href')];
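parseDates and UKI_ROOT are the author's own helpers, not shown. A hypothetical sketch of parseDates, assuming the cell text holds one or more dd/mm/yyyy dates (UKI is UK-based, so day-first; verify the real format against the live page):

```javascript
// Hypothetical: extract every dd/mm/yyyy match and build day-first Dates.
// The assumed format is a guess, not taken from the UKI site.
function parseDates(text) {
  var matches = text.match(/\d{2}\/\d{2}\/\d{4}/g) || [];
  return matches.map(function (s) {
    var p = s.split('/'); // [dd, mm, yyyy]
    return new Date(Number(p[2]), Number(p[1]) - 1, Number(p[0]));
  });
}

var dates = parseDates('19/07/2014 - 20/07/2014');
console.log(dates.length); // 2
```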

22 of 24

Summary

  • You must figure out the messy HTML structure
  • If “they” change the structure you are hosed
    • CSS selectors make searching more robust
  • Dates (and locations) will be an issue
  • Cheerio is very jQuery-like; easy to traverse the DOM
  • A few gotchas on return types
  • Haven’t tried modifying the HTML

23 of 24

The final product

  1. Go to www.nextq.info
  2. Select 1-5 clubs
  3. Select 1-10 states
  4. Click Search

REST: add a “/v1” before the “/ag” in the URL to see JSON

Don’t force others to scrape you!

24 of 24

Questions?

Thanks for your attention.

Thanks to Ross for some good editorial suggestions

Morgan Conrad flyingspaniel.com

morganconrad@yahoo.com www.nextq.info