Web scraping with cheerio
Morgan Conrad flyingspaniel.com
morganconrad@yahoo.com www.nextq.info
Web Scraping
Overview: Cheerio helps with the part in red
Brief intro to Cheerio
“Tiny, fast, and elegant implementation of core jQuery designed specifically for the server”
competitor to JSDOM
Interesting structure and implementation, that’s for another talk...
v 0.17.0
dependencies: CSSselect, entities, htmlparser2, dom-serializer, lodash
dependents: 1000!, including gulp-cheerio
Traversing with Selectors
doc(selector, [context], [root]) // from the top
or co.find(selector) // “O-O” way from selection
co.children(selector) // only direct children
var table= doc('.fooClass[bgcolor="#71828A"]');
var trs = doc('tr[bgcolor="#FFFFFF"]', table);
or trs = table.find('tr[bgcolor="#FFFFFF"]');
“Walking the DOM”
.root()
.parent([selector]), .parents([selector])
.next([selector]), .prev([selector])
.slice(), .siblings(), ...more…
Sometimes these are simplest.
IMO, the reason you use jQuery is to avoid these.
see: http://radar.oreilly.com/2013/05/css-selectors-as-superpowers.html
Traversing Functionally - cool!
.each( function(index, element))
.map( function(index, element))
.filter( selector),
.filter( function(index))
also
.eq(index) returns a selection holding the element at index; a negative index counts from the end
.first(), .last(), ...
Reading / optionally Writing Values
.attr(name [,newValue]) (if newValue it writes)
.data(name [,newValue])
.val([newValue]) for input, select, textarea
.html([newValue]) good for rendering/debugging
.text([newValue])
.css([various options]) only inline styles (the style attribute in the HTML, not rules from .css files)
Also methods to change the DOM (.append, .clone)
Scrape AKC agility trials
Go to akc.org
-> Dog Shows & Trials
-> Event Calendar
(http://www.apps.akc.org/apps/events/search/index.cfm)
Click “Event Search” Tab
Competition Type = Agility Trials
Time Range = All Future Events
(or Current Calendar Year - a.k.a. pagination)
Select a few states
Click search
Select table and get rows
Find table within global context.
Click around with Developer Tools or
Search Page Source for “Samoyed”.
Double check - there are two tables with class=qs_table!
Luckily, one has a distinctive bgcolor.
Data is in the table rows with a white background.
Back to Cheerio
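var Cheerio = require('cheerio');
var Request = require('request'); // assumes the request package
Request.get(url, function (err, resp, body) {
  if (err) return console.error(err);
  var doc = Cheerio.load(body);
  // pick the one qs_table with the distinctive bgcolor
  var t2 = doc('.qs_table[bgcolor="#71828A"]');
  // data rows are the ones with a white background
  var trs = t2.find('tr[bgcolor="#FFFFFF"]');
  // now what?
});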
Iterate using .each() function
var shows = [];
trs.each(function (i, tablerow) {
var show = parseShow(i, tablerow);
if (show) // some rows are garbage...
shows.push(show);
});
AKC.parseShow(index, tableRow)
// find the tds for the row; the td at index 3 (the fourth) holds the date
var tds = Cheerio(tableRow).children('td');
var dateTD = tds[3]; // raw DOM element, not a Cheerio selection
// could use a utility that walks .children[] and reads .data
var dateS = MyUtils.text(dateTD).trim();
show.dates = [new Date(dateS)];
Or stay within Cheerio
var tds = Cheerio(tableRow).children('td');
var td3 = tds.eq(3);
var dateStr = td3.text().trim();
show.dates = [new Date(dateStr)];
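Passing scraped text straight to new Date() yields an "Invalid Date" object for garbage cells, which is easy to miss. A minimal sketch of a stricter parser follows. The name parseUsDate is hypothetical, and the month/day/year format is an assumption about the page, not something confirmed here.

```javascript
// Hypothetical helper. Assumes dates look like "06/14/2014" (month/day/year).
function parseUsDate(text) {
  var m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(String(text).trim());
  if (!m) return null; // not a date at all, e.g. a header or spacer cell

  var month = Number(m[1]);
  var day = Number(m[2]);
  var year = Number(m[3]);
  var date = new Date(year, month - 1, day); // local time, months are 0-based

  // Date silently rolls over out-of-range values (02/31 becomes March),
  // so check that the parts survived the round trip.
  if (date.getFullYear() !== year ||
      date.getMonth() !== month - 1 ||
      date.getDate() !== day) {
    return null;
  }
  return date;
}

console.log(parseUsDate(' 06/14/2014 '), parseUsDate('TBD'), parseUsDate('02/31/2014'));
```

A null result gives parseShow a clear signal to skip the row.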
Get more info from columns 0 and 1
Scrape UKI Agility International
https://www.ukagilityinternational.com/
Events -> UKI Trials
https://www.ukagilityinternational.com/ShowDiary.aspx
Table has id ctl00_cpMain_tblShows
Odd-numbered rows (counting from 0) have data
Even rows are the headers and dividing lines
Find the table we want, iterate
var tbl = doc('table[id="ctl00_cpMain_tblShows"]');
tbl.find('tr').each(function (i, tr) {
if (i % 2 === 1) { // only odd rows
var show = parseShow(i, tr);
if (show) shows.push(show);
}
});
UKI.parseShow(index, tableRow)
var show = new Show(UKI.orgKey);
var tds = Cheerio(tableRow).children('td');
show.dates= parseDates(tds.eq(0).text());
show.club = tds.eq(1).text().trim();
show.location = tds.eq(2).text().trim();
var link = tds.eq(5).find('a');
show.urls = [UKI_ROOT + link.attr('href')];
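One thing to watch in the last step: .attr('href') returns undefined when the cell has no link, and string concatenation then produces a URL ending in "undefined". A hedged sketch of a safer join using Node's built-in URL class follows. The helper name absoluteUrl and the example hrefs are hypothetical; the base is the page address given earlier.

```javascript
// Base taken from the page address given on the slides.
var PAGE_URL = 'https://www.ukagilityinternational.com/ShowDiary.aspx';

// Hypothetical helper: returns an absolute URL string,
// or null when the href is missing or cannot be resolved.
function absoluteUrl(href, base) {
  if (!href) return null;
  try {
    // Resolves relative, root-relative and already-absolute hrefs alike.
    return new URL(href, base).href;
  } catch (e) {
    return null;
  }
}

// 'ShowDetails.aspx?id=1' is a made-up href for illustration.
console.log(absoluteUrl('ShowDetails.aspx?id=1', PAGE_URL));
console.log(absoluteUrl(undefined, PAGE_URL));
```

With this, show.urls can be left empty when the result is null instead of storing a broken link.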
Summary
The final product
REST: add a “/v1” before the “/ag” in the URL to see JSON
Don’t force others to scrape you!
Questions?
Thanks for your attention.
Thanks to Ross for some good editorial suggestions