
Crawling the web

Information Retrieval - University of Pisa

Marco Cornolti


Aim: crawling

Crawling: downloading pages by following links

  • Web structure: a graph (nodes = pages, edges = HTML links)
  • Limit the crawl depth (see the sketch after the figure)

[Figure: example web graph with pages P1-P6 as nodes; starting from the entry page, pages are reached at increasing distance d=1, d=2, d=3.]
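Conceptually, a depth-limited crawl is a breadth-first visit of this graph that stops expanding once the maximum distance from the entry page is reached. A minimal sketch of the idea, where get_links is a hypothetical function returning the URLs linked from a page (scrapy takes care of all of this for us in the following slides):

from collections import deque

def crawl(start_url, get_links, max_depth):
    # Breadth-first visit of the web graph, cut off at max_depth.
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url                        # download / store the page here
        if depth >= max_depth:
            continue                     # do not expand links beyond the depth limit
        for link in get_links(url):      # hypothetical: extract the links of the page
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))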


Exercise: crawling IMDB

Starting from a specific IMDB page:

  1. Download all movie pages within a certain distance
  2. Extract information from them


Environment

  • Create a directory ir/pages/

From terminal:

sudo apt-get install python-scrapy python-lxml


Crawling the web with scrapy

run the crawler:

scrapy runspider imdb_crawl.py

(alternatively, if that doesn't work:)

python -m scrapy.cmdline runspider imdb_crawl.py

edit imdb_crawl.py:

import scrapy

class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]                  # domain restriction
    start_urls = ["http://www.imdb.com/chart/top/"]     # entry web node

    def parse(self, response):
        # response.body is the response body (generally HTML code)
        print "Body:", response.body


Writing webpages to file

import scrapy
import re
import os

STORAGE_DIR = "pages"        # output directory (must be created before running)
URL_REGEX = "http://www.imdb.com/title/(tt\d{7})/\?.*"

class ImdbSpider(scrapy.Spider):
    name = "imdb"                                       # same as before
    allowed_domains = ["www.imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    def parse(self, response):
        # keep html only
        if not response.headers.get("Content-Type").startswith("text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            # filenames look like tt0123456.html
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
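The output directory must exist before the spider runs; a minimal sketch for creating it (assuming the crawler is launched from the ir/ directory, so that "pages" is the correct relative path):

import os

STORAGE_DIR = "pages"

# Create the output directory once, before the first run.
if not os.path.isdir(STORAGE_DIR):
    os.makedirs(STORAGE_DIR)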


Following links

run the crawler from terminal:

scrapy runspider imdb_crawl.py -s DEPTH_LIMIT=1

add to imdb_crawl.py:

[...]

    def parse(self, response):
        # same as before
        if not response.headers.get("Content-Type").startswith("text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)

        # get html links; only follow those that lead to movie pages
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            if re.match(URL_REGEX, url):
                yield scrapy.Request(url)
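As a quick sanity check of which links will be followed, the pattern can be tried on a couple of URLs (a standalone snippet, not part of the spider; the movie URL is only an illustrative example):

import re

URL_REGEX = "http://www.imdb.com/title/(tt\d{7})/\?.*"

# A movie URL (with a query string) matches and captures the title id...
print(re.match(URL_REGEX, "http://www.imdb.com/title/tt0068646/?ref_=chttp").group(1))  # tt0068646
# ...while other pages, such as the chart itself, do not match.
print(re.match(URL_REGEX, "http://www.imdb.com/chart/top") is None)  # True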


From HTML to data

  • HTML contains lots of stuff we are not interested in:
    • links
    • references to images
    • formatting
    • JavaScript code
    • CSS styles

We need to keep interesting data only.
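As a rough illustration of the problem (a sketch only; the next slides use XPath to pick out specific fields instead), lxml can drop the scripts and styles and keep just the visible text:

from lxml import etree

def visible_text(html_file):
    # Parse the HTML and remove <script> and <style> subtrees entirely.
    tree = etree.parse(html_file, etree.HTMLParser())
    etree.strip_elements(tree, "script", "style", with_tail=False)
    # Join what is left of the text nodes.
    return " ".join(t.strip() for t in tree.getroot().itertext() if t.strip())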


HTML tree

<html>
  <body>
    <script>...</script>
    <h1>The Godfather</h1>
    <style>...</style>
    <p itemprop="description">
      The <b>aging patriarch</b> of an
      organized crime dynasty...
    </p>
  </body>
</html>

[Figure: the HTML above drawn as a tree. <html> has the child <body>; <body> has the children <script>, <h1>, <style> and <p>; <p> has the child <b>. Each node carries its text (e.g. "The Godfather" inside <h1>, "The " inside <p>, "aging patriarch" inside <b>) and its tail, the text that follows it (e.g. " of an organized..." after <b>).]
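A short sketch of how lxml exposes this tree (the HTML string is the snippet above, slightly abbreviated):

from lxml import etree

html = ('<html><body><h1>The Godfather</h1>'
        '<p itemprop="description">The <b>aging patriarch</b>'
        ' of an organized crime dynasty...</p></body></html>')
body = etree.fromstring(html, etree.HTMLParser()).find("body")

print(body[0].tag)    # h1 -- children are reached by index or iteration
print(body[0].text)   # The Godfather -- the text inside the element
p = body.find("p")
print(p.text)         # "The " -- the text before the first child
print(p[0].text)      # "aging patriarch" -- the text inside <b>
print(p[0].tail)      # " of an organized crime dynasty..." -- the text after <b>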


Data extraction

edit processhtml.py:

from lxml import etree

def html_to_data(html_file):
    parser = etree.HTMLParser()
    tree = etree.parse(html_file, parser)
    if tree.getroot() is None:
        return None
    # what to extract depends on what we need; here, only the title
    title = None
    nodes_title = tree.xpath('//meta[@property="og:title"]/@content')
    if nodes_title:
        title = nodes_title[0]
    return title

edit parse_test.py:

from processhtml import *

if __name__ == "__main__":
    # tt0091042.html is just an example; any downloaded page works
    print html_to_data("pages/tt0091042.html")


Exploring the file system

let’s rewrite parse_test.py:

import os
from processhtml import *

PAGES_DIR = "pages"

for filename in os.listdir(PAGES_DIR):
    print html_to_data(os.path.join(PAGES_DIR, filename))


And More...

Extract more fields from IMDB pages: director, description, runtime, vote, genre, etc.

(optional) download more IMDB pages from:

http://bit.ly/1SYuht7
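A hedged starting point for the exercise, extending html_to_data from the previous slides (the description XPath is based on the <p itemprop="description"> element shown in the HTML tree slide, not verified against IMDB's current markup; the other fields are left as an exercise):

from lxml import etree

def html_to_data(html_file):
    tree = etree.parse(html_file, etree.HTMLParser())
    if tree.getroot() is None:
        return None
    data = {}
    # Title from the og:title meta tag, as before.
    titles = tree.xpath('//meta[@property="og:title"]/@content')
    if titles:
        data["title"] = titles[0]
    # Description from the <p itemprop="description"> element (an assumption).
    descriptions = tree.xpath('//p[@itemprop="description"]')
    if descriptions:
        data["description"] = "".join(descriptions[0].itertext()).strip()
    return data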


Pills of XPath

  • High-level introduction
  • W3C Tutorial
  • Extensions for Chrome and Firefox
  • XPath exercises
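For reference, the XPath expressions that appear in these slides:

  • //a/@href: the href attribute of every link (used to follow links)
  • //meta[@property="og:title"]/@content: the content attribute of the og:title meta tag (used to extract the title)
  • //p[@itemprop="description"]: the paragraph marked with itemprop="description" (the movie description from the HTML tree example)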