Crawling the web
Information Retrieval - University of Pisa
Marco Cornolti
Aim: crawling
Crawling: download pages and follow their links to reach new pages (sketched below)
[Diagram: pages P1–P6 reached by following links from the entry page, grouped by crawl depth d=1, d=2, d=3]
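Conceptually, a crawler keeps a frontier of URLs to visit: it downloads a page, extracts its links, and adds the unseen ones back to the frontier, up to a maximum depth. A minimal sketch (fetch_page, store and extract_links are hypothetical helpers, not a real library API):

from collections import deque

def crawl(entry_url, max_depth):
    seen = set([entry_url])
    frontier = deque([(entry_url, 0)])        # (url, depth) pairs
    while frontier:
        url, depth = frontier.popleft()
        page = fetch_page(url)                # download the page (hypothetical helper)
        store(url, page)                      # e.g. write it to disk (hypothetical helper)
        if depth < max_depth:
            for link in extract_links(page):  # parse hrefs out of the HTML (hypothetical helper)
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))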
Exercise: crawling IMDB
Starting from a specific link on IMDB
Environment
From terminal:
sudo apt-get install python-scrapy python-lxml
Crawling the web with scrapy
run the crawler:
scrapy runspider imdb_crawl.py
(alternatively, if that doesn’t work:)
python -m scrapy.cmdline runspider imdb_crawl.py
entry web node
edit imdb_crawl.py:
import scrapy

class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top/"]

    def parse(self, response):
        print "Body:", response.body
domain restriction
response body (generally HTML code)
Writing webpages to file
import scrapy
import re
import os

STORAGE_DIR = "pages"
URL_REGEX = r"http://www.imdb.com/title/(tt\d{7})/\?.*"

class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    def parse(self, response):
        if not response.headers.get("Content-Type").startswith("text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)
output directory (must be created before running; see the snippet below)
same as before
filenames look like tt0123456.html
keep html only
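Scrapy does not create the output directory by itself. One way to avoid the manual step is to create it right where STORAGE_DIR is defined in imdb_crawl.py (a small sketch using only the standard library):

import os

STORAGE_DIR = "pages"   # same value used by the spider

# create the output directory if it does not exist yet
if not os.path.isdir(STORAGE_DIR):
    os.makedirs(STORAGE_DIR)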
Following links
run the crawler from terminal:
scrapy runspider imdb_crawl.py -s DEPTH_LIMIT=1
only follow links that lead to movies
add to imdb_crawl.py:
[...]
    def parse(self, response):
        if not response.headers.get("Content-Type").startswith("text/html"):
            return
        m = re.match(URL_REGEX, response.url)
        if m:
            filename = os.path.join(STORAGE_DIR, m.group(1) + ".html")
            with open(filename, "wb") as f:
                f.write(response.body)

        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            if re.match(URL_REGEX, url):
                yield scrapy.Request(url)
same as before
get html links
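As an alternative to passing -s DEPTH_LIMIT=1 on the command line, recent Scrapy versions (1.0+) let you set the limit inside the spider via the custom_settings class attribute; a sketch, with parse() unchanged:

import scrapy

class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    # per-spider settings: same effect as -s DEPTH_LIMIT=1
    custom_settings = {"DEPTH_LIMIT": 1}

    # def parse(self, response): ... as before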
From HTML to data
We need to keep only the interesting data.
HTML tree
<html>
  <body>
    <script>...</script>
    <h1>The Godfather</h1>
    <style>...</style>
    <p itemprop="description">
      The <b>aging patriarch</b> of an
      organized crime dynasty...
    </p>
  </body>
</html>
[Tree diagram: the HTML above as a parse tree. Element nodes (<html>, <body>, <script>, <h1>, <style>, <p>, <b>) are connected by child edges; each element also has a text string (e.g. "The Godfather" for <h1>, "The" for <p>, "aging patriarch" for <b>) and a tail string (e.g. "of an organized..." after <b>) — see the snippet below]
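This is exactly lxml's tree model: every element has .text (the text right after its opening tag), .tail (the text between its closing tag and the next sibling), and a list of child elements. A small sketch on the snippet above:

from lxml import etree

html = ('<html><body><h1>The Godfather</h1>'
        '<p itemprop="description">The <b>aging patriarch</b> of an '
        'organized crime dynasty...</p></body></html>')

tree = etree.fromstring(html, etree.HTMLParser())
p = tree.xpath('//p[@itemprop="description"]')[0]
b = p.xpath('b')[0]

print p.text                      # "The "  -- text between <p> and its first child
print b.text                      # "aging patriarch"  -- text inside <b>
print b.tail                      # " of an organized crime dynasty..."  -- text after </b>
print [child.tag for child in p]  # ['b']  -- the children of <p>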
Data extraction
edit processhtml.py:
from lxml import etree

def html_to_data(html_file):
    parser = etree.HTMLParser()
    tree = etree.parse(html_file, parser)
    if tree.getroot() is None:
        return None
    title = None
    nodes_title = tree.xpath('//meta[@property="og:title"]/@content')
    if nodes_title:
        title = nodes_title[0]
    return title
depends on what we need
edit parse_test.py:
from processhtml import *

if __name__ == "__main__":
    print html_to_data("pages/tt0091042.html")
just an example
Exploring the file system
let’s rewrite parse_test.py:
import os
from processhtml import *

PAGES_DIR = "pages"

for filename in os.listdir(PAGES_DIR):
    print html_to_data(os.path.join(PAGES_DIR, filename))
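A natural follow-up (not part of the exercise code) is to write the extracted values to a file instead of printing them, one tab-separated line per page; a sketch, assuming html_to_data() returns the title as above and using a hypothetical titles.tsv output file:

import io
import os
from processhtml import html_to_data

PAGES_DIR = "pages"
OUTPUT_FILE = "titles.tsv"   # hypothetical output file name

with io.open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    for filename in sorted(os.listdir(PAGES_DIR)):
        if not filename.endswith(".html"):
            continue                      # skip anything that is not a stored page
        title = html_to_data(os.path.join(PAGES_DIR, filename))
        if title is not None:
            # one line per movie: <imdb id> <TAB> <title>
            out.write(u"%s\t%s\n" % (filename[:-len(".html")], title))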
And More...
Extract more fields from IMDB pages: director, description, runtime, vote, genre, etc. (a sketch follows below)
(optional) download more IMDB pages from:
http://bit.ly/1SYuht7
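For the first exercise, html_to_data() can be extended with one XPath query per field and made to return a dictionary. The selectors below (other than og:title, used above) are only plausible guesses — IMDB's markup changes over time, so check them against the downloaded pages:

from lxml import etree

def html_to_data(html_file):
    parser = etree.HTMLParser()
    tree = etree.parse(html_file, parser)
    if tree.getroot() is None:
        return None

    def first(xpath_query):
        # return the first match of an XPath query, or None
        nodes = tree.xpath(xpath_query)
        return nodes[0] if nodes else None

    return {
        "title": first('//meta[@property="og:title"]/@content'),
        # the queries below are guesses: verify them on the actual HTML
        "description": first('//p[@itemprop="description"]//text()'),
        "director": first('//span[@itemprop="director"]//span[@itemprop="name"]/text()'),
        "vote": first('//span[@itemprop="ratingValue"]/text()'),
    }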
Pills of XPath