What is a Crawler?
A web crawler is a computer program that visits web sites and indexes pages. Google and other search engines run crawlers continuously. The result of a crawl is in an Inverse Search Index for the entire web-- the data that allows your searches to be instantaneous.
Here's the algorithm a crawler follows as it visits each site:
- If it is well behaved, it checks the file 'robots.txt' on a site to see if that site wants its pages to be crawled.
- It indexes each page by counting the number of times each important term occurs. It ignores stop-words like 'a', 'the', etc. Some crawlers give more weight to terms found in the header or at the top of the page.
- Adds the term-page relationships to the global Inverse Index for the web.
- Finds the links in each page and adds them to the global link graph for the web.
When the crawler visits a page, it creates a forward index of all of its terms. So if
document1 consisted of the following text: 'The bad cow ate a cow', and
document2 consisted of 'the bad moose died', then the crawler would
create a forward index something like:
document1: bad 1, cow 2, ate 1
document2: bad 1, moose 1, died 1
with
the counts specifying the number of times a term appears in the
document, and stop words like 'the' and 'a' ignored. Of course indexing
can be more complex and also record information such as the location of
each 'hit' in the document.
Now Search engines need the inverse
of such an index-- they need to find documents based on keywords. An
inverse index uses the terms as the key. For our sample, we'd get:
bad --> document1, document2
cow --> document1
moose-->document2
died-->document2
ate-->document1
As
the crawler visits each web page, it updates the inverse index. When a
user initiates a search, the engine just finds the term(s) in the
inverse index, grabs the list of documents that match, and determines
the rank of each, and sends back the top results.Ranking Pages
The ranking of search results-- which ten appear on the
first page of results, and even which comes up first or second, is of
great importance. In today's business world, and certainly e-commerce,
it can determine whether a company is successful or not. It also a key
determinant in politics and the battle of ideas and ideologies.
A page's rank is the key to it gaining popularity. One
study (link) found that users chose the first search result 42% of the time, the second 8%
of the time, and so on. Rarely do user's go to the second page of results.
Algorithm
Prior to
Google, term analysis was the primary way that pages were ranked. Google
changed this by including Page Rank-- a global measure of a site's
popularity, based on the number of links coming into it--as a key ranking factor. Now ranking on most search engines
is a combination of two factors:
Ranking = Page Rank + Term Analysis
Most
search engines don't publish their secret sauce-- their ranking
algorithm. This has helped created a cottage industry of search engine
optimization experts who specialize in understanding, through reverse
engineering, the ranking algorithm of Google and other engines. Check
out semoz.org's keys to Google's ranking system. You can also use their on-line tools to check search attributes of your pages and keywords.
Term Analysis
The goal of term analysis is to measure what a page is about-- its topics-- based on the terms in the document. The simplest analysis just counts each of the terms in the document-- if 'flowers' appears many times, the analysis determines the pages is about flowers. More complex analysis can weigh terms based on where they appear in the web page,
e.g., titles and text at the top of the document count more since they are
more likely to have terms summarizing the topic of the page.
Another interesting technique used by many search engines is to also consider the anchor text of inbound links. If other pages link to the page, the text near the
link is a good indicator of what the linked to page is about. The
terms have the advantage of coming from a different source, so
there is more trust that the terms are not there just to 'game' the
search engines.
Page Rank
Page rank is Google's algorithm for measuring how popular a website is, globally, without respect to any particular keyword. Here's a description of Page Rank:
PageRank relies on the uniquely democratic nature of the web by using
its vast link structure as an indicator of an individual page's value.
In essence, Google interprets a link from page A to page B as a vote,
by page A, for page B. But, Google looks at more than the sheer volume
of votes, or links a page receives; it also analyzes the page that
casts the vote. Votes cast by pages that are themselves "important"
weigh more heavily and help to make other pages "important" (from wikipedia)
In the general case, the PageRank value for any page u can be expressed as:

,
Let's break down this formula. First, the set Bu is the set of all backward links of the page u, that is, all pages which link to the page u. But the formula doesn't just count how many pages link to u, it weighs each of those pages by how important they are-- their page rank. This gives us the numerator, PR(v).
But if a page points to u and also points to thousands of other pages, that 'vote' is not as important as the case of a page pointing to only u or u and a few other pages. This is where the denominator comes in-- L(v) is the number of outward links on the pages that point to u. The idea is that if a page v has many links, it is less likely that a random surfer will jump
to the page u.
As an example, consider a small web of four documents, with the following links between them:
b->a
c->a
d->a
b->c
d->b
d->c
We'll initialize each page to .25 to start.Then:
PR(a) = PR(b)/L(b) + PR(c)/L(c)+PR(d)/L(d)
L(b)=2 since b points to two pages. L(c)=1 and L(d)=3.
so
PR(a)=.25/2+.25/1+.25/3= .125+.25+.083=.45833333.
We then
calculate the page ranks of B,C, and D, and continue until the ranks
settle.
Damping factor-- Page
Rank is based on the random surfer model, which says that a web surfer
will randomly choose a link on a page. This idea is responsible for the
L value above-- if an inbound page has many links, it doesn't count as
much for each page it links to (think of it like the page splitting its
vote). The random surfer model also takes into account the likelihood
of a random surfer not following any links but instead typing in a new
url. The damping factor is the likelihood that they will continue
clicking links, and is generally estimated to be around .85 (so 15% of
the time the user will enter a new url). With the damping factor
included and also using the size of the universe of documents (the
web), we get the page rank formula:
Specialized Search
The World Live Web, e.g., Technorati
It takes ordinary web crawlers time to find newly created web pages, days and even weeks. Blog search tools don't crawl the entire web. Instead, they use a process based on pinging. When a blogger, twitterer, etc. posts a new item, the blogging system notifies (pings) a service that records all newly created pages. Blog search engines then use the pinging service to find newly created pages and index them. Within a matter of minutes, your newly created blog post can be searchable within these systems (hence, the term World Live Web)
Common blog tools include Wordpress and Blogger (now owned by Google). These tools automatically notify pinging services when one of their bloggers creates a new post. Tools like Technorati and Google Blog search then index these pages.
Historical Search
Internet Archive has the entire history of the web. You can access pages through its WayBack Machine. For instance, you can check out USFs homepages over the last fifteen years, or the way Yahoo has progressed.
The Archive is frequently used in legal matters and its existence, of course, makes many corporations and others nervous. Here's an article
concerning how Healthcare Advocates sued the Internet Archive because
they were sued by a company who used archived pages in their suit.
Visualization:
The Visualization of search results hasn't change much since the early search engines. There is research in this area, however. For instance, Kartoo is a tool that provides search results in a relational graph instead of a linear list. You can see not only URLs but terms and how they are related. You can also navigate information associatively.