What is a Crawler?

A web crawler is a computer program that visits web sites and indexes pages. Google and other search engines run crawlers continuously. The result of a crawl is in an Inverse Search Index for the entire web-- the data that allows your searches to be instantaneous.

Here's the algorithm a crawler follows as it visits each site:


When the crawler visits a page, it creates a forward index of all of its terms. So if document1 consisted of the following text: 'The bad cow ate a cow', and document2 consisted of 'the bad moose died', then the crawler would create a forward index something like:

    document1: bad  1, cow  2, ate   1
    document2: bad   1, moose 1, died 1

with the counts specifying the number of times a term appears in the document, and stop words like 'the' and 'a' ignored. Of course indexing can be more complex and also record information such as the location of each 'hit' in the document.

Now Search engines need the inverse of such an index-- they need to find documents based on keywords. An inverse index uses the terms as the key.  For our sample, we'd get:

    bad --> document1, document2
    cow --> document1
    moose-->document2
    died-->document2
    ate-->document1
    
As the crawler visits each web page, it updates the inverse index. When a user initiates a search, the engine just finds the term(s) in the inverse index, grabs the list of documents that match, and determines the rank of each, and sends back the top results.

Ranking Pages

The ranking of search results-- which ten appear on the first page of results, and even which comes up first or second, is of great importance. In today's business world, and certainly e-commerce, it can determine whether a company is successful or not. It also a key determinant in politics and the battle of ideas and ideologies.

A page's rank is the key to it gaining popularity. One study
(link)  found that users chose the first search result 42% of the time, the second 8% of the time, and so on. Rarely do user's go to the second page of results.

Algorithm
Prior to Google, term analysis was the primary way that pages were ranked. Google changed this by including Page Rank-- a global measure of a site's popularity, based on the number of links coming into it--as a key ranking factor. Now ranking on most search engines is a combination of two factors:

    Ranking = Page Rank + Term Analysis

Most search engines don't publish their secret sauce-- their ranking algorithm. This has helped created a cottage industry of search engine optimization experts who specialize in understanding, through reverse engineering, the ranking algorithm of Google and other engines. Check out semoz.org's  keys to Google's ranking system. You can also use their on-line tools to check search attributes of your pages and keywords.

Term Analysis

The goal of term analysis is to measure what a page is about-- its topics-- based on the terms in the document. The simplest analysis just counts each of the terms in the document-- if 'flowers' appears many times, the analysis determines the pages is about flowers. More complex analysis can weigh terms based on where they appear in the web page, e.g., titles and text at the top of the document count more since they are more likely to have terms summarizing the topic of the page.

Another interesting technique used by many search engines  is to also consider the anchor text of inbound links. If other pages link to the page, the text near the link is a good indicator of what the linked to page is about. The terms  have the advantage of coming from a different source, so there is more trust that the terms are not there just to 'game' the search engines.

Page Rank

Page rank is Google's algorithm for measuring how popular a website is, globally, without respect to any particular keyword. Here's a description of Page Rank:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important" (from wikipedia)

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)},

Let's break down this formula. First, the set Bu is the set of all backward links of the page u, that is, all pages which link to the page u. But the formula doesn't just count how many pages link to u, it weighs each of those pages by how important they are-- their page rank. This gives us the numerator, PR(v).


But if a page points to u and also points to thousands of other pages, that 'vote' is not as important as the case of a page pointing to only u or u and a few other pages. This is where the denominator comes in-- L(v) is the number of outward links on the pages that point to u. The idea is that if a page v has many links, it is less likely that a random surfer will jump to the page u.


As an example, consider a small web of four documents, with the following links between them:


    b->a
    c->a
    d->a
    b->c
    d->b
    d->c


We'll initialize each page to .25 to start.Then:


        PR(a) = PR(b)/L(b) + PR(c)/L(c)+PR(d)/L(d)


        L(b)=2 since b points to two pages. L(c)=1 and L(d)=3.

so

        PR(a)=.25/2+.25/1+.25/3= .125+.25+.083=.45833333.


We then calculate the page ranks of B,C, and D, and continue until the ranks settle.


Damping factor-- Page Rank is based on the random surfer model, which says that a web surfer will randomly choose a link on a page. This idea is responsible for the L value above-- if an inbound page has many links, it doesn't count as much for each page it links to (think of it like the page splitting its vote). The random surfer model also takes into account the likelihood of a random surfer not following any links but instead typing in a new url. The damping factor is the likelihood that they will continue clicking links, and is generally estimated to be around .85 (so 15% of the time the user will enter a new url). With the damping factor included and also using the size of the universe of documents (the web), we get the page rank formula:

PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR (p_j)}{L(p_j)}

Specialized Search

The World Live Web, e.g., Technorati

It takes ordinary web crawlers time to find newly created web pages, days and even weeks. Blog search tools don't crawl the entire web. Instead, they use a process based on pinging. When a blogger, twitterer, etc. posts a new item, the blogging system notifies (pings) a service that records all newly created pages. Blog search engines then use the pinging service to find newly created pages and index them. Within a matter of minutes, your newly created blog post can be searchable within these systems (hence, the term World Live Web)

Common blog tools include Wordpress and Blogger (now owned by Google). These tools automatically notify pinging services when one of their bloggers creates a new post. Tools like Technorati and Google Blog search then index these pages.

Historical Search

Internet Archive has the entire history of the web. You can access pages through its WayBack Machine. For instance, you can check out USFs homepages over the last fifteen years, or the way Yahoo has progressed.

The Archive is frequently used in legal matters and its existence, of course, makes many corporations and others nervous. Here's an article concerning how Healthcare Advocates sued the Internet Archive because they were sued by a company who used archived pages in their suit.

Visualization:
The Visualization of search results hasn't change much since the early search engines. There is research in this area, however. For instance, Kartoo is a tool that provides search results in a relational graph instead of a linear list. You can see not only URLs but terms and how they are related. You can also navigate information associatively.