squid-2678

Version:

2.7.S7

Bug link:

http://bugs.squid-cache.org/show_bug.cgi?id=2678

Symptom (Failure):

Squid’s storeurl_rewrite feature does not work. When a URL matches a storeurl_rewrite rule, so that it should be stored under a rewritten URL, every subsequent request for it results in a cache miss.

Background:

Storeurl_rewrite is a feature in Squid:

http://wiki.squid-cache.org/Features/StoreUrlRewrite

The motivation is that for certain websites (especially those that use a CDN), the URL of the same page may vary from request to request. For example,

http://kh3.google.com.au/

http://kh0.google.com.au/

might both refer to the same page. Without canonicalization, Squid would store two copies of the same content, one per URL. Storeurl_rewrite rewrites the actual URL into an intermediate URL using regular-expression rules, so the two URLs above map to the same internal representation, such as SQUIDINTERNAL.google.com.au. Users write a small helper program that defines the canonicalization rules, that is, which URLs map to what.
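
As a rough illustration of the idea (not Squid's implementation, and not the helper protocol), the sketch below maps any host of the form kh<digits>.google.com.au to one canonical name. The function name canonical_store_host and the digits-only matching rule are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only: writes a canonical host name
 * into out (truncated if out is too small). Hosts of the form
 * "kh<digits>.google.com.au" all map to "SQUIDINTERNAL.google.com.au";
 * any other host is copied unchanged.
 * Returns 1 if the host was rewritten, 0 otherwise. */
int canonical_store_host(const char *host, char *out, size_t outlen)
{
    static const char suffix[] = ".google.com.au";
    const char *p = host;

    assert(host != NULL && out != NULL && outlen > 0);

    if (p[0] == 'k' && p[1] == 'h' && isdigit((unsigned char) p[2])) {
        p += 2;
        while (isdigit((unsigned char) *p))
            p++;                        /* skip the varying server number */
        if (strcmp(p, suffix) == 0) {
            snprintf(out, outlen, "SQUIDINTERNAL%s", suffix);
            return 1;
        }
    }
    snprintf(out, outlen, "%s", host);  /* no rule matched: keep as is */
    return 0;
}
```

Called with kh3.google.com.au and kh0.google.com.au, the function produces the same string for both, so a cache keyed on that string would hold a single copy.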

However, in this version the feature does not work: the URLs above are never served from cache, and every subsequent request results in a cache miss.

How it is diagnosed:

We reproduced the failure!

Simply enable the storeurl_rewrite module as described in:

http://wiki.squid-cache.org/Features/StoreUrlRewrite

We used www.apache.org as the test URL, so we wrote a simple Perl script to serve as the storeurl_rewrite_program:

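#!/usr/bin/perl

# Enable autoflush so Squid receives each reply immediately.
$| = 1;

# Read one request line at a time from Squid on stdin.
while (<>) {
       chomp;
       # print STDERR $_ . "\n";
       # If the line contains www.apache.<rest>, reply with
       # www.apache.SQUIDINTERNAL followed by <rest>.
       if (m/www\.apache\.(.*)/) {
               print "www.apache.SQUIDINTERNAL" . $1 . "\n";
       }
}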

We then used telnet to request www.apache.org repeatedly and observed that Squid returned a cache miss each time.

Root Cause:

When the cache content is stored, the canonicalized URL (store_url) “www.apache.SQUIDINTERNAL” is never used; instead, the hash key is generated from the original URL, “www.apache.org”. During hash lookup, however, the store_url is used. Because the two sides key on different URLs, the lookup can never produce a cache hit.

This is similar to Apache httpd bug 38017 (see: https://docs.google.com/document/pub?id=1c1gCJ_s5pPzeksGM5oXe70E_ccBklVWa3yotrgmNLws).

During cache store, the URL rewrite rule was not applied:

/* storeAddVary is called during the cache store process.

  In the buggy version,

  url = www.apache.org;

  In the fixed version, pass in store_url in addition to url.

  store_url = www.apache.SQUIDINTERNAL */

-storeAddVary(const char *url, method_t * method, … ...)

+storeAddVary(const char *store_url, const char *url, … ...)

{

   AddVaryState *state;

   … …

  /* storeCreateEntry would set ‘e->mem_obj->url’ field to ‘www.apache.org’.

   * However, ‘e->mem_obj->store_url’ would be NULL. In the patch, after calling

   * storeCreateEntry, squid further sets ‘e->mem_obj->store_url’ to

   * ‘www.apache.SQUIDINTERNAL’. */

   state->e = storeCreateEntry(url, flags, method);

+  if (store_url)

+    state->e->mem_obj->store_url = xstrdup(store_url);

   /* This log message would be very helpful if printed. Unfortunately, it is

    * logged at debug level 2, which the default verbosity does not include,

    * so it is not printed under the default settings. */

   debug(11, 2) ("storeAddVary: %s (%s) %s %s\n",

        state->url, state->key, state->vary_headers, state->etag);

   … ...

   storeSetPublicKey(state->e);

    … ...

}

void storeSetPublicKey(StoreEntry * e) {

   const cache_key *newkey;

   MemObject *mem = e->mem_obj;

   … …

  /* newkey is therefore generated from www.apache.org,

   * not www.apache.SQUIDINTERNAL (see storeLookupUrl below). */

   newkey = storeKeyPublic(storeLookupUrl(e), mem->method);

    … …

  /* Now Squid inserts newkey, along with e, into the cache.

   * This newkey is the hash value for ‘www.apache.org’

   *   -- the unmodified URL. */

   storeHashInsert(e, newkey);

   … ...

}

/* storeLookupUrl returns the URL to be hashed and stored as the key.

 * It returns e->mem_obj->store_url if that is set, and otherwise falls back

 * to e->mem_obj->url. It would be desirable to log in this function which

 * URL was returned. This fits the default-switch pattern: the value

 * actually returned is the ‘default’ among all possible

 * ‘cases’. */

const char * storeLookupUrl(const StoreEntry * e) {

   if (e == NULL)

        return "[null_entry]";

   else if (e->mem_obj == NULL)

        return "[null_mem_obj]";

   else if (e->mem_obj->store_url)

        return e->mem_obj->store_url;

   else

       /* Error point: Errlog can put a log message here! */

        return e->mem_obj->url;

}

-------- Now, during the hash lookup process, Squid uses the canonicalized name

        www.apache.SQUIDINTERNAL for the lookup. Since ‘www.apache.org’ was stored

       as the key, the lookup fails. -----

StoreEntry * storeGet(const cache_key * key)

{

   debug(20, 3) ("storeGet: looking up %s\n", storeKeyText(key));

   return (StoreEntry *) hash_lookup(store_table, key);

}

StoreEntry *

storeGetPublicByRequestMethod(request_t * req, const method_t method) {

   … ...

   return storeGet(storeKeyPublicByRequestMethod(req, method));

}

const cache_key * storeKeyPublicByRequestMethod(request_t * request, const method_t method) {

   … …

    /* (gdb) p req->store_url

          $31 = 0x8fd260 "http://www.apache.SQUIDINTERNAL/"

    */

   if (request->store_url) {

        url = request->store_url;

   } else {

        url = urlCanonical(request);

   }

  … …

}
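
The mismatch between the store path and the lookup path can be reproduced in a few lines. The sketch below is a toy model, not Squid code: toy_store_key_url, toy_lookup_key_url, and toy_is_hit are hypothetical names, and the toy compares the chosen URL strings directly instead of hashing them, which is enough to show that the two paths key on different inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store path: mirrors the choice made in storeLookupUrl, which prefers the
 * entry's store_url and falls back to url. In the buggy version store_url
 * was never set on the entry, which corresponds to passing NULL here. */
const char *toy_store_key_url(const char *entry_store_url,
                              const char *entry_url)
{
    assert(entry_url != NULL);
    return entry_store_url != NULL ? entry_store_url : entry_url;
}

/* Lookup path: mirrors the choice made in storeKeyPublicByRequestMethod,
 * which uses the request's store_url when the rewriter produced one. */
const char *toy_lookup_key_url(const char *req_store_url,
                               const char *req_url)
{
    assert(req_url != NULL);
    return req_store_url != NULL ? req_store_url : req_url;
}

/* Toy hit test: the keys can only match if both paths keyed the same URL. */
int toy_is_hit(const char *stored, const char *looked_up)
{
    return strcmp(stored, looked_up) == 0;
}
```

With url set to http://www.apache.org/ and store_url set to http://www.apache.SQUIDINTERNAL/, passing NULL as the entry's store_url (the buggy behavior) makes toy_is_hit return 0. Passing the rewritten URL on both sides (the patched behavior) makes it return 1.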

Is there any log message?

No.

Can Errlog print an error msg?

Yes. As discussed above, it would be desirable to log in the function ‘storeLookupUrl’ which URL it returned and thus which URL was used as the store key. This falls into the generalized default-switch pattern: the switch logic is implemented with if ... else if ..., and the value actually returned is the ‘default’ among all possible ‘cases’.
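
One possible shape for such a message is sketched below. The MemObject and StoreEntry definitions are cut-down stand-ins for the real Squid structures, and lookup_url_logged, its FILE * parameter, and the message text are assumptions for this example rather than anything in the Squid source.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-ins for the Squid structures (illustration only). */
typedef struct { const char *url; const char *store_url; } MemObject;
typedef struct { MemObject *mem_obj; } StoreEntry;

/* Same decision chain as storeLookupUrl, plus a message on the fall-through
 * branch so that keying an entry by its unmodified URL becomes visible.
 * "log" is a hypothetical sink; real Squid code would go through debug(). */
const char *lookup_url_logged(const StoreEntry *e, FILE *log)
{
    assert(log != NULL);

    if (e == NULL)
        return "[null_entry]";
    else if (e->mem_obj == NULL)
        return "[null_mem_obj]";
    else if (e->mem_obj->store_url)
        return e->mem_obj->store_url;
    else {
        /* Default case: no rewritten URL is attached to this entry. */
        fprintf(log, "storeLookupUrl: no store_url set, using url %s\n",
                e->mem_obj->url ? e->mem_obj->url : "(null)");
        return e->mem_obj->url;
    }
}
```

Had a message like this been emitted at the default verbosity, each stored object in the reproduction would have produced a line naming www.apache.org as the key URL, pointing directly at the missing store_url.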