2.6.STABLE20 (fixed in 2.6.STABLE21)

Bug link:



How is this failure diagnosed (did we reproduce it)?

Since this failure requires a complex environment setup, it is not reproducible. We only analyzed the source code and the discussions.



In an array of 30 Squid servers running together behind a load balancer, the number of "available file descriptors" would from time to time gradually drop from 14K (the level under normal load), but only on one particular Squid at a time. For users, access to particular websites simply froze.

Detailed & Interesting:

Here is the entire diagnosis process:

U: Bug report, debug-level log, and other observations uploaded.

D: Issued a patch to collect additional info; also asked for other info.

U: Richer log and additional info uploaded.

D: Could not figure out the problem.

U: More debug logs uploaded.

.... Discussion ....

D: Asked for DNS statistics.

U: DNS statistics uploaded.

D: Issued a patch to collect additional log info.

U: Additional info uploaded.

D: Patch issued. -- An error message was added!

Root cause:

Bug in the DNS TCP connection failure management. Whenever there is a DNS failure, Squid immediately tries to resend the DNS query. Each such query opens a socket and keeps it open until it receives a response or expires.

HOWEVER, in the buggy version, when there was a DNS failure, not even an error message was printed!!! Together with fixing the failure, the developers also added an error message to indicate a DNS failure!

These open sockets sit on a queue (idns_query) and quickly fill it up, so queries freeze there until some query times out.

Thus the combined result is that the number of outstanding DNS queries can grow indefinitely.

Here is the patch:

diff -u -p -r1.69 dns_internal.c
@@ -749,8 +747,9 @@ idnsSendTcpQuery(int fd, int status, voi
     idns_query *q = data;
     short nsz;
     if (status != COMM_OK) {
-        dlinkDelete(&q->lru, &lru_list);
-        idnsSendQuery(q);
+        int ns = (q->nsends - 1) % nns;
         /* As we can see, there is a DNS lookup failure here,
            but previously there was not even an error message
            indicating such an error! */
+        debug(78, 1) ("idnsSendTcpQuery: Failed to connect to DNS server %d using TCP\n", ns + 1);
+        idnsTcpCleanup(q);




Here is the call stack leading to idnsSendTcpQuery:

#0  idnsSendTcpQuery
#1  0x000000000042ba66 in commConnectCallback (cs=0xa366f8, status=-8) at comm.c:321
#2  0x000000000042b92d in commConnectDnsHandle (ia=0x0, data=0xa366f8) at comm.c:297
#3  0x000000000045adbb in ipcacheCallback (i=0xa3e4c0) at ipcache.c:224
#4  0x000000000045b36c in ipcacheHandleReply (data=0x8d4858, answers=0x0, na=-16, error_message=0x4a6e2b "Timeout") at ipcache.c:400
#5  0x0000000000431482 in idnsCallback (q=0xa33aa8, answers=0x0, n=-16, error=0x4a6e2b "Timeout") at dns_internal.c:646
#6  0x0000000000432458 in idnsCheckQueue (unused=0x0) at dns_internal.c:920

static void
commConnectDnsHandle(const ipcache_addrs * ia, void *data)
{
    ... ...
    /* ia == NULL indicates a DNS error! */
    if (ia == NULL) {
        /* When we are here, a DNS error has already occurred. The
           debug message below would be very helpful if printed, but
           unfortunately it is logged at level 3 and therefore not
           printed under the default level 1. */
        debug(5, 3) ("commConnectDnsHandle: Unknown host: %s\n", cs->host);
        commConnectCallback(cs, COMM_ERR_DNS);
    ... ...

static void
idnsCheckQueue(void *unused)
{
    ... ...
    /* The root cause is here. idnsCheckQueue is called from a periodic
       event (Squid is an event-driven system) and checks whether each
       DNS query has timed out. If so, it calls idnsCallback with NULL
       answers and "Timeout", which triggers the logic above. */
    if (tvSubDsec(q->start_t, current_time) < Config.Timeout.idns_query) {
        ...
    } else {
        ... ...
        idnsCallback(q, NULL, -16, "Timeout");



Is there an Error Message?

No. The buggy version printed no error message at all on the DNS TCP connect failure; the fix added one (at level 1, so it is visible by default).


Can developers anticipate the Failure?

Yes. The developers already checked for the DNS error (ia == NULL, status != COMM_OK), but they didn't log it.

How can Errlog handle it?

By learning frequent logging patterns. In fact, the condition

"status != COMM_OK"

is a frequently logged condition!