squid-2872

Version:

3.0 and 3.1 all (used 3.1.6 to diagnose)

Bug report link:

http://bugs.squid-cache.org/show_bug.cgi?id=2872

How it is diagnosed (reproduced or source analysis)?

Source analysis + reading discussion (we cannot reproduce the failure).

Symptom:

Squid is leaking file descriptors on stored objects. No error message was printed.

Because there is no error message, it was really difficult for the developers to diagnose the failure, resulting in multiple rounds of discussions between developers & users:

U: bug report, configuration information

D: Asking for more configuration information

.... Discussion ....

U: One user experienced exhaustion of file

    descriptor resulted from the leak, while

   others did not.

.... Discussion ....

D: asking for debug level log

U: Cannot reproduce the failure when debug level

    log is enabled.

U: Instrumented code to collect more logging

    information, log uploaded.

D: Ask for additional debugging information

U: More debug log uploaded

.... Discussion ....

D: patch issued. -- Added a warning message!

Root cause:

This occurs when squid is using asynchronous storage: A client request kicks off an
asynchronous file open request, so squid will not open that file immediately. If the client request is aborted and disappears
before the file open has completed, the file is never closed again.

During file open, squid used ‘cbdataReferenceValidDone’ to validate client’s data. Since the client died,‘cbdataReferenceValidDone’ returned false, indicating the data is not valid (because client exit early). However, developers did not handle this case properly (didn’t expect this). So when ‘cbdataReferenceValidDone’ returns false, ‘kickReadQueue’ should have also return false, so the caller can free things with: ‘’tryClosing”.

store_io_ufs.cc

UFSStoreState::kickReadQueue() {

   ... …  

   /* cbdataReferenceValidDone is to validate the store (cached) data

      by calling cbdataReferenceValid. It returns 0 if the data is not valid. */

   if (cbdataReferenceValidDone(q->callback_data, &cbdata)) {

        read_(q->buf, q->size, q->offset, q->callback, cbdata); //normal

   }

+    else {

+        debugs(79, 2, "UFSStoreState::kickReadQueue: this: " << this << " cbdataReferenceValidDone returned false." << " closing: " << closing << " flags.try_closing: " << flags.try_closing);

+        delete q;

+        return false;

+    }

   delete q;

   return true;

}

UFSStoreState::openDone() {

   ... ...

   if (FILE_MODE(mode) == O_WRONLY) {

       drainWriteQueue();

   } else if ((FILE_MODE(mode) == O_RDONLY) && !closing) {

       if (kickReadQueue())

           return;

   }

   if (flags.try_closing)

       tryClosing();

}

----------- Squid good practice on monitoring resource -------------------

Actually, if the resource is exhausted, there would be an error message printed in the following logic:

httpAccept(... ...) {

   if (flag == COMM_ERR_CLOSING) {

       return;

   }

   if (!okToAccept())

       AcceptLimiter::Instance().defer (sock, httpAccept, data);

   else

       comm_accept(sock, httpAccept, data);

   … …

}

bool okToAccept() {

   static time_t last_warn = 0;

   if (fdNFree() >= RESERVED_FD)

       return true;

   if (last_warn + 15 < squid_curtime) {

       debugs(33, 0, HERE << "WARNING! Your cache is running out of filedescriptors");

       last_warn = squid_curtime;

   }

   return false;

}

int fdNFree(void) {

   return Squid_MaxFD - Number_FD - Opening_FD;

}

   /* XXX account fd_table */

   /* Keep a few file descriptors free so that we don't run out of FD's

    * after accepting a client but before it opens a socket or a file.

    * Since Squid_MaxFD can be as high as several thousand, don't waste them */

   RESERVED_FD = min(100, Squid_MaxFD / 4);

So essentially, they have a number of reserved file descriptors (less than 100). As long as the available FS is lower than that number, they will print a warning.

Is there Error Message?

No.

Resource leak itself (unless exhausted) won’t have any message. The user reported the failure didn’t encounter error message.

However, if the resource really got EXHAUSTED, then there would be an error message below:

2010/04/07 04:06:23| client_side.cc(2980) okToAccept: WARNING! Your cache is running out of filedescriptors

But this is very hard to reach.

Can Errlog automatically insert an error msg?

Yes. The error location is the when library function ‘cbdataReferenceValidDone’ returns 0 -- indicating an error!

In fact ‘cbdataReferenceValidDone’ returns ‘true’ in most cases, and in many cases, when it returns ‘false’, it would be logged with an error msg. Therefore Errlog can potentially learn this error condition as it is a frequently logged pattern.

----------------------------------------- cbdata.cc: cbdataReferenceValidDone ---------------

cbdataInternalReferenceDoneValid()  // cbdataReferenceValidDone is defined as

{

   void *p = (void *) *pp;

   int valid = cbdataReferenceValid(p);

   if (valid) {

       *tp = p;

       return 1;

   } else {

       *tp = NULL;

        /* We can log here! */

       return 0;

   }

}

Errlog can also help diagnose this by Errlog-AG’s resource usage monitoring.