httpd-26467

Version:

2.0.55

Bug Link:

https://issues.apache.org/bugzilla/show_bug.cgi?id=26467

Symptom:

Httpd hangs after a graceful restart.

How did we diagnose this failure?

We did not reproduce it, only rely on source code analysis.

Root Cause:

Deadlock between processes.

A graceful restart stops the 'rotatelogs' child, but a long-lived httpd
process may still be serving a connection via KeepAlive.  The httpd child
continues to write logdata into the pipe, but there's no process reading from
the pipe anymore.  Eventually the pipe fills up, and the httpd child just hangs.

the process will hang at

apr_file_write(apr_file_t *thefile, const void *buf, apr_size_t *nbytes) {

   do {

               rv = write(thefile->filedes, buf, *nbytes);

       } while (rv == (apr_size_t)-1 && errno == EINTR);

   if (rv == (apr_size_t)-1 &&

       (errno == EAGAIN || errno == EWOULDBLOCK) &&  thefile->timeout != 0) {

       /* Errlog can put a log message here since ‘write’ already returned

          error, and very likely we will hang! */

           /* Hangs within apr_wait_for_io_or_timeout */

           apr_status_t arv = apr_wait_for_io_or_timeout(thefile, NULL, 0);

           if (arv != APR_SUCCESS) {

               *nbytes = 0;

               return arv;

           }

           else {

               do {

                   do {

                       rv = write(thefile->filedes, buf, *nbytes);

                   } while (rv == (apr_size_t)-1 && errno == EINTR);

                   if (rv == (apr_size_t)-1 &&

                       (errno == EAGAIN || errno == EWOULDBLOCK)) {

                       *nbytes /= 2; /* yes, we'll loop if kernel lied

                                      * and we can't even write 1 byte

                                      */

                   }

                   else {

                       break;

                   }

               } while (1);

           }

       }

….

apr_status_t apr_wait_for_io_or_timeout(apr_file_t *f, apr_socket_t *s,

                                         int for_read)

{

  …

  do {

      srv = apr_poll(&pollset, 1, &n, timeout);

      if (n == 1 && pollset.rtnevents & type) {

          return APR_SUCCESS;

      }

  } while (APR_STATUS_IS_EINTR(srv));

  return srv;

}

APR_DECLARE(apr_status_t) apr_poll(apr_pollfd_t *aprset, apr_int32_t num,

                    apr_int32_t *nsds, apr_interval_time_t timeout)

{

 ...

  if (timeout > 0) {

      timeout /= 1000; /* convert microseconds to milliseconds */

  }

  /* Hangs here: poll; for 5 minutes! */

  rv = poll(pollset->pollset, pollset->nelts, timeout);

  (*num) = rv;

  if (rv < 0) {

      return apr_get_netos_error();

  }

  if (rv == 0) {

    // Errlog can also put a message here -- indeed, it timed out!!!

      return APR_TIMEUP;

  }

}

Is there log Message.

No

Can Errlog print an error msg?

Yes! Similar to 22030, Errlog can anticipate by checking the return value!

Two possible places, both of the pattern ‘log at failed libc/system call’:

1) When a non-blocking read or write fails(because of an actual blocking happens), we give some warning messages, since this failure is probably caused by a potential deadlock.

2) When the polling times out, the anticipated read or write event still doesn’t come.