[redis]gcode-648 Report

http://code.google.com/p/redis/issues/detail?id=648

1. Symptom

Redis server hangs; this is a performance/availability failure.

1.1 Severity

Severe. This will hang the server and thus all the clients.

1.2 Was there exception thrown?

Yes (epoll_wait returns an error, but it is silently ignored).

1.2.1 Were there multiple exceptions?

No.

1.3 Scope of the failure?

Entire server is down (all clients are affected).

2. How to reproduce this failure

2.0 Version

2.2.11

2.1 Configuration

Standard.

2.2 Reproduction procedure

Create > 10000 client connections.

2.2.1 Timing order

No particular order requirements.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes.

[15796] 04 Aug 18:20:49 - Accepted 127.0.0.1:42151

[15796] 04 Aug 18:20:52 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:20:57 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:02 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:07 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:12 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:12 - Accepted 127.0.0.1:42152

[15796] 04 Aug 18:21:17 - 2 clients connected (0 slaves), 806672 bytes in use

[15796] 04 Aug 18:21:22 - Client closed connection

[15796] 04 Aug 18:21:22 - 1 clients connected (0 slaves), 798576 bytes in use

2.4 How many machines needed?

1. Thousands of client connections are needed, but they can all originate from the same machine.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

No exception surfaces in the logs: epoll_wait returns an error internally, but it is never reported. A common practice for diagnosing hangs is to take a stack trace, and indeed multiple users did:

(gdb) bt
#0  0x00007fca4e596d03 in epoll_wait () from /lib/libc.so.6
#1  0x000000000040d503 in aeApiPoll (eventLoop=0x7fca4d847000, flags=<value optimized out>) at ae_epoll.c:70
#2  aeProcessEvents (eventLoop=0x7fca4d847000, flags=<value optimized out>) at ae.c:330
#3  0x000000000040d83e in aeMain (eventLoop=0x7fca4d847000) at ae.c:387
#4  0x0000000000412f37 in main (argc=2, argv=<value optimized out>) at redis.c:1734

Further evidence: an unusually large number of client connections. This can be observed via:

1) The event log, which periodically prints the number of connections (see above).

2) redis-cli info (notice the number of connected clients):

https://gist.github.com/findchris/2424421

3) netstat.

3.2 Backward inference

Once we locate the call site of epoll_wait (the event-loop polling code in ae_epoll.c), we can see that its error return is not handled.

static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    retval = epoll_wait(state->epfd,state->events,AE_SETSIZE,
            tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
    --- epoll_wait returned -1, but not handled!
    if (retval > 0) {
        .. ..
    }
    return numevents;
}

int aeProcessEvents(..) {
    numevents = aeApiPoll(eventLoop, tvp);
    .. ..
}

void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        --- infinite loop!
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS);
    }
}

4. Root cause

epoll_wait's error return is not handled at all: when it returns -1, aeApiPoll falls through and returns without processing anything, aeProcessEvents does nothing, and aeMain immediately calls it again, so the event loop spins forever without serving any client.

4.1 Category:

Incorrect error handling (not handled at all).

5. Fix

5.1 How?

Handle the error by closing client connections.

This is the new behavior of Redis 2.6:

1) At startup it checks the current maxclients setting (10000 by default), then tries to set the process file descriptor limit to maxclients+32 (at most 32 extra descriptors are used internally).

2) If it cannot obtain as many file descriptors as maxclients asks for, it reads the actual limit for the process and sets the maxclients directive to that limit minus 32.

3) So maxclients is now always active, AND matches the actual process limit.

4) If maxclients is reached, a newly connected client receives an error and its connection is closed.

Cheers,
Salvatore