[redis]github-267 Report

https://github.com/antirez/redis/issues/267

1. Symptom

Redis server simply hangs (or becomes very slow).

1.1 Severity

Severe. Marked as a severe bug on the problems page.

1.2 Was there exception thrown?

Yes.

Slave:

 [5894] 23 Dec 14:00:46 # MASTER aborted replication with an error: ERR operation not permitted

[5894] 23 Dec 14:01:02 # I/O error reading bulk count from MASTER: Connection timed out

1.2.1 Were there multiple exceptions?

Yes.

1.3 Scope of the failure

The entire server hangs.

2. How to reproduce this failure

2.0 Version

2.4.5

2.1 Configuration

Master and slave.

Master’s configuration file:

requirepass foobared

Slave’s configuration file:

port 6380

slaveof 127.0.0.1 6379

masterauth foobared

2.2 Reproduction procedure

The way to reproduce the problem with the latest stable version:

1. Start a master and set a password for it:

#./redis-server

config set requirepass foobar

2. Start a slave and sync it from the master:

#./redis-server -

port 6380

slaveof 127.0.0.1 6379

3. Attach 1100 idle clients to the slave:

#./redis-benchmark -p 6380 -c 1100 -I

4. Observe the hang.

2.2.1 Timing order

The failure requires sending more than 1024 connections; with fewer connections everything is fine. No particular order requirement.

2.2.2 Deterministic?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes.

[15796] 04 Aug 18:20:49 - Accepted 127.0.0.1:42151

[15796] 04 Aug 18:20:52 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:20:57 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:02 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:07 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:12 - 1 clients connected (0 slaves), 798576 bytes in use

[15796] 04 Aug 18:21:12 - Accepted 127.0.0.1:42152

2.4 How many machines needed?

1 machine, running 2 nodes: 1 master, 1 slave.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

[5894] 23 Dec 14:01:02 # I/O error reading bulk count from MASTER: Connection timed out

 if (syncReadLine(fd,buf,1024,3600) == -1) {
     close(fd);
     redisLog(REDIS_WARNING,"I/O error reading bulk count from MASTER: %s",
         strerror(errno));
     return REDIS_ERR;
 }

3.2 Backward inference

Tracing backward, we can see that syncReadLine's error return comes from aeWait:

/* Wait for milliseconds until the given file descriptor becomes
 * writable/readable/exception */
int aeWait(int fd, int mask, long long milliseconds) {
    struct timeval tv;
    fd_set rfds, wfds, efds;
    int retmask = 0, retval;

    tv.tv_sec = milliseconds/1000;
    tv.tv_usec = (milliseconds%1000)*1000;
    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    FD_ZERO(&efds);
    if (mask & AE_READABLE) FD_SET(fd,&rfds);
    if (mask & AE_WRITABLE) FD_SET(fd,&wfds);
    if ((retval = select(fd+1, &rfds, &wfds, &efds, &tv)) > 0) {
        if (FD_ISSET(fd,&rfds)) retmask |= AE_READABLE;
        if (FD_ISSET(fd,&wfds)) retmask |= AE_WRITABLE;
        return retmask;
    } else {
        return retval;    /* <-- error return! */
    }
}

Also, from the log we can see that there are too many client connections (resulting in too many connections from slave to master). This can at least tell us how to reproduce the failure.

Another step of inference (which is very hard with only the log and source) is that FD_SET's behavior is undefined when the file descriptor value is too large (>= 1024, i.e. FD_SETSIZE, on Linux). Apparently, Redis didn't handle this case. In fact, in gdb we can observe that once an fd value exceeds 1024, the side effect of FD_SET in the code above is dramatic:

It sets tv.tv_sec to a large number!

Therefore, select will block for a very long time, since tv.tv_sec is huge!

Debugging with gdb shows that the FD_SET macro accidentally modified tv at lines 371 and 372.

If we move lines 365 and 366 after line 372, then it works as expected:

#strace -T ./redis-server

...

select(1021, [1020], [], [], {0, 10000}) = 0 (Timeout) < 0.010135>

select(1022, [1021], [], [], {0, 10000}) = 0 (Timeout) < 0.010162>

select(1023, [1022], [], [], {0, 10000}) = 0 (Timeout) < 0.010146>

select(1024, [1023], [], [], {0, 10000}) = 0 (Timeout) < 0.010123>

select(1025, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010143>

select(1026, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010137>

select(1027, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010388>

select(1028, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010151>

select(1029, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010206>

select(1030, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010160>

…

4. Root cause

When a file descriptor value exceeds 1024 (FD_SETSIZE), FD_SET's behavior is undefined; here it accidentally overwrote the timeout parameter passed to select, causing select to block for a long time.

4.1 Category:

Incorrect error handling. But to test it, we need "environment coverage" of the error handler: the undefined behavior only appears on certain OS versions.

5. Fix

5.1 How?

Use poll instead of select, and also explicitly handle the case of too many client connections.
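The poll-based rewrite of aeWait removes the problem at its root, since poll(2) takes the fd directly and has no FD_SETSIZE bitmap to overflow. A sketch of that approach, with the AE_* mask constants inlined here (in Redis they come from ae.h):

```c
#include <string.h>
#include <poll.h>

#define AE_READABLE 1   /* values assumed to match Redis ae.h */
#define AE_WRITABLE 2

/* Wait up to `milliseconds` for fd to become readable/writable.
 * Unlike the select version, no fd_set is built, so large fd
 * values cannot corrupt neighboring stack variables. */
int aeWait(int fd, int mask, long long milliseconds) {
    struct pollfd pfd;
    int retmask = 0, retval;

    memset(&pfd, 0, sizeof(pfd));
    pfd.fd = fd;
    if (mask & AE_READABLE) pfd.events |= POLLIN;
    if (mask & AE_WRITABLE) pfd.events |= POLLOUT;

    if ((retval = poll(&pfd, 1, (int)milliseconds)) == 1) {
        if (pfd.revents & POLLIN)  retmask |= AE_READABLE;
        if (pfd.revents & POLLOUT) retmask |= AE_WRITABLE;
        return retmask;
    } else {
        return retval;  /* 0 = timeout, -1 = error */
    }
}
```

The timeout is passed to poll as plain milliseconds, so there is no struct timeval on the stack to clobber in the first place.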