[redis]github-267 Report
https://github.com/antirez/redis/issues/267
Redis server simply hangs (or very slow).
Severe. Marked as severe bug in the problems page.
Yes.
Slave:
[5894] 23 Dec 14:00:46 # MASTER aborted replication with an error: ERR operation not permitted
[5894] 23 Dec 14:01:02 # I/O error reading bulk count from MASTER: Connection timed out
Yes.
The entire server hangs.
2.4.5
Master and slave.
Master’s configuration file:
“requirepass foobared”
Slave’s configuration file:
port 6380
slaveof 127.0.0.1 6379
masterauth foobared
“The way to reproduce the problem with latest stable version:
1. start a master and set a password for it
#./redis-server
config set requirepass foobar
2.start a slave and sync from master:
#./redis-server -
port 6380
slaveof 127.0.0.1 6379
3.Attach 1100 idle clients to the slave:
#./redis-benchmark -p 6380 -c 1100 -I
4.observe the hang.“
Send > 1024 connections will be fine. No particular order requirement.
Yes.
Yes.
[15796] 04 Aug 18:20:49 - Accepted 127.0.0.1:42151
[15796] 04 Aug 18:20:52 - 1 clients connected (0 slaves), 798576 bytes in use
[15796] 04 Aug 18:20:57 - 1 clients connected (0 slaves), 798576 bytes in use
[15796] 04 Aug 18:21:02 - 1 clients connected (0 slaves), 798576 bytes in use
[15796] 04 Aug 18:21:07 - 1 clients connected (0 slaves), 798576 bytes in use
[15796] 04 Aug 18:21:12 - 1 clients connected (0 slaves), 798576 bytes in use
[15796] 04 Aug 18:21:12 - Accepted 127.0.0.1:42152
1.
2 nodes, 1 master 1 slave.
[5894] 23 Dec 14:01:02 # I/O error reading bulk count from MASTER: Connection timed out
if (syncReadLine(fd,buf,1024,3600) == -1) {
close(fd);
redisLog(REDIS_WARNING,"I/O error reading bulk count from MASTER: %s",
strerror(errno));
return REDIS_ERR;
}
Eventually, we can see that “syncReadLine’s error return is from aeWait:
/* Wait for millseconds until the given file descriptor becomes
* writable/readable/exception */
int aeWait(int fd, int mask, long long milliseconds) {
struct timeval tv;
fd_set rfds, wfds, efds;
int retmask = 0, retval;
tv.tv_sec = milliseconds/1000;
tv.tv_usec = (milliseconds%1000)*1000;
FD_ZERO(&rfds);
FD_ZERO(&wfds);
FD_ZERO(&efds);
if (mask & AE_READABLE) FD_SET(fd,&rfds);
if (mask & AE_WRITABLE) FD_SET(fd,&wfds);
if ((retval = select(fd+1, &rfds, &wfds, &efds, &tv)) > 0) {
if (FD_ISSET(fd,&rfds)) retmask |= AE_READABLE;
if (FD_ISSET(fd,&wfds)) retmask |= AE_WRITABLE;
return retmask;
} else {
return retval;
--- error return!
}
}
Also, from the log we can see that there are too many client connections (resulting in too many connections from slave to master). This can at least tell us how to reproduce the failure.
Another step of inference (which is very hard with only the log & source), is that FD_SET’s behavior is undefined if the size of the file descriptor set is too large (larger than 1024 on linux). And apparently, in this case redis didn’t handle it. In fact in gdb we can observe that once the number of fd is greater than 1024, the side-effect of FD_SET in the code above is pretty dramatic:
It will set tv.tv_sec to a large number!!!
Therefore, the select will block for a long time since tv.tv_sec is a large number!!!
“Debugging using gdb shows that the FD_SET macro modified tv accidentally in line 371 and 372.
If we move line 365 and 366 after line 372, then it works as expected:
#strace -T ./redis-server
...
select(1021, [1020], [], [], {0, 10000}) = 0 (Timeout) < 0.010135>
select(1022, [1021], [], [], {0, 10000}) = 0 (Timeout) < 0.010162>
select(1023, [1022], [], [], {0, 10000}) = 0 (Timeout) < 0.010146>
select(1024, [1023], [], [], {0, 10000}) = 0 (Timeout) < 0.010123>
select(1025, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010143>
select(1026, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010137>
select(1027, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010388>
select(1028, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010151>
select(1029, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010206>
select(1030, [], [], [], {0, 10000}) = 0 (Timeout) < 0.010160>
…”
When the number of file descriptor set is > 1024, the FD_SET’s behavior (undefined) will accidentally set timeout parameter to select, causing select to block.
Incorrect error handling. But to test it, we need “environment coverage” on error handler --- the undefined behavior only appears on certain OS versions.
Using poll instead of select and also handle too many client connections.