[redis]github-518 Report
CPU usage goes to 100% on slaves when the connection to the master is down.
Severe. Listed as a severe issue at redis.io/topics/problems.
Yes. Master cannot be reached:
[12727] 23 May 11:17:11 # I/O error writing to MASTER: Connection timed out
No.
Affects the performance of the entire DS.
2.4.11
Slave + master
1. Start the slave;
2. Stop the master;
3. Observe the slave fall into an infinite loop.
The timing order is important.
No. (OS bug)
Yes.
1 (2 nodes, slave + master)
There is the error message above. The user also attached gdb and captured the following backtrace:
(gdb) bt
#0 0x000000302afbd1e2 in poll () from /lib64/tls/libc.so.6
#1 0x000000000040c6da in aeWait (fd=Variable "fd" is not available.
) at ae.c:371
#2 0x0000000000431026 in syncWrite (fd=7, ptr=0x446da2 "SYNC \r\n", size=7, timeout=6) at syncio.c:47
#3 0x000000000041bf0a in syncWithMaster (el=Variable "el" is not available.
) at replication.c:406
#4 0x000000000040c4d3 in aeProcessEvents (eventLoop=0x7f0e12c47000, flags=Variable "flags" is not available.
) at ae.c:344
#5 0x000000000040c755 in aeMain (eventLoop=0x7f0e12c47000) at ae.c:385
#6 0x0000000000411255 in main (argc=Variable "argc" is not available.
) at redis.c:1797
Since the slave hangs in poll() inside aeWait(), the backtrace points us to the relevant code. Because the master is down, poll() does not report the socket as writable; instead it returns with revents set to:
POLLERR and POLLHUP
Consequently, aeWait() returns no readiness bits, and the caller spins in an infinite loop inside syncWrite(), as sketched below.
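To see why this turns into a busy loop, here is a minimal sketch of the failure mode. It is simplified for illustration and is not verbatim Redis code; wait_writable() and write_all() are hypothetical stand-ins for aeWait() and syncWrite():

/* Sketch: a wait helper that ignores POLLERR/POLLHUP makes its caller
 * spin at 100% CPU on a dead connection. */
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

static int wait_writable(int fd, int ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, ms) == 1) {
        /* Buggy behaviour: only POLLOUT is reported back.  On a broken
         * connection poll() returns immediately with POLLERR/POLLHUP set,
         * but those bits are dropped, so the caller learns nothing. */
        return (pfd.revents & POLLOUT) != 0;
    }
    return 0;
}

static ssize_t write_all(int fd, const char *ptr, size_t size) {
    size_t left = size;
    while (left) {
        /* On a dead fd this is never true, yet poll() above does not
         * block either, so the loop spins and burns 100% CPU. */
        if (wait_writable(fd, 1000)) {
            ssize_t n = write(fd, ptr, left);
            if (n == -1) return -1;
            ptr += n;
            left -= n;
        }
    }
    return (ssize_t)size;
}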
aeWait didn’t handle the error return of “poll” -- when revent set to POLLERR.
Incorrect error handling.
src/ae.c @ bd9dd56
@@ -371,6 +371,8 @@ int aeWait(int fd, int mask, long long milliseconds) {
     if ((retval = poll(&pfd, 1, milliseconds))== 1) {
         if (pfd.revents & POLLIN) retmask |= AE_READABLE;
         if (pfd.revents & POLLOUT) retmask |= AE_WRITABLE;
+        if (pfd.revents & POLLERR) retmask |= AE_WRITABLE;
+        if (pfd.revents & POLLHUP) retmask |= AE_WRITABLE;
         return retmask;
     } else {
         return retval;
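With this fix, a dead connection makes aeWait() report the fd as writable, so the caller goes on to attempt the write (or read) and gets the real error back from the syscall; syncWrite() then returns -1 and the sync attempt aborts instead of spinning.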