[redis]github-518 Report

1. Symptom

CPU usage rises to 100% on slaves when the connection to the master is down.

1.1 Severity

Severe. Listed as a severe problem in redis.io/topics/problems

1.2 Was there exception thrown?

Yes. The slave logs that the master cannot be reached:

[12727] 23 May 11:17:11 # I/O error writing to MASTER: Connection timed out

1.2.1 Were there multiple exceptions?

No.

1.3 Scope of the failure

Affects the performance of the entire DS.

2. How to reproduce this failure

2.0 Version

2.4.11

2.1 Configuration

Slave + master

2.2 Reproduction procedure

1. Start the slave.

2. Stop the master.

3. Observe the slave fall into an infinite loop.

2.2.1 Timing order

The timing order is important: the slave must already be running when the master becomes unreachable.

2.2.2 Events order externally controllable?

No. (OS bug)

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

1 (2 nodes, slave + master)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The slave logs the error message shown in 1.2. The user also attached gdb to the slave process and obtained this backtrace:

(gdb) bt

#0 0x000000302afbd1e2 in poll () from /lib64/tls/libc.so.6

#1 0x000000000040c6da in aeWait (fd=Variable "fd" is not available.

) at ae.c:371

#2 0x0000000000431026 in syncWrite (fd=7, ptr=0x446da2 "SYNC \r\n", size=7, timeout=6) at syncio.c:47

#3 0x000000000041bf0a in syncWithMaster (el=Variable "el" is not available.

) at replication.c:406

#4 0x000000000040c4d3 in aeProcessEvents (eventLoop=0x7f0e12c47000, flags=Variable "flags" is not available.

) at ae.c:344

#5 0x000000000040c755 in aeMain (eventLoop=0x7f0e12c47000) at ae.c:385

#6 0x0000000000411255 in main (argc=Variable "argc" is not available.

) at redis.c:1797

3.2 Backward inference

Since the process is stuck in poll, called from aeWait, we can locate the code as shown in the graph. Because the master is down, poll cannot report the socket as ready in the normal way. The bug is that aeWait does not handle the error conditions poll reports in revents:

POLLERR and POLLHUP

In that case poll returns immediately, aeWait reports neither readable nor writable, and the caller retries at once. Consequently, this results in an infinite loop in “syncWrite”.
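The spin can be sketched in a few lines of C. This is an illustrative reconstruction under stated assumptions, not the Redis source: wait_unfixed mirrors the pre-fix logic visible in the diff in 5.1, count_spins is a hypothetical stand-in for the retry loop in syncWrite, and make_hung_up_fd uses a pipe whose write end has been closed as a stand-in for the dead master connection (on Linux, polling its read end reports only POLLHUP).

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define AE_READABLE 1
#define AE_WRITABLE 2

/* Pre-fix logic (reconstructed from the diff context): only POLLIN and
 * POLLOUT are mapped, so a descriptor that reports just POLLERR or
 * POLLHUP yields an empty mask even though poll() returned 1. */
static int wait_unfixed(int fd, int mask, long long milliseconds) {
    struct pollfd pfd;
    int retmask = 0, retval;

    pfd.fd = fd;
    pfd.events = 0;
    pfd.revents = 0;
    if (mask & AE_READABLE) pfd.events |= POLLIN;
    if (mask & AE_WRITABLE) pfd.events |= POLLOUT;

    if ((retval = poll(&pfd, 1, (int)milliseconds)) == 1) {
        if (pfd.revents & POLLIN) retmask |= AE_READABLE;
        if (pfd.revents & POLLOUT) retmask |= AE_WRITABLE;
        return retmask;
    }
    return retval;
}

/* Hypothetical stand-in for the caller's retry loop: counts how often
 * the wait comes back without AE_WRITABLE, giving up after max_tries. */
static int count_spins(int fd, int timeout_ms, int max_tries) {
    int spins = 0;

    assert(max_tries >= 0);
    while (spins < max_tries) {
        if (wait_unfixed(fd, AE_WRITABLE, timeout_ms) & AE_WRITABLE) break;
        spins++;
    }
    return spins;
}

/* Returns the read end of a pipe whose write end is already closed,
 * or -1 on failure. Stand-in for a connection whose peer is gone. */
static int make_hung_up_fd(void) {
    int fds[2];

    if (pipe(fds) == -1) return -1;
    close(fds[1]);
    return fds[0];
}
```

Calling count_spins(make_hung_up_fd(), 50, 100) should use up every try almost instantly rather than waiting 50 ms per iteration, because poll returns at once with the hang-up condition set. That is the busy loop that shows up as 100% CPU.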

4. Root cause

aeWait did not handle the error conditions reported by “poll”, that is, when revents is set to POLLERR or POLLHUP.

4.1 Category:

Incorrect error handling.

5. Fix

5.1 How?

src/ae.c @ bd9dd56

@@ -371,6 +371,8 @@ int aeWait(int fd, int mask, long long milliseconds) {

     if ((retval = poll(&pfd, 1, milliseconds))== 1) {

         if (pfd.revents & POLLIN) retmask |= AE_READABLE;

         if (pfd.revents & POLLOUT) retmask |= AE_WRITABLE;

+        if (pfd.revents & POLLERR) retmask |= AE_WRITABLE;

+        if (pfd.revents & POLLHUP) retmask |= AE_WRITABLE;

         return retmask;

     } else {

         return retval;
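To see the effect of the two added lines in isolation, the sketch below reimplements the patched wait as a standalone function. The names wait_fixed, write_with_wait and open_hung_up_pipe are hypothetical and not part of Redis. As in the earlier reasoning, a pipe with a closed write end stands in for the dead master connection, so the errno produced by write here will differ from what a real socket reports. The idea, as far as the patch suggests, is that reporting an error or hang-up as writable lets the caller go on to write, which then fails and surfaces the error instead of spinning.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define AE_READABLE 1
#define AE_WRITABLE 2

/* Standalone copy of the patched logic: POLLERR and POLLHUP are
 * reported as writable so the caller attempts I/O and sees the error. */
static int wait_fixed(int fd, int mask, long long milliseconds) {
    struct pollfd pfd;
    int retmask = 0, retval;

    pfd.fd = fd;
    pfd.events = 0;
    pfd.revents = 0;
    if (mask & AE_READABLE) pfd.events |= POLLIN;
    if (mask & AE_WRITABLE) pfd.events |= POLLOUT;

    if ((retval = poll(&pfd, 1, (int)milliseconds)) == 1) {
        if (pfd.revents & POLLIN) retmask |= AE_READABLE;
        if (pfd.revents & POLLOUT) retmask |= AE_WRITABLE;
        if (pfd.revents & POLLERR) retmask |= AE_WRITABLE;
        if (pfd.revents & POLLHUP) retmask |= AE_WRITABLE;
        return retmask;
    }
    return retval;
}

/* Hypothetical caller in the style of a synchronous write loop.
 * Returns 0 when everything was written, -1 when write() failed,
 * and -2 when max_tries was exhausted without finishing. */
static int write_with_wait(int fd, const char *ptr, size_t size, int max_tries) {
    assert(ptr != NULL);
    while (size > 0 && max_tries-- > 0) {
        if (wait_fixed(fd, AE_WRITABLE, 50) & AE_WRITABLE) {
            ssize_t nwritten = write(fd, ptr, size);
            if (nwritten == -1) return -1; /* error surfaces here */
            ptr += nwritten;
            size -= (size_t)nwritten;
        }
    }
    return size == 0 ? 0 : -2;
}

/* Returns the read end of a pipe whose write end is already closed,
 * or -1 on failure. Stand-in for a connection whose peer is gone. */
static int open_hung_up_pipe(void) {
    int fds[2];

    if (pipe(fds) == -1) return -1;
    close(fds[1]);
    return fds[0];
}
```

With the patched wait, write_with_wait returns -1 on its first iteration for such a descriptor, so the caller can report the failure and reconnect rather than loop.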