[redis]github-607 Report

1. Symptom

The entire cluster crashes.

https://github.com/antirez/redis/issues/607

1.1 Severity

Severe (catastrophic since the entire cluster will crash).

1.2 Was there exception thrown?

Yes. The server crashes with signal 11 (see the crash report in 3.1).

1.2.1 Were there multiple exceptions?

Yes. Before the crash, a command returned an error:

redis 127.0.0.1:6379> MGET foo foobar
(error) ERR Multi keys request invalid in cluster

This error reply is where things start to go wrong.

1.3 Scope of the failure

Each occurrence brings down a single node in the cluster. However, a client that has not yet received its result will repeat the same commands, so the sequence eventually brings down the entire cluster. Catastrophic.

2. How to reproduce this failure

2.0 Version

2.9.7

2.1 Configuration

Standard

2.2 Reproduction procedure

0. start a cluster with three nodes

0. SET foobar test

1. redis 127.0.0.1:6379> MGET foo foobar
(error) ERR Multi keys request invalid in cluster
2. redis 127.0.0.1:6379> GETRANGE foobar 0 1
(error) MOVED 1650 127.0.0.1:6380
3. redis 127.0.0.1:6379> GETRANGE foobar 0 1
(error) ERR unknown command ''
4. redis 127.0.0.1:6379> GETRANGE foobar 0 1
Could not connect to Redis at 127.0.0.1:6379: Connection refused

It is also possible that the first GETRANGE command fails immediately after the “MGET” command.

2.2.1 Timing order

The commands must be issued in the order listed above.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes. The client log shows the command sequence that leads to the crash.

2.4 How many nodes needed?

2 to start the cluster

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The issue includes the server's crash report:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
[21541] 29 Jul 10:52:23.419 #     Redis 2.9.7 crashed by signal: 11
[21541] 29 Jul 10:52:23.419 #     Failed assertion: <no assertion failed> (<no file>:0)
[21541] 29 Jul 10:52:23.419 # --- STACK TRACE
./redis-server(logStackTrace+0x71)[0x8083ad1]
./redis-server(decrRefCount+0x8)[0x8067e38]
[0x52440c]
./redis-server(decrRefCount+0x8)[0x8067e38]
./redis-server[0x8064121]
./redis-server(resetClient+0xf)[0x8064cff]
./redis-server(processInputBuffer+0x53)[0x8066083]
./redis-server(readQueryFromClient+0x9d)[0x806618d]
./redis-server(aeProcessEvents+0x140)[0x8057ac0]
./redis-server(aeMain+0x2c)[0x8057dbc]
./redis-server(main+0x299)[0x8056c99]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x129113]
./redis-server[0x8056e09]

3.2 Backward inference

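The problem is that the first time the command failed, the error path did not handle it correctly and issued an extra “decrRefCount”. As the code below shows (the line marked with “-”), the extra decrement is applied to firstkey, the first key object of the command, not to the client object itself. The crash surfaces later, when decrRefCount is called again from resetClient (see the stack trace in 3.1).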

                 /* If it is not the first key, make sure it is exactly
                  * the same key as the first we saw. */
                 if (!equalStringObjects(firstkey,margv[keyindex[j]])) {
                  ← Error handling code
-                    decrRefCount(firstkey);
                     getKeysFreeResult(keyindex);
                     return NULL;
                 }
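
The effect of the extra decrement can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Redis code: toy_obj, toy_decr, reject_multi_key, reset_client_argv and simulate_request are hypothetical names, and the model assumes that firstkey is a borrowed pointer into the client's argument vector (no reference was taken for it) and that resetting the client releases each argument exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted object (hypothetical, not the Redis robj). */
typedef struct {
    int refcount;
    int freed;      /* set once the count reaches zero */
} toy_obj;

/* Release one reference. Returns 0 on success, or -1 if the object was
 * already freed (in real C code this would be a use-after-free). */
static int toy_decr(toy_obj *o) {
    if (o->freed) return -1;
    if (--o->refcount == 0) o->freed = 1;
    return 0;
}

/* Models the error path of the multi-key check. firstkey is borrowed from
 * argv, so the buggy variant releases a reference it never took. */
static void reject_multi_key(toy_obj *firstkey, int buggy) {
    if (buggy) toy_decr(firstkey);
}

/* Models the client reset: every argv entry is released exactly once.
 * Returns how many entries had already been freed. */
static int reset_client_argv(toy_obj **argv, size_t argc) {
    int already_freed = 0;
    assert(argv != NULL);
    for (size_t i = 0; i < argc; i++)
        if (toy_decr(argv[i]) != 0) already_freed++;
    return already_freed;
}

/* Simulates one rejected "MGET foo foobar" request and returns the number
 * of double releases seen when the client is reset. */
int simulate_request(int buggy) {
    toy_obj cmd = {1, 0}, foo = {1, 0}, foobar = {1, 0};
    toy_obj *argv[] = {&cmd, &foo, &foobar};
    reject_multi_key(argv[1], buggy);   /* "foo" is the first key */
    return reset_client_argv(argv, 3);
}
```

In this model, simulate_request(1) reports one double release (the first key), while simulate_request(0), which omits the extra decrement, reports none. This matches the shape of the stack trace, where the failure appears in decrRefCount under resetClient rather than at the point of the original error.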

4. Root cause

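When MGET first returns an error, the server's error-handling path incorrectly decrements the reference count of the first key object, so a later release of the same object crashes the server.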

4.1 Category:

Incorrect exception handling.