[redis]github-145-1 Report

https://github.com/antirez/redis/issues/145

This report actually has two failures: the first failure is crash, the developer fixed it, but there is another bug introduced by the fix. This document discuss the first failure (to the point where the first patch is generated), while “github-145-2” discusses the 2nd failure.

Version: 2.4.1

1. Symptom

When the original master went down, the user tried to make a slave server into master using “SLAVEOF” command. During this process this slave server crashes. (only if the slaveof command is received in a particular timing).

1.1 Severity

Severe. Listed as a severe failure in the redis.io/topics/problems page.

1.2 Was there exception thrown?

Yes. Connection problem & crash.

1.2.1 Were there multiple exceptions?

Yes. On the slave server:

1. cannot connect to master:

[2818] 17 Oct 18:59:59 # I/O error writing to MASTER: Connection refused
2. Crash:

[2818] 17 Oct 19:00:00 # ======= Ooops! Redis 2.4.1 got signal: -11- =======

1.3 Scope of the failure

Entire cluster is affected. This failure is in particular problematic as the user had HA enabled (so a crashed Master can be quickly replaced by a slave), but the slave kept crashing.

2. How to reproduce this failure

2.0 Version

2.4.1

2.1 Configuration

One master server and one slave server. (In fact, the minimum configuration to reproduce this failure only needs on slave server, since the master cannot be connected anyway).

2.2 Reproduction procedure

This failure requires particular timing order (not controllable externally). I will first describe the real-world scenario (but won’t guarantee the reproduction), and then describe how to use trick to expose.

1. set-up a slave (S) and a master (M)

2. shutdown (M)

3. make S into a master:

SLAVEOF NO ONE

4. see S crash

Here is how to guarantee the reproduction: use gdb to control the timing.

1. Start the slave with gdb --- no need master anyway.

2. Set a breakpoint on “replicationCron” (bp1)

3. The first time it hits the bp1, send a SLAVEOF NO ONE command, also set a breakpoint to “slaveofCommand” (bp2) and “syncWithMaster” (bp3)

4. If bp2 occurs before bp3, we can see the failure, otherwise, repeat this process by:

SLAVEOF ORIGINALMASTER ORIGINALMASTER_PORT

SLAVEOF NO ONE

2.2.1 Timing order

Order (slaveofCommand received btw replicationCron and syncWithMaster) is important.

2.2.2 Events order externally controllable?

No.

2.3 Can the logs tell how to reproduce the failure?

Yes. See the graph above: all the key events are logged!

2.4 How many machines needed?

1 (client + slave)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Crash:

[2818] 17 Oct 19:00:00 # ======= Ooops! Redis 2.4.1 got signal: -11- =======
[2818] 17 Oct 19:00:00 # redis_version:2.4.1
.. ..

[2818] 17 Oct 19:00:00 # /lib/x86_64-linux-gnu/libc.so.6(inet_aton+0x35) [0x7f8a39989215]
[2818] 17 Oct 19:00:00 # /lib/x86_64-linux-gnu/libc.so.6(inet_aton+0x35) [0x7f8a39989215]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server() [0x40ca99]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(connectWithMaster+0x1a) [0x41c14a]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(replicationCron+0xe6) [0x41c416]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(serverCron+0x1c8) [0x40fa78]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(aeProcessEvents+0x1db) [0x40c45b]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(aeMain+0x2e) [0x40c70e]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(main+0x106) [0x40b606]
[2818] 17 Oct 19:00:00 # /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f8a398ba30d]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server() [0x40b72d]

3.2 Backward inference

From the log, we can observe the following events before the crash:

[2710] 17 Oct 16:56:25 * MASTER MODE enabled (user request)
[2710] 17 Oct 16:56:30 * Non blocking connect for SYNC fired the event.
[2710] 17 Oct 16:56:30 # I/O error writing to MASTER: Connection refused
[2710] 17 Oct 16:56:30 * Connecting to MASTER...
[2710] 17 Oct 16:56:30 # ======= Ooops! Redis 2.4.1 got signal: -11- =======

In fact, if we really infer the correlations among these events, we can understand the failure.

[2710] 17 Oct 16:56:25 * MASTER MODE enabled (user request)

void slaveofCommand(redisClient *c) {

if (!strcasecmp(c->argv[1]->ptr,"no") &&

!strcasecmp(c->argv[2]->ptr,"one")) {

if (server.masterhost) {

sdsfree(server.masterhost);

server.masterhost = NULL;

if (server.master) freeClient(server.master);

if (server.replstate == REDIS_REPL_TRANSFER)

replicationAbortSyncTransfer();

server.replstate = REDIS_REPL_NONE;

redisLog(REDIS_NOTICE,"MASTER MODE enabled (user request)");

}

} else .. ..

}

--- We know that slaveof no one is received. In particular, we know server.masterhost is set to NULL - eventually lead to crash!

----------------------------------------------------------------------

[2710] 17 Oct 16:56:30 * Non blocking connect for SYNC fired the event.
[2710] 17 Oct 16:56:30 # I/O error writing to MASTER: Connection refused

void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {

redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");

/* Issue the SYNC command */

if (syncWrite(fd,"SYNC \r\n",7,server.repl_syncio_timeout) == -1) {

redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",

strerror(errno));

goto error;

}

error:

server.replstate = REDIS_REPL_CONNECT;

--- Must connect with MASTER again!

close(fd);

return;

}

-------------------------------------------------------------------------------

[2710] 17 Oct 16:56:30 * Connecting to MASTER...
[2710] 17 Oct 16:56:30 # ======= Ooops! Redis 2.4.1 got signal: -11- =======

void replicationCron(void) {

/* Check if we should connect to a MASTER */

if (server.replstate == REDIS_REPL_CONNECT) {

redisLog(REDIS_NOTICE,"Connecting to MASTER...");

if (connectWithMaster() == REDIS_OK) {

---> Crashed because server.masterhost is set to NULL above!!

}

Note: The timing of slaveofCommand is important: it has to occur before the syncWithMaster is called, so that making sure the next iteration for event handling, in replicationCron, server.replstate equals to “REDIS_REPL_CONNCT”.

4. Root cause

In syncWithMaster, it did not handle the case where server.replstate==REDIS_REPL_NONE --- if that’s the case (as in this scenario), it shouldn’t even try to connect with Master.

4.1 Category:

Incorrect error handling (missed error subtype).

5. Fix

5.1 How?

Return from syncWithMaster ASAP if the machine is no longer a slave (replstate == REDIS_REPL_NONE):

src/replication.c View file @ 98f5abb

@@ -376,6 +376,13 @@ void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {

REDIS_NOTUSED(privdata);

REDIS_NOTUSED(mask);

+ /* If this event fired after the user turned the instance into a master

+ * with SLAVEOF NO ONE we must just return ASAP. */

+ if (server.replstate == REDIS_REPL_NONE) {

+ close(fd);

+ return;

+ }

redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");

/* This event should only be triggered once since it is used to have a

* non-blocking connect(2) to the master. It has been triggered when this