[redis]github-145-1 Report
https://github.com/antirez/redis/issues/145
This report actually has two failures: the first failure is crash, the developer fixed it, but there is another bug introduced by the fix. This document discuss the first failure (to the point where the first patch is generated), while “github-145-2” discusses the 2nd failure.
Version: 2.4.1
When the original master went down, the user tried to make a slave server into master using “SLAVEOF” command. During this process this slave server crashes. (only if the slaveof command is received in a particular timing).
Severe. Listed as a severe failure in the redis.io/topics/problems page.
Yes. Connection problem & crash.
Yes. On the slave server:
1. cannot connect to master:
[2818] 17 Oct 18:59:59 # I/O error writing to MASTER: Connection refused
2. Crash:
[2818] 17 Oct 19:00:00 # ======= Ooops! Redis 2.4.1 got signal: -11- =======
Entire cluster is affected. This failure is in particular problematic as the user had HA enabled (so a crashed Master can be quickly replaced by a slave), but the slave kept crashing.
2.4.1
One master server and one slave server. (In fact, the minimum configuration to reproduce this failure only needs on slave server, since the master cannot be connected anyway).
This failure requires particular timing order (not controllable externally). I will first describe the real-world scenario (but won’t guarantee the reproduction), and then describe how to use trick to expose.
1. set-up a slave (S) and a master (M)
2. shutdown (M)
3. make S into a master:
SLAVEOF NO ONE
4. see S crash
Here is how to guarantee the reproduction: use gdb to control the timing.
1. Start the slave with gdb --- no need master anyway.
2. Set a breakpoint on “replicationCron” (bp1)
3. The first time it hits the bp1, send a SLAVEOF NO ONE command, also set a breakpoint to “slaveofCommand” (bp2) and “syncWithMaster” (bp3)
4. If bp2 occurs before bp3, we can see the failure, otherwise, repeat this process by:
SLAVEOF ORIGINALMASTER ORIGINALMASTER_PORT
SLAVEOF NO ONE
Order (slaveofCommand received btw replicationCron and syncWithMaster) is important.
No.
Yes. See the graph above: all the key events are logged!
1 (client + slave)
Crash:
[2818] 17 Oct 19:00:00 # ======= Ooops! Redis 2.4.1 got signal: -11- =======
[2818] 17 Oct 19:00:00 # redis_version:2.4.1
.. ..
[2818] 17 Oct 19:00:00 # /lib/x86_64-linux-gnu/libc.so.6(inet_aton+0x35) [0x7f8a39989215]
[2818] 17 Oct 19:00:00 # /lib/x86_64-linux-gnu/libc.so.6(inet_aton+0x35) [0x7f8a39989215]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server() [0x40ca99]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(connectWithMaster+0x1a) [0x41c14a]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(replicationCron+0xe6) [0x41c416]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(serverCron+0x1c8) [0x40fa78]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(aeProcessEvents+0x1db) [0x40c45b]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(aeMain+0x2e) [0x40c70e]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server(main+0x106) [0x40b606]
[2818] 17 Oct 19:00:00 # /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f8a398ba30d]
[2818] 17 Oct 19:00:00 # /usr/local/bin/redis-server() [0x40b72d]
From the log, we can observe the following events before the crash:
[2710] 17 Oct 16:56:25 * MASTER MODE enabled (user request)
[2710] 17 Oct 16:56:30 * Non blocking connect for SYNC fired the event.
[2710] 17 Oct 16:56:30 # I/O error writing to MASTER: Connection refused
[2710] 17 Oct 16:56:30 * Connecting to MASTER...
[2710] 17 Oct 16:56:30 # ======= Ooops! Redis 2.4.1 got signal: -11- =======
In fact, if we really infer the correlations among these events, we can understand the failure.
void slaveofCommand(redisClient *c) {
if (!strcasecmp(c->argv[1]->ptr,"no") &&
!strcasecmp(c->argv[2]->ptr,"one")) {
if (server.masterhost) {
sdsfree(server.masterhost);
server.masterhost = NULL;
if (server.master) freeClient(server.master);
if (server.replstate == REDIS_REPL_TRANSFER)
replicationAbortSyncTransfer();
server.replstate = REDIS_REPL_NONE;
redisLog(REDIS_NOTICE,"MASTER MODE enabled (user request)");
}
} else .. ..
}
--- We know that slaveof no one is received. In particular, we know server.masterhost is set to NULL - eventually lead to crash!
----------------------------------------------------------------------
[2710] 17 Oct 16:56:30 * Non blocking connect for SYNC fired the event.
[2710] 17 Oct 16:56:30 # I/O error writing to MASTER: Connection refused
void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");
/* Issue the SYNC command */
if (syncWrite(fd,"SYNC \r\n",7,server.repl_syncio_timeout) == -1) {
redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",
strerror(errno));
goto error;
}
error:
server.replstate = REDIS_REPL_CONNECT;
--- Must connect with MASTER again!
close(fd);
return;
}
-------------------------------------------------------------------------------
[2710] 17 Oct 16:56:30 * Connecting to MASTER...
[2710] 17 Oct 16:56:30 # ======= Ooops! Redis 2.4.1 got signal: -11- =======
void replicationCron(void) {
/* Check if we should connect to a MASTER */
if (server.replstate == REDIS_REPL_CONNECT) {
redisLog(REDIS_NOTICE,"Connecting to MASTER...");
if (connectWithMaster() == REDIS_OK) {
---> Crashed because server.masterhost is set to NULL above!!
}
}
}
Note: The timing of slaveofCommand is important: it has to occur before the syncWithMaster is called, so that making sure the next iteration for event handling, in replicationCron, server.replstate equals to “REDIS_REPL_CONNCT”.
In syncWithMaster, it did not handle the case where server.replstate==REDIS_REPL_NONE --- if that’s the case (as in this scenario), it shouldn’t even try to connect with Master.
Incorrect error handling (missed error subtype).
Return from syncWithMaster ASAP if the machine is no longer a slave (replstate == REDIS_REPL_NONE):
src/replication.c View file @ 98f5abb
@@ -376,6 +376,13 @@ void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
REDIS_NOTUSED(privdata);
REDIS_NOTUSED(mask);
+ /* If this event fired after the user turned the instance into a master
+ * with SLAVEOF NO ONE we must just return ASAP. */
+ if (server.replstate == REDIS_REPL_NONE) {
+ close(fd);
+ return;
+ }
+
redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");
/* This event should only be triggered once since it is used to have a
* non-blocking connect(2) to the master. It has been triggered when this