[redis]github-145-2 Report
https://github.com/antirez/redis/issues/145
This report actually has two failures: the first failure is crash, the developer fixed it, but there is another bug introduced by the fix. This document discuss the second failure (caused by the first one), while “github-145-1” discusses the oritinal failure.
Version: 2.4.1, with patch:
https://github.com/antirez/redis/commit/98f5abb6e3906ee7fa379be8ec86e18cfc1081a4
applied.
After a redis instance that was recently elected from a slave into a master, it refuses to accept new slaves. The impact will be that data is simply not cannot be replicated.
Severe.
Yes.
Host 1 (the one refuses to take slaves):
[2786] 18 Oct 14:18:17 # Error allocating resoures for the client
Host 2 (the one to become the slave of host 1):
[2854] 18 Oct 14:18:24 # I/O error reading bulk count from MASTER: Connection reset by peer
Yes. See above.
Entire cluster. Potential data loss.
2.4.1 with patch:
https://github.com/antirez/redis/commit/98f5abb6e3906ee7fa379be8ec86e18cfc1081a4
applied.
Host 1: originally a slave of host 3, later user promoted it to a master.
Host 2: after Host 1 becomes master, user further tells it to become a slave of Host 1.
1. First follow the steps to reproduce 145-1
2. send host 2: slaveof host1 port_of_host1
Order is important
No. This is because to reproduce 145-1 the order is not controllable.
Yes.all the key events are logged!
2 (client + host 1 + host 2)
[2786] 18 Oct 14:18:17 # Error allocating resoures for the client
You know the above error msg is pritnted here:
static void acceptCommonHandler(int fd) {
redisClient *c;
if ((c = createClient(fd)) == NULL) {
redisLog(REDIS_WARNING,"Error allocating resoures for the client");
close(fd); /* May be already closed, just ingore errors */
return;
}
}
By tracking down the call chain, we can find that the error is from:
epoll_ctl(state->epfd,op,fd,&ee) returned -1.
Unfortunately, you don’t have too much other information from the log. You don’t even know why “epoll_ctl” failed (it failed because of ENOENT, but this is not logged).
The key is that “SLAVEOF NO ONE” (request 1 in the graph) is occurred before “syncWithMaster” (the edge marked as 1). Unfortunately, while you can infer “slaveof no one” is received, you don’t know it occurred before “syncWithMaster”. Another key evidence is that the fd is closed in syncWithMaster, but not deleted (therefore the failed epoll_ctl is on this fd). But this fd also is not logged.
The key is that the fd is closed in syncWithMaster, but this event is not deleted from eventLoop. Next time there is another master-slave connection, this fd will be reused, resulting in an error in epoll_ctl.
Incorrect error handling.
Delete the event in syncWithMaster first:
src/replication.c View file @ b6fe152
@@ -376,6 +376,12 @@ void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
REDIS_NOTUSED(privdata);
REDIS_NOTUSED(mask);
+ redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");
+ /* This event should only be triggered once since it is used to have a
+ * non-blocking connect(2) to the master. It has been triggered when this
+ * function is called, so we can delete it. */
+ aeDeleteFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE);
+
/* If this event fired after the user turned the instance into a master
* with SLAVEOF NO ONE we must just return ASAP. */
if (server.replstate == REDIS_REPL_NONE) {
@@ -383,12 +389,6 @@ void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
return;
}
- redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");
- /* This event should only be triggered once since it is used to have a
- * non-blocking connect(2) to the master. It has been triggered when this
- * function is called, so we can delete it. */
- aeDeleteFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE);
-
/* AUTH with the master if required. */
if(server.masterauth) {
char authcmd[1024];