[redis]github-145-2 Report

https://github.com/antirez/redis/issues/145

This report actually has two failures: the first failure is crash, the developer fixed it, but there is another bug introduced by the fix. This document discuss the second failure (caused by the first one), while “github-145-1” discusses the oritinal failure.

Version: 2.4.1, with patch:

https://github.com/antirez/redis/commit/98f5abb6e3906ee7fa379be8ec86e18cfc1081a4

applied.

1. Symptom

After a redis instance that was recently elected from a slave into a master, it refuses to accept new slaves. The impact will be that data is simply not cannot be replicated.

1.1 Severity

Severe.

1.2 Was there exception thrown?

Yes.

Host 1 (the one refuses to take slaves):

[2786] 18 Oct 14:18:17 # Error allocating resoures for the client

Host 2 (the one to become the slave of host 1):

[2854] 18 Oct 14:18:24 # I/O error reading bulk count from MASTER: Connection reset by peer

1.2.1 Were there multiple exceptions?

Yes. See above.

1.3 Scope of the failure

Entire cluster. Potential data loss.

2. How to reproduce this failure

2.0 Version

2.4.1 with patch:

https://github.com/antirez/redis/commit/98f5abb6e3906ee7fa379be8ec86e18cfc1081a4

applied.

2.1 Configuration

Host 1: originally a slave of host 3, later user promoted it to a master.

Host 2: after Host 1 becomes master, user further tells it to become a slave of Host 1.

2.2 Reproduction procedure

1. First follow the steps to reproduce 145-1

2. send host 2: slaveof host1 port_of_host1

2.2.1 Timing order

Order is important

2.2.2 Events order externally controllable?

No. This is because to reproduce 145-1 the order is not controllable.

2.3 Can the logs tell how to reproduce the failure?

Yes.all the key events are logged!

2.4 How many machines needed?

2 (client + host 1 + host 2)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

[2786] 18 Oct 14:18:17 # Error allocating resoures for the client

3.2 Backward inference

You know the above error msg is pritnted here:

static void acceptCommonHandler(int fd) {

redisClient *c;

if ((c = createClient(fd)) == NULL) {

redisLog(REDIS_WARNING,"Error allocating resoures for the client");

close(fd); /* May be already closed, just ingore errors */

return;

}

By tracking down the call chain, we can find that the error is from:

epoll_ctl(state->epfd,op,fd,&ee) returned -1.

Unfortunately, you don’t have too much other information from the log. You don’t even know why “epoll_ctl” failed (it failed because of ENOENT, but this is not logged).

The key is that “SLAVEOF NO ONE” (request 1 in the graph) is occurred before “syncWithMaster” (the edge marked as 1). Unfortunately, while you can infer “slaveof no one” is received, you don’t know it occurred before “syncWithMaster”. Another key evidence is that the fd is closed in syncWithMaster, but not deleted (therefore the failed epoll_ctl is on this fd). But this fd also is not logged.

4. Root cause

The key is that the fd is closed in syncWithMaster, but this event is not deleted from eventLoop. Next time there is another master-slave connection, this fd will be reused, resulting in an error in epoll_ctl.

4.1 Category:

Incorrect error handling.

5. Fix

5.1 How?

Delete the event in syncWithMaster first:

src/replication.c View file @ b6fe152

@@ -376,6 +376,12 @@ void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {

REDIS_NOTUSED(privdata);

REDIS_NOTUSED(mask);

+ redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");

+ /* This event should only be triggered once since it is used to have a

+ * non-blocking connect(2) to the master. It has been triggered when this

+ * function is called, so we can delete it. */

+ aeDeleteFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE);

/* If this event fired after the user turned the instance into a master

* with SLAVEOF NO ONE we must just return ASAP. */

if (server.replstate == REDIS_REPL_NONE) {

@@ -383,12 +389,6 @@ void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {

return;

}

- redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");

- /* This event should only be triggered once since it is used to have a

- * non-blocking connect(2) to the master. It has been triggered when this

- * function is called, so we can delete it. */

- aeDeleteFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE);

/* AUTH with the master if required. */

if(server.masterauth) {

char authcmd[1024];