[redis]github-547-2 Report

In the bug report of github-547, the developers discovered two independent bugs. This report documents the 2nd bug.

https://github.com/antirez/redis/issues/547

https://github.com/antirez/redis/commit/5b63ccce6c5a3b2097c87cc1b6af314c209172ef

1. Symptom

Client reply buffer limit (a new feature introduced in redis 2.6)’s computation is broken for some requests. This was discovered in the “INFO” logging in bug github-547.

1.1 Severity

Severe

1.2 Was there exception thrown?

No.

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

a few clients

2. How to reproduce this failure

The same way as github-547.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The symptom is logged:

[10845] 09 Jun 04:42:08.293 # --- CURRENT CLIENT INFO

[10845] 09 Jun 04:42:08.293 # client: addr=10.108.127.147:53982 fd=125 age=13903 idle=0 flags=N db=12 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=18446744073709398960 events=r cmd=eval

3.2 Backward inference

This log msg is printed by:

/* Turn a Redis client into an sds string representing its state. */

sds getClientInfoString(redisClient *client) {

  .. ..

   return sdscatprintf(sdsempty(),

        "addr=%s:%d fd=%d name=%s age=%ld idle=%ld flags=%s db=%d sub=%d psub=%d multi=%d qbuf=%lu qbuf-free=%lu obl=%lu oll=%lu omem=%lu events=%s cmd=%s",

        (client->flags & REDIS_UNIX_SOCKET) ? server.unixsocket : ip,

        port,

        client->fd,

        client->name ? (char*)client->name->ptr : "",

        (long)(server.unixtime - client->ctime),

        (long)(server.unixtime - client->lastinteraction),

        flags,

        client->db->id,

        (int) dictSize(client->pubsub_channels),

        (int) listLength(client->pubsub_patterns),

        (client->flags & REDIS_MULTI) ? client->mstate.count : -1,

        (unsigned long) sdslen(client->querybuf),

        (unsigned long) sdsavail(client->querybuf),

        (unsigned long) client->bufpos,

        (unsigned long) listLength(client->reply),

        getClientOutputBufferMemoryUsage(client),

        events,

        client->lastcmd ? client->lastcmd->name : "NULL");

}

/* This function returns the number of bytes that Redis is virtually

 * using to store the reply still not read by the client.

 * It is "virtual" since the reply output list may contain objects that

 * are shared and are not really using additional memory.

 *

 * The function returns the total sum of the length of all the objects

 * stored in the output list, plus the memory used to allocate every

 * list node. The static reply buffer is not taken into account since it

 * is allocated anyway.

 *

 * Note: this function is very fast so can be called as many time as

 * the caller wishes. The main usage of this function currently is

 * enforcing the client output length limits. */

unsigned long getClientOutputBufferMemoryUsage(redisClient *c) {

    unsigned long list_item_size = sizeof(listNode)+sizeof(robj);

    return c->reply_bytes + (list_item_size*listLength(c->reply));

}

So we know the computation messed up with the calculation of c->reply_bytes. But it is hard to further infer where the computation messed up.

4. Root cause

Wrong computation on c->reply_bytes when the objects are glued. In this case, the computation will subtract more from c->reply_bytes than it should, causing c->reply_bytes to grow to negative (and was interpreted as a large positive).

4.1 Category:

Semantic

5. Fix

5.1 How?

         /* Only glue when the next node is non-NULL (an sds in this case) */

         if (next->ptr != NULL) {

+            c->reply_bytes -= zmalloc_size_sds(len->ptr);

+            c->reply_bytes -= zmalloc_size_sds(next->ptr);

             len->ptr = sdscatlen(len->ptr,next->ptr,sdslen(next->ptr));

+            c->reply_bytes += zmalloc_size_sds(len->ptr);

             listDelNode(c->reply,ln->next);

         }

     }