[redis]github-142 Report

https://github.com/antirez/redis/issues/142

1. Symptom

When “appendonly” is enabled, the data recovery process cannot finish after any server restart. The data cannot be recovered, which results in data loss.

User report: “The docs say "When you restart Redis it will re-play the AOF to rebuild the state." Therefore I thought it would honor the FLUSHALL command, too, and end up with an empty DB. I have limited memory and just throw away my whole database regularly via FLUSHALL. If it crashes after a while, I'm unable to recover the append-only file because it just contains way too much data.”

1.1 Severity

Severe. Listed on the “redis.io/topics/problems” page.

1.2 Was there an exception thrown?

Yes. The append-only file (AOF) reload cannot finish because the server runs out of memory.

1.2.1 Were there multiple exceptions?

Yes. The server restarts, and then the AOF reload cannot finish because the server runs out of memory.

1.3 Affected scope

The entire cluster.

2. How to reproduce this failure

2.0 Version

2.4.0

2.1 Configuration

appendonly yes

appendfilename appendonly.aof

No changes to other configuration parameters.

2.2 Reproduction procedure

$ redis-server redis.conf

1. populate the db:

$ redis-cli

> set key1 a

> set key2 a

2. flushall

> flushall

3. restart redis-server:

Observe that flushall was not replayed (the memory usage of Redis is large).

2.2.1 Timing order

Above order.

2.2.2 Events order externally controllable?

Absolutely.

2.3 Can the logs tell how to reproduce the failure?

Yes. Key events:

flushall:

[977] 16 Oct 23:12:32 * DB saved on disk

Restart:

( system reboot )

2.4 How many machines needed?

1.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

After restart, observe that the system cannot finish reloading appendonly.aof because of memory consumption.

3.2 Backward inference

The memory usage is clearly abnormally high, because the flushall should have brought it down.

This naturally leads users to suspect that flushall was not replayed during the AOF recovery.

If we examine appendonly.aof further, we can see that “flushall” was never logged!
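The check on appendonly.aof can be automated. The following is a minimal sketch in C, assuming the file is small enough to read into memory (true for the two-key reproduction above, not for a large production log). The helper names buffer_contains_ci and aof_mentions_flushall are hypothetical and not part of Redis. The search is a plain case-insensitive substring match, so a key or value that happens to contain the text "flushall" would also match.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive search for needle inside a raw byte buffer.
 * Returns 1 if found, 0 otherwise. */
static int buffer_contains_ci(const char *buf, size_t len, const char *needle) {
    size_t nlen = strlen(needle);
    if (nlen == 0 || nlen > len) return 0;
    for (size_t i = 0; i + nlen <= len; i++) {
        size_t j = 0;
        while (j < nlen &&
               tolower((unsigned char)buf[i + j]) ==
               tolower((unsigned char)needle[j])) {
            j++;
        }
        if (j == nlen) return 1;
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the file at path mentions flushall,
 * 0 if it does not, and -1 on an I/O or allocation error. */
int aof_mentions_flushall(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return -1; }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(f); return -1; }
    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);

    int found = buffer_contains_ci(buf, got, "flushall");
    free(buf);
    return found;
}
```

Calling aof_mentions_flushall("appendonly.aof") after step 2 of the reproduction should return 0 on an affected server, which matches the observation above.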

The reason (see the graph above):

In flushallCommand (triggered when the user issues a “flushall”), server.dirty is reset to 0, so the feedAppendOnlyFile function is never called for that command.
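The effect of the reset can be illustrated with a toy model in C. This is a sketch, not Redis source: the struct and every toy_ name are hypothetical, and it assumes the command dispatcher appends a command to the AOF only when the dirty counter grew while the command ran.

```c
/* Toy stand-in for the server state (not the real Redis struct). */
struct toy_server {
    long long dirty;   /* changes since the last save */
    int appendonly;    /* 1 when "appendonly yes" is configured */
    int aof_entries;   /* number of commands written to the toy AOF */
};

typedef void (*toy_command)(struct toy_server *);

/* A write command such as SET marks one change. */
static void toy_set(struct toy_server *s) {
    s->dirty++;
}

/* Buggy flushall: the save it performs resets the counter to 0. */
static void toy_flushall_buggy(struct toy_server *s) {
    s->dirty++;     /* emptying the database counts as a change */
    s->dirty = 0;   /* the save done by flushall resets the counter */
}

/* Assumed dispatcher logic: log the command only if dirty increased. */
static void toy_call(struct toy_server *s, toy_command cmd) {
    long long before = s->dirty;
    cmd(s);
    if (s->appendonly && s->dirty - before > 0) {
        s->aof_entries++;   /* stands in for feedAppendOnlyFile */
    }
}

/* Replays the reproduction: two SETs, then flushall.
 * Returns how many commands reached the toy AOF. */
int toy_replay_bug(void) {
    struct toy_server s = {0, 1, 0};
    toy_call(&s, toy_set);
    toy_call(&s, toy_set);
    toy_call(&s, toy_flushall_buggy);
    return s.aof_entries;
}
```

In this model toy_replay_bug returns 2: both SETs are logged, but flushall leaves the counter lower than it was before the command ran, so it is skipped.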

4. Root cause

server.dirty is reset in flushallCommand, so “flushall” is not logged in the append-only log. The fix is to save the dirty counter (server.dirty) before flushall performs its save and to restore it afterwards.
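The save-and-restore idea can be sketched as follows. This is an illustrative model in C rather than the actual patch: the struct and all sketch_ names are hypothetical, and it assumes the command is logged only when the dirty counter ends up higher than it was before the command ran.

```c
/* Toy stand-in for the server state (not the real Redis struct). */
struct sketch_server {
    long long dirty;   /* changes since the last save */
};

/* Stand-in for the save that flushall triggers: it resets dirty. */
static void sketch_save(struct sketch_server *s) {
    s->dirty = 0;
}

/* Fixed flushall: remember dirty before the save, restore it after. */
void sketch_flushall_fixed(struct sketch_server *s) {
    long long saved_dirty;

    s->dirty++;              /* emptying the database counts as a change */
    saved_dirty = s->dirty;  /* save the counter before it is reset */
    sketch_save(s);
    s->dirty = saved_dirty;  /* restore, so the caller still sees a change */
}

/* Returns 1 if a dispatcher that compares dirty before and after
 * the command would log flushall, 0 otherwise. */
int sketch_flushall_is_logged(long long dirty_before) {
    struct sketch_server s = { dirty_before };
    sketch_flushall_fixed(&s);
    return s.dirty - dirty_before > 0;
}
```

With the counter restored, sketch_flushall_is_logged returns 1 regardless of how many changes were pending before the flush, so the command would reach the append-only log.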

4.1 Category:

Semantic + incorrect handling. Every time the system restarts, it tries to recover the AOF and fails. The handling should also be improved: a failed recovery should not bring down the server every time.