[redis]github-366 Report

https://github.com/antirez/redis/issues/366

1. Symptom

The Redis slave crashed when the RDB transfer encountered a network error.

1.1 Severity

Critical

1.2 Was there an exception thrown?

Yes. The slave crashed.

1.2.1 Were there multiple exceptions?

Yes. A network error, followed by the crash.

1.3 Scope of the failure

Single slave

2. How to reproduce this failure

2.0 Version

2.4.6

2.1 Configuration

Enable RDB

2.2 Reproduction procedure

1. Introduce a network error (e.g. a disconnect) between the slave and the master while the RDB transfer is in progress.

2.2.1 Timing order

Single event

2.2.2 Events order externally controllable?

No. Step 1 must happen during the RDB transfer.

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (master + slave)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Crash

3.2 Backward inference

From the log we know that a network error occurred before the crash. The crash itself was caused by corrupted RDB data, as the log message shows:

"Unknown RDB encoding type"

So the network error corrupted the RDB data, and the corrupted data crashed the slave.

4. Root cause

The RDB data was corrupted in transit from the master to the slave because of the network error, and the corruption went undetected. The RDB data should have carried a checksum.

4.1 Category:

Incorrect error handling (not handled)

5. Fix

5.1 How?

A checksum was added to the RDB transmission.
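
To illustrate the idea behind the fix, here is a minimal sketch in Python of how a checksum lets the receiver reject a damaged transfer instead of parsing it. It is not the Redis implementation: the function names append_checksum and verify_checksum, the 4-byte trailer layout, and the use of CRC32 are all assumptions made for this example, since the report does not say which checksum algorithm or framing was used.

```python
import struct
import zlib


def append_checksum(payload: bytes) -> bytes:
    """Sender side (hypothetical): append a 4-byte big-endian CRC32 trailer."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def verify_checksum(framed: bytes) -> bytes:
    """Receiver side (hypothetical): return the payload or raise ValueError.

    The check runs before any parsing, so corrupted bytes never reach
    the loader that would otherwise fail on an unknown encoding.
    """
    if len(framed) < 4:
        raise ValueError("transfer too short to contain a checksum")
    payload, trailer = framed[:-4], framed[-4:]
    (expected,) = struct.unpack(">I", trailer)
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: discard the transfer and resync")
    return payload
```

With such a check in place, the slave can treat a mismatch as a recoverable error (drop the data and request a new transfer) rather than loading the corrupted data and crashing.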