1. Symptom

A race condition in a Redis Cluster setup resulted in client data loss.

1.1 Severity

Critical

1.2 Was there exception thrown?

1.2.1 Were there multiple exceptions?

1.3 Scope of the failure

Single or a few clients

2. How to reproduce this failure

2.0 Version

redis-2.6-rc1

2.1 Configuration

Cluster configuration

2.2 Reproduction procedure

1. Migrate some keys from node A to node B (feature start).

2. While the migration is in progress, LPUSH to a key on node A; the key must still reside on A at this point (feature start).

3. While the migration is still in progress, LPUSH to the same key on node B.

You should observe that the push from step 2 is lost.

2.2.1 Timing order

The steps must occur in the order listed above.

2.2.2 Events order externally controllable?

No (node nondeterministic)

2.3 Can the logs tell how to reproduce the failure?

No.

2.4 How many machines needed?

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

There is no obvious starting point. You notice the data loss, but it is not clear how it occurred.

3.2 Backward inference

Impossible, as it is very likely that by the time you notice the data loss, nothing can be recovered.

4. Root cause

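By design, while a key space is migrating from A to B, both A and B accept queries for keys that have not been migrated yet. Once a key has been migrated, A no longer accepts queries for it. During the migration window, a key that has not yet been migrated can therefore be written on both nodes, which is how the push sent to A comes to be lost.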
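The window described above can be illustrated with a small toy model in Python. This is only a sketch, not the Redis implementation: the `Node` class, the `finish_migration` helper, and the key name `mylist` are all hypothetical, and the model assumes that when the handoff of a key completes, a copy already present on B is kept and the copy on A is discarded.

```python
class Node:
    """Toy stand-in for a cluster node (hypothetical, not Redis code)."""

    def __init__(self, name):
        self.name = name
        self.lists = {}          # key -> list of values
        self.handed_off = set()  # keys this node no longer serves

    def lpush(self, key, value):
        # After a key has been migrated away, the node rejects queries for it.
        if key in self.handed_off:
            raise RuntimeError(f"{self.name} no longer serves {key!r}")
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])


def finish_migration(src, dst, key):
    """Hand `key` over from src to dst.

    Assumption for this sketch: if dst already holds the key,
    dst's copy is kept and src's copy is discarded.
    """
    value = src.lists.pop(key, None)
    src.handed_off.add(key)
    if value is not None and key not in dst.lists:
        dst.lists[key] = value


a, b = Node("A"), Node("B")
a.lpush("mylist", "initial")   # the key lives on A before migration starts

# Step 1: migration from A to B has begun, but "mylist" has not moved yet.
a.lpush("mylist", "step2")     # Step 2: A still accepts the write
b.lpush("mylist", "step3")     # Step 3: B also accepts a write for the key
finish_migration(a, b, "mylist")

surviving = b.lists["mylist"]
print(surviving)
```

Under these assumptions, the list that survives on B contains only the value pushed in step 3; the value pushed to A in step 2 is gone, and A now rejects further writes to the key.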

4.1 Category:

Race