CASSANDRA-2643

1. Symptom

Slicing through a range of columns at the quorum consistency level sometimes returns the wrong result. Note: this failure actually occurs at every consistency level at or below quorum.

Background:

* Quorum: the minimum number of replicas whose data needs to be consistent on a read request. It is calculated as replication factor / 2 + 1, using integer division. For example, if your replication factor is 3, the quorum = 3/2 + 1 = 2. This means that when a user does a read, at least 2 replicas need to be read and their data needs to be consistent before the result is returned to the user.
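
The quorum formula above can be written as a small helper. This is a minimal sketch; the function name `quorum` is ours and is not a Cassandra API. The point it illustrates is that the division is integer division, so a replication factor of 3 gives 2 rather than 2.5.

```python
def quorum(replication_factor: int) -> int:
    """Quorum size as defined above: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1  # // is integer (floor) division


for rf in (1, 2, 3, 5):
    print(f"RF={rf} -> quorum={quorum(rf)}")
```

With a replication factor of 3, one replica can be down and quorum reads and writes still succeed, which is exactly the situation the reproduction in section 2.2 exploits.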

Category (in the spreadsheet):

wrong computation

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

no exception. It returned results instead of throwing an exception.

1.2.1 Were there multiple exceptions?

1.3 Was there a long propagation of the failure?

Yes. Nodes have to be manipulated when writing the data (stopped and restarted)

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single file

Catastrophic? (spreadsheet column)

2. How to reproduce this failure

2.0 Version

1.0.0

2.1 Configuration

3 nodes with replication factor of 3 and consistency level set to quorum

# of Nodes?

minimum of 3

2.2 Reproduction procedure

1. start 3 nodes (feature start)

2. set the consistency level to QUORUM (configuration change)

3. Write 10 columns of data to a row (file write)

4. Stop node 1 (disconnect)

5. Sleep 1 second (feature start)

6. delete column 1 (file write)

7. sleep 5 seconds (feature start)

8. start node 1 (restart)

9. stop node 2 (disconnect)

10. sleep 1 second (feature start)

11. delete column 2 (file write)

12. sleep 5 seconds (feature start)

13. start node 2 (restart)

14. stop node 3 (disconnect)

15. sleep 1 second (feature start)

16. delete column 3 (file write)

17. sleep 5 seconds (feature start)

18. start node 3 (restart)

19. sleep 2 seconds (feature start)

20. get first 3 columns (file read)

*Note: sleep is used here to give Cassandra enough time to recognize the node configuration change.
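
Steps 4 to 18 repeat the same five-step round once per node: stop node i, wait, delete column i, wait, restart node i. The sketch below generates steps 3 to 20 as plain data to make that pattern explicit. The action names (`stop_node`, `delete_column`, and so on) are hypothetical labels for a test harness, not real Cassandra commands, and steps 1 and 2 (starting the cluster and setting QUORUM) are assumed to be done beforehand.

```python
def reproduction_steps(nodes=(1, 2, 3)):
    """Return steps 3-20 of the procedure as (action, argument) pairs.

    Action names are hypothetical labels; a harness would map them to
    whatever tooling actually drives the cluster.
    """
    steps = [("write_columns", list(range(1, 11)))]   # step 3: columns 1..10
    for node in nodes:                                # steps 4-18: one round per node
        steps += [
            ("stop_node", node),
            ("sleep", 1),
            ("delete_column", node),  # column i is deleted while node i is down
            ("sleep", 5),
            ("start_node", node),
        ]
    steps += [("sleep", 2), ("get_first_columns", 3)]  # steps 19-20
    return steps


for number, (action, arg) in enumerate(reproduction_steps(), start=3):
    print(number, action, arg)
```

The key property is that every delete is issued while exactly one node is down, and a different node misses each delete.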

Num triggering events

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes. Logs can tell you when and what is written and read.

2.4 How many machines needed?

3. Diagnosis procedure

Error msg?

no warning message (just returns the wrong result)

3.1 Detailed Symptom (where you start)

The cluster is set up with 3 nodes, a replication factor of 3, and the consistency level set to quorum. After 10 columns are inserted into a row, the first 3 columns are deleted. A quorum read is then issued for the range [0, 3]. However, the returned result is incorrect.

3.2 Backward inference

Because we were getting a wrong response, in addition to possibly stale results, we looked into why this was the case. From the log, we can tell that the 3 nodes were inaccessible successively, one after another. From this we found a design problem in a rare scenario: returning a quorum read response after quorum writes during which every node was disconnected in turn.

How hard is the diagnosis?

Diagnosis is hard. First, the problem is rare and hard to reproduce before the root cause is known. We have to take into consideration the timing of both the disconnection and reconnection of nodes, and the time required for Cassandra to detect the configuration change. Also, understanding how quorum reads and writes work is essential. Without being familiar with these points, it is difficult to give a diagnosis.

4. Root cause

The nodes are successively inaccessible for a short period of time while quorum writes to multiple columns are being performed. Since the basic quorum requirement is satisfied each time, every quorum write (marking one of the first 3 columns as deleted) succeeds. After the 3 column deletions, the 3 replicas of the row look like the diagram below, where x marks a column that the replica has recorded as deleted:

replica1	1	x	x	4	5	6	7	8	9	10
replica2	x	2	x	4	5	6	7	8	9	10
replica3	x	x	3	4	5	6	7	8	9	10

As stated in the first section of the report, a quorum read returns the data with the newest timestamp once responses from a quorum of replicas have been received by Cassandra. However, when performing quorum slice iteration (a quorum read of multiple columns), no matter which two replicas respond, the end result will be wrong.
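
To see why every pair of replicas gives a wrong answer, the replica states in the diagram can be modeled directly. The sketch below is a simplified model, not Cassandra code. It assumes that each replica returns columns in order (including deletion markers) until it has found the requested number of live columns, and that the coordinator keeps the newest version of each column. All names (`build_replicas`, `replica_slice`, `quorum_read`) and the timestamps are ours.

```python
from itertools import combinations

COLUMNS = range(1, 11)


def build_replicas():
    """Replica states from the diagram: node i missed the delete of column i."""
    # Timestamp 0: the initial write reaches all replicas. Cell = (timestamp, is_live).
    replicas = {n: {c: (0, True) for c in COLUMNS} for n in (1, 2, 3)}
    for down in (1, 2, 3):              # delete of column `down` at timestamp `down`
        for n in replicas:
            if n != down:               # the stopped node never sees the delete
                replicas[n][down] = (down, False)
    return replicas


def replica_slice(replica, count):
    """Assumed replica behavior: stop after `count` live columns are found."""
    out, live = {}, 0
    for c in sorted(replica):
        out[c] = replica[c]
        live += replica[c][1]
        if live == count:
            break
    return out


def quorum_read(responses, count):
    """Assumed coordinator behavior: newest timestamp wins, return live columns."""
    merged = {}
    for resp in responses:
        for c, cell in resp.items():
            if c not in merged or cell[0] > merged[c][0]:
                merged[c] = cell
    return [c for c in sorted(merged) if merged[c][1]][:count]


replicas = build_replicas()
results = {
    pair: quorum_read([replica_slice(replicas[n], 3) for n in pair], 3)
    for pair in combinations((1, 2, 3), 2)
}
for pair, columns in results.items():
    print(pair, columns)   # every pair yields [4, 5]; the correct answer is [4, 5, 6]
```

In this model each replica counts one column as live that the other replica knows to be deleted, so each stops at column 5. After reconciliation only columns 4 and 5 survive, and the client receives 2 columns for a request of 3 even though column 6 exists.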

4.1 Category:

Semantic

4.2 Are there multiple faults?

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Given a slice request for a range of N columns, if the replicas that make up the quorum return different numbers of columns, data from one or more additional replicas needs to be fetched to ensure the correct output.
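
The general shape of such a fix can be sketched as a coordinator that checks the reconciled result and fetches more data before answering when the result is short. This is only an illustration under assumptions, not the actual patch: the replica model, the function names, and the timestamps are ours, and the extra fetch is modeled here as re-reading with a larger per-replica column limit, which is a swapped-in simplification of the additional fetch described above.

```python
COLUMNS = range(1, 11)


def build_replicas():
    """Replica states from section 4: node i missed the delete of column i."""
    replicas = {n: {c: (0, True) for c in COLUMNS} for n in (1, 2, 3)}
    for down in (1, 2, 3):
        for n in replicas:
            if n != down:
                replicas[n][down] = (down, False)   # (timestamp, is_live)
    return replicas


def replica_slice(replica, limit):
    """Assumed replica behavior: stop after `limit` live columns are found."""
    out, live = {}, 0
    for c in sorted(replica):
        out[c] = replica[c]
        live += replica[c][1]
        if live == limit:
            break
    return out


def merge_live(responses):
    """Newest timestamp wins per column; return the live column names in order."""
    merged = {}
    for resp in responses:
        for c, cell in resp.items():
            if c not in merged or cell[0] > merged[c][0]:
                merged[c] = cell
    return [c for c in sorted(merged) if merged[c][1]]


def read_with_retry(replicas, count):
    """Hypothetical coordinator: fetch more data when the merged slice is short."""
    limit = count
    while True:
        responses = [replica_slice(r, limit) for r in replicas]
        live = merge_live(responses)
        # A replica that returned fewer than `limit` live columns has no more data.
        exhausted = all(
            sum(is_live for _, is_live in resp.values()) < limit for resp in responses
        )
        if len(live) >= count or exhausted:
            return live[:count]
        limit += count - len(live)      # ask for the missing columns next round


replicas = build_replicas()
answer = read_with_retry([replicas[1], replicas[2]], 3)
print(answer)   # [4, 5, 6]
```

In this model the first round yields only columns 4 and 5, the coordinator notices the shortfall, and the second round with a limit of 4 returns the expected 4, 5, 6. The loop stops once every replica has run out of columns, so a row with genuinely fewer than N live columns still terminates.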