Slicing through a range of columns under the quorum consistency level sometimes returns the wrong result. Note: this failure actually occurs with all consistency levels below or equal to quorum.
Background:
* Quorum: the minimum number of replicas whose data needs to be consistent on a read request. It is calculated as: replication factor / 2 + 1, using integer division. For example, if your replication factor is 3, the quorum = 3/2+1 = 2. This means that when a user does a read, at least 2 replicas need to be read and their data needs to be consistent before the result is returned to the user.
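The quorum formula above can be sketched in a few lines. This is only an illustration of the arithmetic; the function name `quorum` is hypothetical and is not part of Cassandra's API.

```python
def quorum(replication_factor: int) -> int:
    """Quorum size as described above: replication factor / 2 + 1.

    The division is integer division, so a replication factor of 3
    gives 3 // 2 + 1 = 2.
    """
    return replication_factor // 2 + 1


# Quorum sizes for a few replication factors (illustrative values).
quorum_sizes = {rf: quorum(rf) for rf in (1, 3, 4, 5)}
```

With a replication factor of 3 the quorum is 2, which is why stopping one node at a time still leaves enough replicas for a quorum write to succeed.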
wrong computation
critical
no exception. It returned results instead of throwing an exception.
no
Yes. Nodes have to be manipulated when writing the data (stopped and restarted)
single file
no
1.0.0
3 nodes with replication factor of 3 and consistency level set to quorum
minimum of 3
1. start 3 nodes (feature start)
2. set the consistency level to QUORUM (configuration change)
3. Write 10 columns of data to a row (file write)
4. Stop node 1 (disconnect)
5. Sleep 1 second (feature start)
6. delete column 1 (file write)
7. sleep 5 seconds (feature start)
8. start node 1 (restart)
9. stop node 2 (disconnect)
10. sleep 1 second (feature start)
11. delete column 2 (file write)
12. sleep 5 seconds (feature start)
13. start node 2 (restart)
14. stop node 3 (disconnect)
15. sleep 1 second (feature start)
16. delete column 3 (file write)
17. sleep 5 seconds (feature start)
18. start node 3 (restart)
19. sleep 2 seconds (feature start)
20. get first 3 columns (file read)
*Note: sleep is used here to give Cassandra enough time to recognize the node configuration change.
20
yes
yes
yes. Logs can tell you when and what is written and read.
3
no warning message (just returns the wrong result)
The cluster is set up with 3 nodes, a replication factor of 3, and the consistency level set to quorum. After 10 columns are inserted into a row, the first 3 columns are deleted. A quorum read is then issued with the range [0, 3]. However, the returned result is incorrect.
Because the response was wrong, in addition to possibly containing stale results, we looked into why this is the case. From the log, we can tell that the 3 nodes were inaccessible successively, one after another. From this we found a design problem that shows up in a rare case: a quorum read that is served after all nodes have been disconnected in succession.
Diagnosing is hard. First, the problem is rare and hard to reproduce before the root cause is known. We have to take into consideration the timing of both the disconnection and reconnection of nodes, and the time required for Cassandra to detect the configuration change. Also, understanding how quorum read/write works is essential. Without being familiar with the above points, it is difficult to give a diagnosis.
The nodes are inaccessible one after another, each for a short period of time, while quorum writes are performed on multiple columns. Since the basic quorum requirement is still satisfied each time, every quorum write (marking one of the first 3 columns as deleted) succeeds. After the 3 column deletions, the 3 replicas of the row look like the diagram below (x marks a column recorded as deleted on that replica):
replica1 | 1 | x | x | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
replica2 | x | 2 | x | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
replica3 | x | x | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
As stated in the first section of the report, a quorum read returns the data with the newest timestamp once Cassandra has received responses from a quorum of replicas. However, when performing quorum slice iteration (a quorum read of multiple columns), no matter which two replicas respond, the end result will be wrong.
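The effect can be illustrated with a small simulation of the three replica states in the diagram above. This is a simplified model, not Cassandra's actual read path: it assumes each replica answers a slice of count 3 with its first 3 live columns plus any deletion markers it passes on the way, and that reconciliation keeps the newest version of each column. All names (`make_replica`, `local_slice`, `reconcile`) and the timestamps are hypothetical.

```python
from itertools import combinations


def make_replica(missed_delete):
    """Model a replica as {column: (timestamp, is_deleted)}.

    All 10 columns are written at t=1. Columns 1-3 are deleted at t=2,
    except the one delete this replica missed while it was stopped.
    """
    cols = {c: (1, False) for c in range(1, 11)}
    for c in (1, 2, 3):
        if c != missed_delete:
            cols[c] = (2, True)  # deletion marker
    return cols


def local_slice(replica, count):
    """Assumed per-replica behaviour: walk columns in order and stop
    once `count` live columns have been collected; deletion markers
    seen along the way are included in the response."""
    out, live = {}, 0
    for c in sorted(replica):
        if live == count:
            break
        out[c] = replica[c]
        if not replica[c][1]:
            live += 1
    return out


def reconcile(responses):
    """Keep the newest version of each column, then drop deleted ones."""
    merged = {}
    for resp in responses:
        for c, cell in resp.items():
            if c not in merged or cell[0] > merged[c][0]:
                merged[c] = cell
    return [c for c in sorted(merged) if not merged[c][1]]


# replica1 missed the delete of column 1, replica2 of column 2, and so on.
replicas = [make_replica(m) for m in (1, 2, 3)]

# Reconciled slice (count 3) for every possible pair of responding replicas.
results = {
    (i + 1, j + 1): reconcile(
        [local_slice(replicas[i], 3), local_slice(replicas[j], 3)]
    )
    for i, j in combinations(range(3), 2)
}
```

Under these assumptions every pair of replicas reconciles to columns 4 and 5 only, one column short of the three live columns (4, 5, 6) that the slice should return. Each replica spends one of its three slots on a column that the other replica has marked as deleted.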
Semantic
no
yes
Given a slice request with a range of N, if the quorum replicas received by Cassandra have different numbers of columns, one or more additional replicas need to be fetched to ensure the correct output.