CASSANDRA-1442

1. Symptom

Using OrderPreservingPartitioner, the get range slice operation does not work correctly. Requesting the entire range in one block succeeds; requesting the same range in multiple smaller blocks does not return the full result.

For example, suppose the key range holds 100 rows. Requesting all 100 rows as a single block of 100 works. Requesting the same 100 rows as 10 blocks of 10 rows each does not return all 100 rows.
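
To make the access pattern concrete, here is a minimal sketch of a block-wise key scan, assuming the 0.6 Thrift interface; "Keyspace1" and "Users" are placeholder names, and fetchAllKeys is a hypothetical helper written only for this illustration.

    import org.apache.cassandra.thrift.*;
    import java.util.*;

    // Hypothetical helper: page over all row keys in blocks of blockSize.
    static Set<String> fetchAllKeys(Cassandra.Client client, int blockSize) throws Exception {
        // Ask for at most one column per row; we only care about the keys here.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1));

        Set<String> keys = new LinkedHashSet<String>();
        String start = "";                                    // "" = start of the range
        while (true) {
            KeyRange range = new KeyRange();
            range.setStart_key(start);
            range.setEnd_key("");                             // "" = open-ended
            range.setCount(blockSize);                        // rows per block
            List<KeySlice> block = client.get_range_slices("Keyspace1",
                    new ColumnParent("Users"), predicate, range, ConsistencyLevel.ONE);
            for (KeySlice slice : block) {
                keys.add(slice.getKey());                     // the set dedups the resumed key
            }
            if (block.size() < blockSize) break;              // short block: end of range
            start = block.get(block.size() - 1).getKey();     // resume from the last key seen
        }
        return keys;
    }

With the bug present, the set gathered by this block-wise scan is smaller than what a single request with a large enough count returns.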

 

Background:

Cassandra provides the ByteOrderedPartitioner (org.apache.cassandra.dht.ByteOrderedPartitioner) for ordered partitioning. This partitioner orders rows lexically by key bytes. You calculate tokens by looking at the actual values of your row key data and using a hexadecimal representation of the leading character(s) in a key. For example, if you wanted to partition rows alphabetically, you could assign an A token using its hexadecimal representation of 41.
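
As a quick worked example of that calculation (illustrative only):

    // 'A' is 0x41, so a token of "41" places keys starting with 'A' at that position.
    String token = Integer.toHexString("Apple".charAt(0));   // "41"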

Using the ordered partitioner allows range scans over rows. This means you can scan rows as though you were moving a cursor through a traditional index. For example, if your application has user names as the row key, you can scan rows for users whose names fall between Jake and Joe. This type of query is not possible with randomly partitioned row keys, since the keys are stored in the order of their MD5 hash (not sequentially). However, you can achieve the same functionality using column family indexes. Most applications can be designed with a data model that supports ordered queries as slices over a set of columns rather than range scans over a set of rows.
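
Reusing the client and predicate from the sketch above, such a bounded scan could look like this (again illustrative, with placeholder names):

    // Rows whose keys sort between "Jake" and "Joe".
    KeyRange range = new KeyRange();
    range.setStart_key("Jake");
    range.setEnd_key("Joe");
    range.setCount(100);
    List<KeySlice> rows = client.get_range_slices("Keyspace1",
            new ColumnParent("Users"), predicate, range, ConsistencyLevel.ONE);
    // Under RandomPartitioner the same call would walk rows in MD5-token order,
    // so key bounds like these would not be meaningful.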

Category (in the spreadsheet):

wrong computation

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

no exception (the system acted as if everything had gone successfully)

 

1.2.1 Were there multiple exceptions?

no

1.3 Was there a long propagation of the failure?

no

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single client (any client issuing this particular get range slice request is affected)

 

Catastrophic? (spreadsheet column)

no

 

2. How to reproduce this failure

2.0 Version

0.6.4

2.1 Configuration

Must have more than one node. Must use OrderPreservingPartitioner.

 

# of Nodes?

2

2.2 Reproduction procedure

1) Upgrade from 0.6.4 to 0.6.5.

2) Issue a get range slice request that returns the range in multiple blocks (file read).

Num triggering events

2

 

2.2.1 Timing order (Order important column)

NA

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

 

3. Diagnosis procedure

Error msg?

yes

3.1 Detailed Symptom (where you start)

We use getRangeSlices to iterate over all of the keys in a column family (2062 keys total). We are using OrderPreservingPartitioner, and we call getRangeSlices with a KeyRange specified by keys (not tokens). If we set the request block count (i.e., KeyRange.setCount()) to a number greater than 2062, we get all keys in one shot and all is good. If we try to fetch the keys in smaller blocks (block count = 100), we get bad results: only 800 unique keys come back.

3.2 Backward inference

The symptom could not be reproduced on a single-machine cluster, so it must be related to how Cassandra handles the operation in a cluster of more than one node.

We took a look at the code, and there is only one difference between the single-node and multi-node paths: in the multi-node case, Cassandra must ask each node for the sub-range it is responsible for and then combine the results before returning. We therefore narrowed our scope to the sub-range lookup part of the code and asked why only a range slice with a multi-block return fails. We finally found an error in the algorithm that splits the work into multiple block operations.
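
The following is a simplified model of that coordinator-side splitting, not the actual StorageProxy code; the names and the boundary representation are invented for illustration. It shows the invariant the buggy algorithm violated: the sub-ranges handed to the nodes must cover the requested range with no gaps at the boundaries.

    import java.util.*;

    class RangeSplitModel {
        // boundaries: upper bounds of each node's key interval, in ring order;
        // node i owns the keys between boundaries[i-1] and boundaries[i].
        static List<String[]> split(String start, String end, List<String> boundaries) {
            List<String[]> subRanges = new ArrayList<String[]>();
            String cursor = start;
            for (String boundary : boundaries) {
                if (boundary.compareTo(cursor) <= 0) continue;   // node lies before the range
                String subEnd = (boundary.compareTo(end) < 0) ? boundary : end;
                subRanges.add(new String[] { cursor, subEnd });  // query one node for this piece
                cursor = subEnd;    // the next sub-range must resume exactly at the boundary;
                                    // skipping past it silently drops the rows in between
                if (cursor.compareTo(end) >= 0) break;
            }
            return subRanges;
        }
    }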

3.3 Are the printed log sufficient for diagnosis?

yes

4. Root cause

A bug in the algorithm that splits the work into multiple block operations. The bug only manifests in one configuration: a multi-node cluster with an ordered partitioner.

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes
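
A regression check is straightforward to express; for example (a sketch reusing the hypothetical fetchAllKeys helper from section 1, not the test actually shipped with the fix):

    // The keys gathered block-wise must match the keys from one large request.
    Set<String> oneShot = fetchAllKeys(client, 5000);   // single block covers everything
    Set<String> paged   = fetchAllKeys(client, 100);    // many small blocks
    assert paged.equals(oneShot)
        : "block-wise scan lost " + (oneShot.size() - paged.size()) + " keys";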

5. Fix

5.1 How?

The fix replaced the algorithm with a new one that achieves the intended functionality, i.e., it splits the work correctly across the different nodes.