Cassandra-4262

1. Symptom

Cassandra 1.1 does not preserve compatibility with index queries against 1.0 nodes. A RangeSliceCommand usage that worked on version 1.0 does not work on 1.1 nodes (no data damage, though). This causes problems especially during a live upgrade.

 

Category (in the spreadsheet):

early termination

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

no, no warning or exception. Fails silently.

 

1.2.1 Were there multiple exceptions?

no

 

1.3 Was there a long propagation of the failure?

Yes. The failure requires a rolling upgrade from 1.0 to 1.1. A rolling upgrade means upgrading the cluster without interrupting service.

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

all clients. RangeSliceCommand is a frequently used command, so a client should expect to hit the failing code path easily.

 

Catastrophic? (spreadsheet column)

no, no data loss. Can be recovered.

 

2. How to reproduce this failure

Perform a rolling upgrade from Cassandra 1.0 to 1.1.

2.0 Version

1.1

2.1 Configuration

Set up a standard 1.0 node.

# of Nodes?

1

2.2 Reproduction procedure

Start a rolling upgrade from version 1.0 to 1.1, then send a range slice command.

Events column in the spreadsheet.

1. start rolling upgrade

2. send out range slice command (feature start)

 

Num triggering events

1

 

2.2.1 Timing order (Order important column)

Yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

Yes

2.3 Can the logs tell how to reproduce the failure?

yes. Logs tell us that the system is rolling forward.

2.4 How many machines needed?

1

 

3. Diagnosis procedure

Error msg?

yes

3.1 Detailed Symptom (where you start)

rangeslice command errors out

3.2 Backward inference

We are in the middle of a rolling upgrade from 1.0 to 1.1, so it is very likely that the problem was introduced by the upgrade. Indeed, if we look at the code, 1.1 does not preserve compatibility with index queries against 1.0 nodes.

 

4. Root cause

Did not preserve backwards compatibility with earlier version.

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

The fix adds a separate code path to keep nodes of the two versions compatible. The developer added a function to convert IndexScanCommand for backwards compatibility.
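
The idea in 5.1 can be illustrated with a minimal sketch. It is not the actual Cassandra patch (which is Java and is not reproduced in this report). The version constants, the field names, the verb strings, the helper names `to_index_scan` and `message_for`, and the direction of the conversion (from the newer range slice form to the older index scan form) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Hypothetical wire-protocol version numbers, chosen only for this sketch.
VERSION_10 = 3
VERSION_11 = 4


@dataclass(frozen=True)
class IndexExpression:
    """One indexed-column predicate, e.g. state EQ 'UT'."""
    column: str
    op: str
    value: str


@dataclass(frozen=True)
class IndexScanCommand:
    """Stand-in for the message shape an older node accepts for index queries."""
    keyspace: str
    column_family: str
    expressions: Tuple[IndexExpression, ...]
    start_key: str
    count: int


@dataclass(frozen=True)
class RangeSliceCommand:
    """Stand-in for the newer message; row_filter carries the index predicates."""
    keyspace: str
    column_family: str
    start_key: str
    end_key: str
    max_results: int = 100
    row_filter: Optional[Tuple[IndexExpression, ...]] = None

    def to_index_scan(self) -> IndexScanCommand:
        # Hypothetical conversion: repackage the filter in the older shape.
        if not self.row_filter:
            raise ValueError("only index queries can be converted")
        return IndexScanCommand(
            keyspace=self.keyspace,
            column_family=self.column_family,
            expressions=self.row_filter,
            start_key=self.start_key,
            count=self.max_results,
        )


def message_for(
    cmd: RangeSliceCommand, peer_version: int
) -> Tuple[str, Union[RangeSliceCommand, IndexScanCommand]]:
    """Choose verb and payload according to the version the peer speaks."""
    if peer_version < VERSION_11 and cmd.row_filter:
        # Separate code path for older peers: send the form they understand.
        return "INDEX_SCAN", cmd.to_index_scan()
    return "RANGE_SLICE", cmd
```

In this sketch, an index query sent to a peer at `VERSION_10` goes out as an `IndexScanCommand`, while the same query sent to a `VERSION_11` peer is left unchanged. Plain range slices without a filter take the normal path for both versions, so only the incompatible case pays for the conversion.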