CASSANDRA-5102

1. Symptom

After a rolling upgrade from Cassandra 1.1.7 to 1.2.0, the upgraded nodes can only see other 1.2.0 nodes. Nodes still running Cassandra 1.1.7 seem to disappear.

 

Category (in the spreadsheet):

wrong computation

1.1 Severity

blocker

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes, java.lang.RuntimeException: java.net.UnknownHostException

 

1.2.1 Were there multiple exceptions?

no

 

1.3 Was there a long propagation of the failure?

no

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

all nodes still on the older version of Cassandra once the rolling upgrade has started

 

Catastrophic? (spreadsheet column)

no

 

2. How to reproduce this failure

2.0 Version

1.2.0

2.1 Configuration

At least 2 nodes; perform a rolling upgrade so that the cluster contains at least one 1.1.7 node and one 1.2.0 node.

 

# of Nodes?

2

2.2 Reproduction procedure

1. Perform a rolling upgrade on 1 machine from 1.1.7 to 1.2.0 (feature start)

 

Num triggering events

1

2.2.1 Timing order (Order important column)

NA

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

2. Need to perform rolling upgrade.

2.5 How hard is the reproduction?

easy

 

3. Diagnosis procedure

Error msg?

Whether there are any error/warning messages during the failing execution. This column should start with “yes” or “no”.

3.1 Detailed Symptom (where you start)

Initially, the user has a fully functional Cassandra 1.1.7 cluster containing multiple nodes. During the rolling upgrade, some nodes have finished upgrading to 1.2.0 while others are queued to be upgraded and are still on 1.1.7. The problem is exposed when the user cannot see the nodes on the older version of Cassandra (1.1.7) from an upgraded node: running the nodetool ring command there shows only the 1.2.0 nodes as part of the cluster. The problem no longer affects the cluster once all nodes have moved to 1.2.0.

3.2 Backward inference

The problem appears to be that a node running Cassandra 1.2.0 cannot see nodes running Cassandra 1.1.7. With the developer's domain knowledge, it is clear that changes made across multiple release cycles of Cassandra caused this regression. The cause of these exceptions is CASSANDRA-4576. There, we added checks against VERSION_11 to prevent using the compatibility mode with newer nodes that didn't need it. VERSION_11 has an actual value of 4. We closed the ticket on Sept 18, and that was that.

Fast forward to November, when we closed CASSANDRA-4880. To do this, we needed a protocol version bump, and created VERSION_117, which has an actual value of 5. Unfortunately, we had used <= comparisons against VERSION_11 in CASSANDRA-4576, and we had now created a version higher than VERSION_11 that still needed the compatibility mode, so we got our original bug back.

The effect is that if you upgrade nodes on 1.1.7 or later to 1.2.0, the 1.2.0 nodes won't be able to gossip with the 1.1.7 nodes, and the 1.1.7 nodes won't be visible in ring output on a 1.2.0 node until they too are on 1.2.0. The 1.1.7 nodes will still know about the 1.2.0 node, but they won't be able to successfully gossip (communicate) with it, and will keep it marked down.
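The comparison problem described in this section can be illustrated with a minimal, self-contained sketch. It is not the Cassandra source: the class and method names are hypothetical, and only the values of VERSION_11 (4) and VERSION_117 (5) come from the analysis above. VERSION_12 = 6 is an assumption (the next protocol bump).

```java
// Hypothetical sketch of the version check; not the actual MessagingService code.
final class VersionCheckSketch {
    static final int VERSION_11 = 4;   // value stated in the analysis
    static final int VERSION_117 = 5;  // value stated in the analysis
    static final int VERSION_12 = 6;   // ASSUMPTION: next protocol version

    // Check in the style introduced for CASSANDRA-4576:
    // only versions up to VERSION_11 get the compatibility mode.
    static boolean needsCompatModeOld(int version) {
        return version <= VERSION_11;
    }

    // Check in the style of the fix: every version older than VERSION_12
    // gets the compatibility mode, including later pre-1.2 bumps.
    static boolean needsCompatModeFixed(int version) {
        return version < VERSION_12;
    }

    public static void main(String[] args) {
        int[] versions = {VERSION_11, VERSION_117, VERSION_12};
        for (int v : versions) {
            System.out.println("version " + v
                    + ": old check=" + needsCompatModeOld(v)
                    + ", fixed check=" + needsCompatModeFixed(v));
        }
    }
}
```

Under these assumptions, a peer advertising VERSION_117 is the only case where the two checks disagree: the old check treats it like a 1.2 node, which matches the failed gossip between 1.2.0 and 1.1.7 nodes.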

3.3 Are the printed logs sufficient for diagnosis?

yes

 

4. Root cause

When a new protocol version (VERSION_117) was introduced, the developers forgot to update the version compatibility check, which still compared against VERSION_11.

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes
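
One possible shape for such an automated test is sketched below. It is an illustration only, not an existing Cassandra test: the class, the version table, and the helper method are hypothetical, and VERSION_12 = 6 is an assumed value. The idea is to enumerate every known protocol version and require that each one older than the current version is classified as needing the compatibility mode, so that adding a new version constant without updating the check makes the test fail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical regression check; names and structure are illustrative.
final class CompatCheckRegressionSketch {
    static final int VERSION_11 = 4;
    static final int VERSION_117 = 5;
    static final int VERSION_12 = 6;               // ASSUMPTION: value not given in the report
    static final int CURRENT_VERSION = VERSION_12;

    // Table of every known protocol version; a new constant must be added here.
    static Map<String, Integer> knownVersions() {
        Map<String, Integer> versions = new LinkedHashMap<>();
        versions.put("VERSION_11", VERSION_11);
        versions.put("VERSION_117", VERSION_117);
        versions.put("VERSION_12", VERSION_12);
        return versions;
    }

    // Returns the names of versions that the given check classifies wrongly.
    // Expected behavior: compatibility mode exactly for versions below CURRENT_VERSION.
    static List<String> misclassified(IntPredicate needsCompatMode) {
        List<String> wrong = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : knownVersions().entrySet()) {
            boolean expected = entry.getValue() < CURRENT_VERSION;
            if (needsCompatMode.test(entry.getValue()) != expected) {
                wrong.add(entry.getKey());
            }
        }
        return wrong;
    }

    public static void main(String[] args) {
        System.out.println("old check misclassifies:   "
                + misclassified(v -> v <= VERSION_11));
        System.out.println("fixed check misclassifies: "
                + misclassified(v -> v < VERSION_12));
    }
}
```

A test harness would assert that the list returned for the production check is empty.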

5. Fix

5.1 How?

-        if (version <= MessagingService.VERSION_11)
+        if (version < MessagingService.VERSION_12)