CASSANDRA-3551

1. Symptom

A user upgraded from 1.0.2 to 1.0.5. Afterwards, some column families always get a TimeoutException when doing a RangeSlice with Quorum and a replication factor of 3.

 

Category (in the spreadsheet):

early termination

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes, java.util.concurrent.TimeoutException

 

1.2.1 Were there multiple exceptions?

no

 

1.3 Was there a long propagation of the failure?

no

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single file

 

Catastrophic? (spreadsheet column)

no

2. How to reproduce this failure

2.0 Version

1.0.5

2.1 Configuration

Quorum with replication factor of 3

 

# of Nodes?

3
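As a quick check on this configuration: a quorum is a majority of the replicas, so with a replication factor of 3 a Quorum request needs responses from 2 replicas. The sketch below only works through that arithmetic; the class and method names are hypothetical and are not taken from the Cassandra code base.

```java
public class QuorumSketch {
    // Number of replica responses a Quorum request must collect:
    // a strict majority of the replicas (integer division).
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int rf = 3;                 // replication factor from section 2.1
        int required = quorum(rf);  // responses needed for Quorum
        System.out.println("RF=" + rf
                + " quorum=" + required
                + " replicas that may stay silent=" + (rf - required));
    }
}
```

With 3 nodes and a replication factor of 3, every node holds a replica, so at most one of them can fail to answer before a Quorum request can no longer complete.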

2.2 Reproduction procedure

1) Upgrade cluster from 1.0.2 to 1.0.5 (feature start)

2) rangeslice a column family (file read)

Num triggering events

2 

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

3

3. Diagnosis procedure

Error msg?

yes

3.1 Detailed Symptom (where you start)

A user upgraded from 1.0.2 to 1.0.5. Afterwards, some column families always get a TimeoutException when doing a RangeSlice with Quorum and a replication factor of 3. There are no errors in the node logs and no anomalies in system monitoring (such as a sudden increase in disk latency). Only Cassandra's StorageProxy latency goes way up (hundreds of milliseconds) before the failure.

3.2 Backward inference

A closer look at the code reveals that the RowStorageProxy algorithm changed between 1.0.2 and 1.0.5. The developer did not finish implementing the new algorithm.

 

4. Root cause

When rewriting the storage proxy, the developer did not finish the implementation.

4.1 Category:

Semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Finished implementing features in StorageProxy

+            WriteResponse response = new WriteResponse(rm.getTable(), rm.key(), true);

+            Message responseMessage = WriteResponse.makeWriteResponseMessage(message, response);

+            MessagingService.instance().sendReply(responseMessage, id, message.getFrom());
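The added lines build a WriteResponse and send it back to the node the message came from. The report does not spell out how the missing code leads to the timeout, so the following is only an interpretation: if a coordinator waits for a quorum of replies and the handler on the replica never sends one, the wait can only end by timing out. The sketch below models that with plain java.util.concurrent classes. Everything in it (the class name, `awaitAcks`, the `handlerReplies` flag, the timeout values) is hypothetical and is not Cassandra's implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class MissingReplySketch {
    // Hypothetical coordinator: hands the request to every replica and blocks
    // until `required` of them acknowledge, or until the timeout expires.
    static void awaitAcks(int replicas, int required, boolean handlerReplies, long timeoutMs)
            throws TimeoutException, InterruptedException {
        CountDownLatch acks = new CountDownLatch(required);
        ExecutorService pool = Executors.newFixedThreadPool(replicas);
        try {
            for (int i = 0; i < replicas; i++) {
                pool.submit(() -> {
                    // The replica would apply the request locally here (omitted).
                    if (handlerReplies) {
                        // Stands in for sending the reply back to the coordinator.
                        acks.countDown();
                    }
                });
            }
            // Without replies the latch never reaches zero and await returns false.
            if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("received " + (required - acks.getCount())
                        + " of " + required + " required replies");
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Handler sends a reply: the quorum of 2 out of 3 is reached.
        awaitAcks(3, 2, true, 500);
        System.out.println("handler replies: request completed");

        // Handler does the work but never replies: the coordinator times out.
        try {
            awaitAcks(3, 2, false, 200);
            System.out.println("handler silent: request completed");
        } catch (TimeoutException e) {
            System.out.println("handler silent: " + e.getMessage());
        }
    }
}
```

In this model the replicas are healthy in both runs, which would be consistent with the clean node logs and normal system monitoring described in section 3.1: the only observable difference is on the waiting side.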