CASSANDRA-5391

1. Symptom

Background:

The new Cassandra release features Snappy compression to save storage space in the datacenter, and SSL encryption for inter-datacenter communication.

The symptom is:

The receiving end is unable to decompress the content of incoming TCP connections from inter-datacenter communication. The backup datacenter does not receive valid data; instead it receives junk as the replicated content. As a result, if the active datacenter goes down, the backup datacenter cannot serve requests.


Category (in the spreadsheet):

early termination

1.1 Severity

blocker

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes.

javax.net.ssl.SSLException: bad record MAC

java.io.IOException: FAILED_TO_UNCOMPRESS

java.io.IOException: CRC unmatched

1.2.1 Were there multiple exceptions?

yes


1.3 Was there a long propagation of the failure?

no


1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

all affected interconnected datacenters


Catastrophic? (spreadsheet column)

no, because there is no data loss


2. How to reproduce this failure

2.0 Version

1.2.4

2.1 Configuration

3 nodes in each datacenter. All nodes are configured with Snappy compression, and all inter-datacenter communication is encrypted with SSL. One datacenter must be hosted on AWS and the other on Rackspace.
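
For illustration, a minimal sketch of the relevant cassandra.yaml settings (1.2-era option names; keystore paths and passwords are placeholders):

internode_compression: dc            # compress inter-datacenter traffic
server_encryption_options:
    internode_encryption: dc         # SSL-encrypt inter-datacenter traffic
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra

SSTable compression itself is set per table, e.g. WITH compression = {'sstable_compression': 'SnappyCompressor'} in CQL.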


# of Nodes?

6

2.2 Reproduction procedure

1) Switch on SSTable compression (config change)

2) Start the nodetool rebuild command on the AWS nodes (feature start); an example invocation is shown below.
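
For step 2, the rebuild is started on each AWS node, naming the Rackspace datacenter (as defined by the snitch) as the streaming source; the host and datacenter name below are placeholders:

nodetool -h <aws_node_ip> rebuild <rackspace_dc_name>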


Num triggering events

2


2.2.1 Timing order (Order important column)

Yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

no (data race)

2.3 Can the logs tell how to reproduce the failure?

no

2.4 How many machines needed?

2 (one node in each datacenter) is the minimum requirement


3. Diagnosis procedure

Error msg?

yes

3.1 Detailed Symptom (where you start)

When running the nodetool rebuild command on the AWS nodes to stream data from Rackspace, we get SSL errors and Snappy decompression errors. The setup is simple: 3 nodes in AWS East, 3 nodes in Rackspace.

3.2 Backward inference

Packet-level inspection revealed malformed packets at both ends of the communication, so we eliminated the network itself as the cause; intuitively, the problem occurred on the machine where the packets were generated. Further investigation found that the problem only happens when the inter-datacenter bandwidth is throttled to 1 Mbps, which pointed us toward a race condition. After tracing Cassandra's execution, we located the race in the code that streams compressed SSTables to the remote datacenter. More detailed analysis showed that CompressedFileStreamTask does not send the right section of the compressed SSTable when internode encryption is used, which causes the various IOExceptions observed in the symptoms.
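
One possible way to recreate the 1 Mbps condition in a test environment (our suggestion for illustration; the report does not say how the link was throttled) is Linux traffic shaping on the inter-datacenter interface:

tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms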

3.3 Are the printed log sufficient for diagnosis?

yes

4. Root cause
Race condition causing CompressedFileStreamTask to send the wrong section of the compressed SSTable when internode encryption is used.

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Fixed the race condition so that CompressedFileStreamTask sends the right section of the SSTable to the remote datacenter. When internode encryption is enabled, no socket channel is available, so the file must be explicitly positioned at the start of each section before reading:

-                file.seek(section.left);
+                // seek to the beginning of the section when socket channel is not available
+                if (sc == null)
+                    file.seek(section.left);
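
To make the fix concrete, the following is a simplified, self-contained sketch of the sending loop (our reconstruction for illustration, not the actual Cassandra source; SectionStreamer, sendSection, and CHUNK_SIZE are hypothetical names). With internode encryption the SSL socket cannot expose a SocketChannel for zero-copy transfer, so the code falls back to reading the file at its current position and writing through the encrypting stream; without the seek, it reads from whatever offset the file was left at and streams the wrong bytes:

import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.channels.SocketChannel;

// Hypothetical, simplified model of the streaming logic in CompressedFileStreamTask.
final class SectionStreamer
{
    private static final int CHUNK_SIZE = 64 * 1024;

    // Sends bytes [sectionLeft, sectionRight) of 'file' to the peer.
    // 'sc' is null when internode encryption is enabled, because the SSL
    // socket does not expose a SocketChannel for zero-copy transfer.
    static void sendSection(RandomAccessFile file, SocketChannel sc, OutputStream out,
                            long sectionLeft, long sectionRight) throws IOException
    {
        long length = sectionRight - sectionLeft;

        // the fix: seek to the beginning of the section when the socket
        // channel is not available; the SSL path below reads from the
        // file's current position, not from an absolute offset
        if (sc == null)
            file.seek(sectionLeft);

        byte[] buffer = new byte[CHUNK_SIZE];
        long transferred = 0;
        while (transferred < length)
        {
            int toTransfer = (int) Math.min(CHUNK_SIZE, length - transferred);
            if (sc != null)
            {
                // zero-copy path: transferTo takes an absolute position
                // on every call, so no seek is ever needed here
                transferred += file.getChannel().transferTo(sectionLeft + transferred, toTransfer, sc);
            }
            else
            {
                // SSL path: read at the current file position and write through
                // the encrypting stream; without the seek above, the receiver
                // gets bytes from the wrong offset and fails to decompress them
                // (FAILED_TO_UNCOMPRESS, CRC unmatched)
                file.readFully(buffer, 0, toTransfer);
                out.write(buffer, 0, toTransfer);
                transferred += toTransfer;
            }
        }
    }
}

The unencrypted path never needed the seek because transferTo addresses the file by absolute position; only the SSL fallback depends on the file's implicit cursor.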