CASSANDRA-3156

1. Symptom

A java.lang.AssertionError is thrown when Row Repair runs.

Background Information:

Row Repair is a feature that checks for and fixes inconsistencies in row data across replicas. It works in two phases.

Optimistic phase: the coordinator reads the data from the closest replica, and a digest of the data from the other replicas.

If there is a mismatch (the optimistic read fails), we go to the repair phase:

The coordinator sends data reads to all replicas, then merges the responses and repairs the replicas that are out of date.

A "digest" query is like a read query except that instead of the receiving node actually returning the data, it only returns a digest (hash) of the would-be data.

The intent of submitting a digest query is to discover whether two or more nodes agree on what the current data is, without sending the data over the network. In particular for large amounts of data, this is a significant saving of bandwidth cost relative to sending the full data response.

Keep in mind that the cost of potentially going down to disk, and most or all of the CPU cost, associated with a query will still be taken on nodes that receive digest queries. The optimization is only for bandwidth.
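The digest comparison described above can be sketched as follows. This is a minimal illustration, not Cassandra's code: the class name, the helper names, the sample row strings, and the choice of MD5 as the hash are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestReadSketch {
    // Hypothetical helper: hash the bytes a replica would have returned for a data read.
    // MD5 is an assumption for illustration only.
    static byte[] digestOf(byte[] rowBytes) {
        try {
            return MessageDigest.getInstance("MD5").digest(rowBytes);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // True when the digest reported by another replica matches the data
    // the coordinator already received from the closest replica.
    static boolean replicasAgree(byte[] dataFromClosestReplica, byte[] digestFromOtherReplica) {
        return MessageDigest.isEqual(digestOf(dataFromClosestReplica), digestFromOtherReplica);
    }

    public static void main(String[] args) {
        byte[] closest = "row1:colA=1".getBytes(StandardCharsets.UTF_8);
        byte[] sameOnOther = "row1:colA=1".getBytes(StandardCharsets.UTF_8);
        byte[] staleOnOther = "row1:colA=0".getBytes(StandardCharsets.UTF_8);

        // Only the short digest crosses the network, not the full row.
        System.out.println(replicasAgree(closest, digestOf(sameOnOther)));  // true
        System.out.println(replicasAgree(closest, digestOf(staleOnOther))); // false, so go to the repair phase
    }
}
```

Note that the replica still has to read the row and hash it, which matches the point above that only bandwidth is saved.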

In this failure, the exception can occur even when there is no actual data inconsistency.

Category (in the spreadsheet):

wrong computation

 

1.1 Severity

Blocker

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes. java.lang.AssertionError

 

1.2.1 Were there multiple exceptions?

no

1.3 Was there a long propagation of the failure?

no

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single client

 

Catastrophic? (spreadsheet column)

no

 

2. How to reproduce this failure

2.0 Version

1.0.0

2.1 Configuration

The number of nodes must be greater than or equal to 2. The failing node must not hold a copy of the data that its local coordinator is trying to read.

 

# of Nodes?

2

2.2 Reproduction procedure

1. start row repair (feature start)

 

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

2

3. Diagnosis procedure

Error msg?

yes. java.lang.AssertionError in RowRepairResolver

3.1 Detailed Symptom (where you start)

A java.lang.AssertionError is thrown in RowRepairResolver.

3.2 Backward inference

The log shows spurious (false) digest mismatches mixed in with the genuine ones. They seem to happen when the coordinator does not have a copy of the data, so the data must come from a different node. However, when the coordinator sent a data request, it got a digest back instead (see Background Information), which is consistent with the assertion failing in RowRepairResolver. Inspecting the code then shows that the buffer was not reset before each use.

3.3 Are the printed logs sufficient for diagnosis?

yes

 

4. Root cause

The buffer was not reset before each use, so it still held arbitrary leftover data from its previous use.

4.1 Category:

Semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Reset (wipe) the buffer before each use.

            DataOutputBuffer out = threadLocalOut.get();
+            out.reset();
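
The effect of the missing reset can be reproduced outside Cassandra with a small sketch. Everything here is an assumption for illustration: java.io.ByteArrayOutputStream stands in for DataOutputBuffer, the serialize helper and its resetFirst flag are hypothetical, and the exact way stale bytes surfaced in Cassandra may differ from the simple append shown.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ReusedBufferSketch {
    // Stand-in for the thread-local buffer in the patch (ByteArrayOutputStream is an assumption).
    private static final ThreadLocal<ByteArrayOutputStream> threadLocalOut =
            ThreadLocal.withInitial(ByteArrayOutputStream::new);

    // Hypothetical serializer: writes the payload into the shared buffer and returns its contents.
    static byte[] serialize(String payload, boolean resetFirst) {
        ByteArrayOutputStream out = threadLocalOut.get();
        if (resetFirst) {
            out.reset(); // discard bytes left behind by the previous use on this thread
        }
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        serialize("digest", true); // first use leaves "digest" in the buffer

        // Without the reset, the second message carries leftovers from the first.
        String buggy = new String(serialize("data", false), StandardCharsets.UTF_8);
        System.out.println(buggy); // digestdata

        // With the reset, only the new payload is returned.
        String fixed = new String(serialize("data", true), StandardCharsets.UTF_8);
        System.out.println(fixed); // data
    }
}
```

A regression test for this class of bug could serialize two different messages back to back on the same thread and check that the second result contains only the second payload.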