CASSANDRA-5418

1. Symptom

When the nodetool repair command is run, the repair never completes. Checking with the nodetool netstats command shows the node's status as “frozen”.

 

Background:

nodetool repair helps Cassandra achieve eventual consistency by repairing missing or inconsistent data within the cluster.

Category:

wrong computation 

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes, java.lang.RuntimeException: java.io.EOFException, java.lang.AssertionError

 

1.2.1 Were there multiple exceptions?

yes

1.3 Was there a long propagation of the failure?

no

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single node (the affected node)

 

Catastrophic? (spreadsheet column)

no 

2. How to reproduce this failure

2.0 Version

1.2.5

2.1 Configuration

standard configuration

# of Nodes?

1

2.2 Reproduction procedure

1) add columns of size slightly less than column_index_size (file write)

2) add columns of size exactly as column_index_size (file write)

3) turn the columns into tombstones by deleting both of them (file write)

4) run a blocking flush by using forceBlockingFlush() (feature start)

5) transfer the index from the memtable to the SSTable (file write)

Note: the user actually triggered the failure simply by running nodetool repair. The failure is a rare occurrence; because the user's workload is large, only a small subset of the operations performed causes Cassandra to fail. However, the reproduction procedure above is guaranteed to trigger the bug with only 5 operations; a sketch of these steps follows this note.
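
The following is a minimal sketch of these steps against a single 1.2.5 node, not the user's actual workload. It assumes a CQL client (shown with the DataStax Java driver's Cluster/Session API), a hypothetical keyspace/table repro.t, and the default column_index_size_in_kb of 64; the payload sizes are illustrative and may need tuning so the serialized columns land exactly on an index block boundary. The flush and repair steps are issued with nodetool, since forceBlockingFlush() is an internal server-side call.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Cassandra5418Repro
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Hypothetical schema used only for this sketch.
        session.execute("CREATE KEYSPACE repro WITH replication = "
                      + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE repro.t (k text, c text, v text, PRIMARY KEY (k, c))");

        int indexSize = 64 * 1024;                    // column_index_size_in_kb default
        String almostBoundary = pad(indexSize - 16);  // 1) slightly smaller than an index block
        String exactBoundary  = pad(indexSize);       // 2) exactly the index block size

        session.execute("INSERT INTO repro.t (k, c, v) VALUES ('row1', 'c1', '" + almostBoundary + "')");
        session.execute("INSERT INTO repro.t (k, c, v) VALUES ('row1', 'c2', '" + exactBoundary + "')");

        // 3) turn both columns into tombstones
        session.execute("DELETE FROM repro.t WHERE k = 'row1' AND c = 'c1'");
        session.execute("DELETE FROM repro.t WHERE k = 'row1' AND c = 'c2'");

        // 4) + 5) flush, which writes the memtable (data and index) to an SSTable:
        //      nodetool flush repro t
        // The user-level trigger is then:
        //      nodetool repair repro
        // (driver cleanup omitted for brevity)
    }

    private static String pad(int length)
    {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
            sb.append('x');
        return sb.toString();
    }
}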

 

Num triggering events

5 

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

1

2.5 How hard is the reproduction?

Hard if the exact procedure is unknown; easy once the problem is understood.

3. Diagnosis procedure

Error msg?

yes

3.1 Detailed Symptom (where you start)

When the user runs nodetool repair on any node of the Cassandra cluster, the repair process gets stuck. The failed node’s log shows java.lang.RuntimeException: java.io.EOFException. Other nodes log java.lang.AssertionError: incorrect row data size 130921 written to /var/lib/cassandra/data/EDITED/content_list/footballsite-content_list-tmp-ib-2268-Data.db; correct is 131074. After realizing something went wrong, the user ran the nodetool netstats command to check the status; as expected, it shows the node as frozen and the repair never finishes.

3.2 Backward inference

The developer analyzed the user’s workload and found that it performs many deletes. The assertion appears to be caused by an element being written twice at a ColumnIndexer block boundary. This duplicates columns on the index block boundary during the streaming operation (part of the nodetool repair process for inter-node communication, which includes a node streaming to itself). A toy sketch of the resulting size mismatch follows.
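
Below is a toy model (assumed names, sizes, and marker width; not Cassandra's real classes or serialization format) of the accounting mismatch described above: while appending a streamed row, the pre-fix index builder re-emits the opened range-tombstone marker at each index block boundary even though the sender already included it, so the locally written size no longer matches the size announced in the stream header, which is the check behind the AssertionError.

public class IndexBoundaryDuplicationDemo
{
    static final int BLOCK_SIZE = 64 * 1024;  // mirrors the default column index block size
    static final int MARKER_SIZE = 40;        // hypothetical serialized size of a tombstone marker

    public static void main(String[] args)
    {
        // Serialized atom sizes exactly as the sending node announced them in the
        // stream header; its own boundary markers are already included.
        int[] streamedAtoms = { 66_000, MARKER_SIZE, 2_000 };
        long announced = 0;
        for (int size : streamedAtoms)
            announced += size;

        // Receiving side: append the streamed atoms while rebuilding the column index.
        long written = 0;
        long blockFill = 0;
        boolean tombstoneOpen = true;         // the streamed columns are tombstones
        for (int size : streamedAtoms)
        {
            if (blockFill >= BLOCK_SIZE)
            {
                blockFill = 0;
                // Pre-fix behaviour: re-write the opened marker at the new block
                // boundary, duplicating bytes the stream already carried.
                if (tombstoneOpen)
                    written += MARKER_SIZE;
            }
            written += size;
            blockFill += size;
        }

        // Mirrors the failing check: the locally written size differs from the streamed size.
        if (written != announced)
            throw new AssertionError("incorrect row data size " + announced
                                     + " written; correct is " + written);
    }
}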

3.3 Are the printed log sufficient for diagnosis?

yes

3.4 How hard is the diagnosis?

Hard. The user is not sure what exactly caused the failure, so the developer has to isolate the problem first and then, with domain knowledge, figure out the root cause.

 

4. Root cause

Columns are duplicated on an index block boundary when appending a row from a stream. This is a faulty operation, and Cassandra does not handle this case.

4.1 Category:

Incorrect Handling

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Avoid duplicating columns on the index block boundary when appending from a stream: write only what is received from the stream. In the patch below, the tombstone tracker is allowed to be null on the streaming path, so opened-tombstone markers are not re-written at block boundaries.

@@ -99,7 +108,7 @@ public class ColumnIndex

        public int writtenAtomCount()
        {
-            return atomCount + tombstoneTracker.writtenAtom();
+            return tombstoneTracker == null ? atomCount : atomCount + tombstoneTracker.writtenAtom();
        }

        /**
@@ -153,11 +162,11 @@ public class ColumnIndex
            {
                firstColumn = column;
                startPosition = endPosition;
-                // TODO: have that use the firstColumn as min + make sure we
-                // optimize that on read
-                endPosition += tombstoneTracker.writeOpenedMarker(firstColumn, output, atomSerializer);
+                // TODO: have that use the firstColumn as min + make sure we optimize that on read
+                if (tombstoneTracker != null)
+                    endPosition += tombstoneTracker.writeOpenedMarker(firstColumn, output, atomSerializer);
                blockSize = 0; // We don't count repeated tombstone marker in the block size, to avoid a situation
-                               // where we wouldn't make any problem because a block is filled by said marker
+                               // where we wouldn't make any progress because a block is filled by said marker
            }