1. Symptom

When running the nodetool repair command, the repair never completes. Checking with the nodetool netstats command shows the status as “frozen”.



nodetool repair helps Cassandra achieve eventual consistency by repairing missing or inconsistent data within the cluster.


wrong computation 

1.1 Severity


1.2 Was there an exception thrown? (Exception column in the spreadsheet)

yes: java.lang.RuntimeException and java.lang.AssertionError


1.2.1 Were there multiple exceptions?


1.3 Was there a long propagation of the failure?



1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single node (the affected node)


Catastrophic? (spreadsheet column)


2. How to reproduce this failure

2.0 Version


2.1 Configuration

standard configuration

# of Nodes?


2.2 Reproduction procedure

1) add columns slightly smaller than column_index_size (file write)

2) add columns exactly as large as column_index_size (file write)

3) turn the columns into tombstones by deleting both columns (file write)

4) run a blocking flush using forceBlockingFlush() (feature start)

5) transfer the index from the memtable to an SSTable (file write)

Note*: the user actually triggered the failure just by running nodetool repair. This failure is a rare occurrence: because the user's workload is large, only a small subset of the operations performed causes Cassandra to fail. The reproduction procedure above, however, is guaranteed to trigger the bug with only 5 operations.


Num triggering events


2.2.1 Timing order (Order important column)


2.2.2 Events order externally controllable? (Order externally controllable? column)


2.3 Can the logs tell how to reproduce the failure?


2.4 How many machines needed?


2.5 How hard is the reproduction?

hard without knowing the exact procedure; easy once the problem is figured out.

3. Diagnosis procedure

Error msg?


3.1 Detailed Symptom (where you start)

When the user runs nodetool repair on any node of the Cassandra cluster, the repair process gets stuck. The failed node’s log shows a java.lang.RuntimeException; on other nodes, we get java.lang.AssertionError: incorrect row data size 130921 written to /var/lib/cassandra/data/EDITED/content_list/footballsite-content_list-tmp-ib-2268-Data.db; correct is 131074. After realizing something had gone wrong, the user ran the nodetool netstats command to check the status; as expected, the node shows as frozen and the repair process never finishes.

3.2 Backward inference

The developer analyzed the user’s workload and found that it performs many deletes. The assertion is caused by an element being written twice on a ColumnIndexer block boundary. This duplicates columns on the index block boundary during the streaming operation (part of the nodetool repair process, used for internode communication, including a node communicating with itself).
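The effect of that boundary duplication can be illustrated with a simplified, self-contained sketch (hypothetical names and sizes, not Cassandra’s actual serialization code): a writer that re-emits the open tombstone marker whenever a column starts a new index block produces a byte count that disagrees with the precomputed row data size, which is the kind of size mismatch the assertion above reports.

```java
// Simplified sketch (hypothetical, not Cassandra code): a column writer that
// re-emits an "open tombstone" marker at every index block boundary.
import java.util.List;

public class BoundaryDuplicationSketch {
    static final int BLOCK_SIZE = 64 * 1024; // analogous to column_index_size

    // Expected row data size: just the sum of the column sizes.
    static long expectedSize(List<Integer> columnSizes) {
        return columnSizes.stream().mapToLong(Integer::longValue).sum();
    }

    // Buggy writer: whenever a column starts a new index block while a range
    // tombstone is open, the open marker is written again (duplicated).
    static long buggyWrittenSize(List<Integer> columnSizes, int markerSize) {
        long written = 0, blockFill = 0;
        for (int size : columnSizes) {
            if (blockFill >= BLOCK_SIZE) {   // a new index block begins here
                written += markerSize;       // duplicated open-marker bytes
                blockFill = 0;
            }
            written += size;
            blockFill += size;
        }
        return written;
    }

    public static void main(String[] args) {
        // Two tombstoned columns sized at the index block threshold,
        // mirroring the reproduction steps above.
        List<Integer> sizes = List.of(BLOCK_SIZE, BLOCK_SIZE);
        long expected = expectedSize(sizes);                 // 131072
        long written = buggyWrittenSize(sizes, 153);         // 131225
        System.out.println("expected=" + expected + " written=" + written);
    }
}
```

With two columns exactly at the block threshold, the sketch’s buggy writer emits one extra marker, so the written size no longer matches the expected size; in the real failure the mismatch surfaces as the “incorrect row data size” assertion.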

3.3 Are the printed log sufficient for diagnosis?


3.4 How hard is the diagnosis?

hard. The user is not sure what exactly caused the failure, so the developer has to isolate the problem first and then, with domain knowledge, figure out its cause.


4. Root cause

Duplication of columns on an index block boundary when appending from a stream. This is a faulty operation, and Cassandra does not know how to handle the case.

4.1 Category:

Incorrect Handling

4.2 Are there multiple faults?


4.3 Can we automatically test it?


5. Fix

5.1 How?

Avoid duplicating columns on the index block boundary when appending from a stream. The goal is to write only what is received from the stream.

@@ -99,7 +108,7 @@ public class ColumnIndex

        public int writtenAtomCount()
-            return atomCount + tombstoneTracker.writtenAtom();
+            return tombstoneTracker == null ? atomCount : atomCount + tombstoneTracker.writtenAtom();

@@ -153,11 +162,11 @@ public class ColumnIndex
                firstColumn = column;
                startPosition = endPosition;
-                // TODO: have that use the firstColumn as min + make sure we
-                // optimize that on read
-                endPosition += tombstoneTracker.writeOpenedMarker(firstColumn, output, atomSerializer);
+                // TODO: have that use the firstColumn as min + make sure we optimize that on read
+                if (tombstoneTracker != null)
+                    endPosition += tombstoneTracker.writeOpenedMarker(firstColumn, output, atomSerializer);
                blockSize = 0; // We don't count repeated tombstone marker in the block size, to avoid a situation
-                               // where we wouldn't make any problem because a block is filled by said marker
+                               // where we wouldn't make any progress because a block is filled by said marker