CASSANDRA-3482

1. Symptom

When flushing a Memtable to disk as an SSTable, we hit a flush assertion error. Note that this only occurs for the SSTable used for "eagerly rewrite", which holds a redundant copy of data the user already has in other SSTables. Therefore, this error does not result in any data loss.

 

Background:

Eagerly rewrite (also called supersede or defragmentation) is an optimization to improve read performance. If a read request has to touch more than a threshold number of SSTables, we eagerly rewrite the merged data into a new SSTable so that the next read can access it directly. In the normal case, once this eagerly-rewritten SSTable is safely stored to disk, the original SSTables are invalidated. In this case, since the store operation did not finish successfully, the old data is never invalidated; thus there is no data loss.
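
A minimal toy sketch of the mechanism in plain Java (all names here, the map-based sstable fragments, THRESHOLD, and EagerRewriteSketch, are illustrative stand-ins, not Cassandra's actual code path):

import java.util.*;

// Toy model of "eagerly rewrite" (defragmentation); all names illustrative.
public class EagerRewriteSketch {
    static final int THRESHOLD = 2;

    public static void main(String[] args) {
        // three on-disk fragments of the same row (stand-ins for sstables)
        List<Map<String, String>> sstableFragments = List.of(
                Map.of("colA", "1"), Map.of("colB", "2"), Map.of("colC", "3"));
        Map<String, Map<String, String>> memtable = new HashMap<>();

        // read path: merge every fragment of the requested row
        Map<String, String> returnCF = new TreeMap<>();
        for (Map<String, String> fragment : sstableFragments)
            returnCF.putAll(fragment);

        // fragmented across too many sstables: eagerly rewrite the merged
        // row, so that once the memtable is flushed the next read touches
        // a single new sstable and the old fragments can be superseded
        if (sstableFragments.size() > THRESHOLD)
            memtable.put("row1", returnCF);  // the *same* object is also
                                             // returned to the caller
        System.out.println("merged row: " + returnCF);
    }
}

Note that the merged returnCF ends up both inside the memtable and in the caller's hands; this sharing is exactly what the rest of this report turns on.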

Category (in the spreadsheet):

early termination

 

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

Yes. Only the assertion message; no other unexpected log messages.

INFO [FlushWriter:9] 2011-11-10 13:08:58,882 Memtable.java (line 237) Writing Memtable-UserData@1388955390(25676955/430716097 serialized/live bytes, 478913 ops)
ERROR [FlushWriter:9] 2011-11-10 13:08:59,513 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[FlushWriter:9,5,main]
java.lang.AssertionError: CF size changed during serialization: was 4 initially but 3 written
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:94)
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:112)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:177)
    at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:264)
    at org.apache.cassandra.db.Memtable.access$400(Memtable.java:47)
    at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:289)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
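
The top frame points at ColumnFamilySerializer.serializeForSSTable. Roughly paraphrased (this is not the verbatim 1.0.3 source; bloom-filter and index bookkeeping are omitted, and serializeColumn stands in for the per-column write), the failing check works like this:

int count = columnFamily.getColumnCount();  // size snapshot, written to disk first
dos.writeInt(count);
int written = 0;
for (IColumn column : columnFamily.getSortedColumns()) {
    serializeColumn(column, dos);           // per-column write, details elided
    written++;
}
// If another thread added or removed a column between the snapshot and the
// loop, the count already written to disk no longer describes the data:
assert count == written
    : "CF size changed during serialization: was " + count
      + " initially but " + written + " written";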

1.2.1 Were there multiple exceptions?

no

1.3 Was there a long propagation of the failure?

no

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single file

 

Catastrophic? (spreadsheet column)

no

2. How to reproduce this failure

2.0 Version

1.0.3

2.1 Configuration

Standard configuration with "eagerly rewrite" (supersede) enabled. A supersede operation must then be performed.

# of Nodes?

1

2.2 Reproduction procedure

1. Enable the eagerly-rewrite feature (config change)

2. Read data that is spread across multiple SSTables, to trigger the eager rewrite (file read)

3. Flush the memtable generated by the eager rewrite to an SSTable (feature start)

Num triggering events

3

 

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

1

3. Diagnosis procedure

Error msg?

yes

3.1 Detailed Symptom (where you start)

When flushing a Memtable to disk as an SSTable, we hit a flush assertion error.

3.2 Backward inference

We take a closer look at the assertion: java.lang.AssertionError: CF size changed during serialization: was 4 initially but 3 written. What could change the column family size in the middle of serialization? A bit of domain knowledge from the developers points to eagerly rewrite: during an eagerly-rewrite operation, the column family is modified, so its size can change. But why does that trip the assertion during flush? We inspected the code that flushes the memtable to disk and traced back to the eagerly-rewrite path, where RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF)) passes returnCF directly. This is incorrect: the same returnCF object is both returned to the read path and inserted into the memtable, so a new/cloned ColumnFamily must be passed instead.
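
The hypothesis is easy to demonstrate outside Cassandra. Below is a self-contained toy analogue (not Cassandra code; run with java -ea so the assert fires): one thread snapshots the size of a shared "column family" and then walks its columns, as the serializer does, while a second thread, standing in for the read path, removes a column from the same object:

import java.util.Map;
import java.util.concurrent.*;

public class SharedCfRace {
    public static void main(String[] args) throws Exception {
        Map<String, String> returnCF = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 4; i++) returnCF.put("col" + i, "v" + i);

        ExecutorService flushWriter = Executors.newSingleThreadExecutor();
        Future<?> flush = flushWriter.submit(() -> {
            int count = returnCF.size();   // size snapshot, as the serializer does
            int written = 0;
            for (String col : returnCF.keySet()) {
                written++;                 // "serialize" one column
                sleep(10);                 // slow the walk to widen the race window
            }
            assert count == written        // the check that fails in the log
                : "CF size changed during serialization: was " + count
                  + " initially but " + written + " written";
        });

        sleep(15);
        returnCF.remove("col3");  // read path mutates the shared CF mid-flush
        flush.get();              // surfaces the AssertionError (wrapped in ExecutionException)
        flushWriter.shutdown();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { }
    }
}

With this timing the removal lands mid-iteration, and the task fails with "CF size changed during serialization: was 4 initially but 3 written", mirroring the production log (being a race, the exact interleaving is not guaranteed on every run).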

 

4. Root cause

When the eagerly-rewrite path inserted the merged row into the memtable, it reused the same column family object that was returned to the read path, instead of a new (cloned) column family. The read path could therefore still modify that shared object, changing its column count while the flush was serializing it to an SSTable.

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

The fix changes the parameter from returnCF to returnCF.cloneMe(). This makes the RowMutation carry a new/cloned column family, so the object inserted into the memtable is decoupled from the one returned to the read path.

-                RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF));
+                RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
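
In the toy model above, the equivalent fix is to hand the memtable its own copy of the merged row, with new TreeMap<>(returnCF) playing the role of returnCF.cloneMe():

// Analogue of the fix in the toy model: the memtable gets a private copy,
// so later mutations by the read path can no longer race with the flush.
memtable.put("row1", new TreeMap<>(returnCF));

After the fix, the object returned to the read path and the object sitting in the memtable are no longer the same instance, so the serializer's size snapshot stays valid for the duration of the flush.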