CASSANDRA-3289

1. Symptom

When flushing data in the memtable to SSTable, the stored column is in random form (Wrong output).

 

Category (in the spreadsheet):

wrong computation

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes,  java.lang.AssertionError: Added column does not sort as the last column

 

1.2.1 Were there multiple exceptions?

no

 

1.3 Was there a long propagation of the failure?

no

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single file

 

Catastrophic? (spreadsheet column)

no

 

2. How to reproduce this failure

2.0 Version

1.0.0

2.1 Configuration

Keyspace: testks:

Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy

Durable Writes: true

Options: [168:1]

Column Families:

ColumnFamily: testcf

Key Validation Class: org.apache.cassandra.db.marshal.BytesType

Default column value validator: org.apache.cassandra.db.marshal.BytesType

Columns sorted by: org.apache.cassandra.db.marshal.BytesType

Row cache size / save period in seconds / keys to save : 0.0/0/all

Key cache size / save period in seconds: 200000.0/14400

GC grace seconds: 864000

Compaction min/max thresholds: 4/32

Read repair chance: 1.0

Replicate on write: true

Built indexes: []

Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy

 

# of Nodes?

1

2.2 Reproduction procedure

1) write the below file to cassandra (file write)

=> (column=a, value=38383838383838383838, timestamp=1317421952793000)

=> (column=d, value=617364646661736466, timestamp=1317420968944000)

=> (column=e, value=38383838383838383838, timestamp=1317421096152000)

=> (column=f, value=33343334333433343334, timestamp=1317422838818000)

=> (column=g, value=33343334333433343334, timestamp=1317422565130000)

=> (column=i, value=33343334333433343334, timestamp=1317422879258000)

=> (column=j, value=33343334333433343334, timestamp=1317422814873000)

=> (column=o, value=33343334333433343334, timestamp=1317422867106000)

=> (column=x, value=33343334333433343334, timestamp=1317422394097000)

=> (column=z, value=38383838383838383838, timestamp=1317421982057000)

2) flush data such that it happens to flush to into multiple SSTables (feature start)

3) query the flushed result with for all columns (file read)

 

Num triggering events

3

 

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

no, multi-thread, data-race

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

1

2.5 How hard is the reproduction?

hard. Cannot control the exact timing

3. Diagnosis procedure

Error msg?

yes

3.1 Detailed Symptom (where you start)

The user has the data below already in cassandra stored in memtable:

=> (column=a, value=38383838383838383838, timestamp=1317421952793000)

=> (column=d, value=617364646661736466, timestamp=1317420968944000)

=> (column=e, value=38383838383838383838, timestamp=1317421096152000)

=> (column=f, value=33343334333433343334, timestamp=1317422838818000)

=> (column=g, value=33343334333433343334, timestamp=1317422565130000)

=> (column=i, value=33343334333433343334, timestamp=1317422879258000)

=> (column=j, value=33343334333433343334, timestamp=1317422814873000)

=> (column=o, value=33343334333433343334, timestamp=1317422867106000)

=> (column=x, value=33343334333433343334, timestamp=1317422394097000)

=> (column=z, value=38383838383838383838, timestamp=1317421982057000)

The user then flushes these data to SSTable. It just happend that the data cannot fit inside the current SSTable. Cassandra flushes portions of data to the current table and at the same time flushes the other portion to new SSTable.  

The user then verify the flushed result by reading from SSTable. The received result is below:

 /var/lib/cassandra/data/testks/testcf-h-10-Data.db

{ "61": [["0c","76",1317405903119000], ["0d","76",1317405977002000], ["7a","38383838383838383838",1317422276322000]], "62": [["61","38383838383838383838",1317421952793000], ["63","4e864303",1317421827329000,"d"], ["64","617364646661736466",1317420968944000], ["65","38383838383838383838",1317421096152000], ["67","33343334333433343334",1317422565130000], ["78","33343334333433343334",1317422394097000], ["7a","38383838383838383838",1317421982057000]] }

/var/lib/cassandra/data/testks/testcf-h-12-Data.db

{ "62": [["6a","33343334333433343334",1317422814873000]] }

/var/lib/cassandra/data/testks/testcf-h-13-Data.db

{ "62": [["66","33343334333433343334",1317422838818000]] }

/var/lib/cassandra/data/testks/testcf-h-14-Data.db

{ "62": [["6f","33343334333433343334",1317422867106000]] }

/var/lib/cassandra/data/testks/testcf-h-15-Data.db

{ "62": [["69","33343334333433343334",1317422879258000]] }

3.2 Backward inference

After taking at the incorrect output, we can see the result is not totally random. The result seems to be interleaved with some other portion of other files. Looks like a race condition for concurrent operation. A close examination of code reveals the problem at ColumnFamilyStoreMBean function. The get TOP level procedure is not threadsafe.

3.3 Are the printed log sufficient for diagnosis?

yes

3.4 Are logs misleading?

no

3.5 Do we need to examine different component’s log for diagnosis?

no

3.6 Is it a multi-components failure?

no

3.7 How hard is the diagnosis?

medium

 

4. Root cause

Non-thread safe procedure when running “getTOpLevelColumns” function.

4.1 Category:

concurrency bug

4.2 Are there multiple fault?

no

4.2 Can we automatically test it?

yes

5. Fix

5.1 How?

Changed the procedure to threadsafe to prevent race condition and guarantee atomic operation

@@ -1125,7 +1125,7 @@ public class ColumnFamilyStore implements ColumnFamilyStoreMBean
        if ((cached = rowCache.get(key)) == null)
        {
            // We force ThreadSafeSortedColumns because cached row will be accessed concurrently
-            cached = getTopLevelColumns(QueryFilter.getIdentityFilter(key, new QueryPath(columnFamily)), Integer.MIN_VALUE, ThreadSafeSortedColumns.factory());
+            cached = getTopLevelColumns(QueryFilter.getIdentityFilter(key, new QueryPath(columnFamily)), Integer.MIN_VALUE, true);
            if (cached == null)
                return null;

@@ -184,6 +187,9 @@ public class CollationController
    private ColumnFamily collectAllData()
    {
        logger.debug("collectAllData");
+        ISortedColumns.Factory factory = mutableColumns
+                                       ? ThreadSafeSortedColumns.factory()
+                                       : ArrayBackedSortedColumns.factory();

5.2 Exception behavior?

no more exception

5.3 # of discussion threads?

7

6.Any interesting findings?

no

 

7. Scope of the failure

single file