Cassandra-3446

1. Symptom

Background:

multiget_slice: takes a list of row keys and fetches the corresponding rows from the Cassandra database. It is typically used to fetch a large number of specific rows by row key.

Similar SQL command: SELECT * FROM table WHERE Primary_Key IN (2, 32, 76, 1000, 2427)

The user has a problem with the values returned by the multiget_slice command on a super column family. After a value in the super column family is updated, reading it back returns the old value from before the update.

 

Category:

wrong computation

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

no, Cassandra returns a wrong (outdated) result without any error or warning

 

1.2.1 Were there multiple exceptions?

no

1.3 Was there a long propagation of the failure?

no 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

single file

 

Catastrophic? (spreadsheet column)

no 

2. How to reproduce this failure

2.0 Version

1.0.3

2.1 Configuration

standard configuration with 1 node

 

# of Nodes?

1

2.2 Reproduction procedure

1) Create one or more super column entries (file write)

2) Use nodetool to flush the column family (feature start)

3) Update the subcolumn values (file write)

4) Read the updated value (file read)
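
The four steps can be modelled with a toy in-memory store. This is a minimal sketch, not Cassandra code: the class `ToyStore`, its `memtable` and `sstables` fields and the `Cell` type are hypothetical names, and reconciliation is assumed to be last-write-wins by timestamp. It only shows which value a correct read is expected to return after step 4.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical, not Cassandra code) of write -> flush -> update -> read. */
public class ToyStore {
    /** A value tagged with its write timestamp. */
    static final class Cell {
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    private Map<String, Cell> memtable = new HashMap<String, Cell>();
    private final List<Map<String, Cell>> sstables = new ArrayList<Map<String, Cell>>();

    /** Steps 1 and 3: writes always go to the in-memory table. */
    void write(String subColumn, String value, long timestamp) {
        memtable.put(subColumn, new Cell(value, timestamp));
    }

    /** Step 2: a flush freezes the memtable into an immutable table. */
    void flush() {
        sstables.add(memtable);
        memtable = new HashMap<String, Cell>();
    }

    /** Step 4: a correct read checks every source and keeps the newest cell. */
    String read(String subColumn) {
        Cell newest = memtable.get(subColumn);
        for (Map<String, Cell> sstable : sstables) {
            Cell candidate = sstable.get(subColumn);
            if (candidate != null && (newest == null || candidate.timestamp > newest.timestamp)) {
                newest = candidate;
            }
        }
        return newest == null ? null : newest.value;
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.write("sub1", "old", 1L);          // 1) create the entry
        store.flush();                           // 2) flush the column family
        store.write("sub1", "new", 2L);          // 3) update the subcolumn
        System.out.println(store.read("sub1"));  // 4) a correct read yields "new"
    }
}
```

In the failure described here, step 4 behaved as if only the flushed copy existed and returned the old value.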

 

Num triggering events

4

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable?

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

1

2.5 How hard is the reproduction?

easy

3. Diagnosis procedure

Error msg?

no. There are no error messages or warnings, just a wrong result.

3.1 Detailed Symptom (where you start)

The user has a problem with multiget_slice on a super column family after its first flush. Updates to the column values work properly, but retrieving the updated values with a multiget_slice operation fails to return them. Instead, it returns the values from before the flush. The problem does not appear with standard column families.

3.2 Backward inference

There is no obvious backward inference for this failure; progress came from a developer who, drawing on domain knowledge, pointed to a similar problem. That problem was in collating data (merging it into one result) from Memtables and SSTables, and it appeared only when the query involved SuperColumns. The developer did not have a proper fix, and the proposed change was not tested for regressions or concurrency, but the lead gave other developers a direction. Two more problems related to this failure were then found. The first is that the name-based path in CollationController stops as soon as it finds one subcolumn in a given supercolumn. The second, which can hide the first, is that SuperColumn.minTimestamp is calculated incorrectly. SuperColumn.minTimestamp is used to “short-circuit” a SuperColumn read as an optimization. This causes a problem here because the number of potential subcolumns within a super column is not known without an exhaustive search, and SuperColumn.minTimestamp limits the iterative search of subcolumns.
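
The effect of stopping a name-based read too early can be illustrated with a simplified model. This is only a sketch under stated assumptions: `NameFilterSketch`, `read`, and the `shortCircuit` flag are hypothetical, timestamps are left out, and the memtable is assumed to be the newer source. It is not the CollationController code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (hypothetical names) of a name-based read over a memtable and one SSTable. */
public class NameFilterSketch {
    /** Returns the subcolumn names found for each requested super column. */
    static Map<String, Set<String>> read(Set<String> requested,
                                         Map<String, Set<String>> memtable,
                                         Map<String, Set<String>> sstable,
                                         boolean shortCircuit) {
        Map<String, Set<String>> result = new HashMap<String, Set<String>>();
        Set<String> remaining = new HashSet<String>(requested);
        collect(remaining, memtable, result);
        if (shortCircuit) {
            // Problematic idea: a super column already seen in the newer source is
            // treated as fully resolved and dropped before the older data is read.
            remaining.removeAll(result.keySet());
        }
        collect(remaining, sstable, result);
        return result;
    }

    /** Copies the subcolumns of every still-requested super column into the result. */
    private static void collect(Set<String> names, Map<String, Set<String>> source,
                                Map<String, Set<String>> result) {
        for (String name : names) {
            Set<String> subColumns = source.get(name);
            if (subColumns == null) continue;
            if (!result.containsKey(name)) result.put(name, new TreeSet<String>());
            result.get(name).addAll(subColumns);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> memtable = new HashMap<String, Set<String>>();
        memtable.put("SC1", new TreeSet<String>(Arrays.asList("a")));
        Map<String, Set<String>> sstable = new HashMap<String, Set<String>>();
        sstable.put("SC1", new TreeSet<String>(Arrays.asList("a", "b")));
        Set<String> requested = new HashSet<String>(Arrays.asList("SC1"));
        System.out.println(read(requested, memtable, sstable, true));   // subcolumn b is missed
        System.out.println(read(requested, memtable, sstable, false));  // both subcolumns found
    }
}
```

For a standard column a name is a single value, so finding it once in a newer source is enough. A super column name stands for a set of subcolumns, which is why the same shortcut loses data.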

3.3 Are the printed log sufficient for diagnosis? 

no

3.4 Are logs misleading?

no, the logs do not give us any useful information

3.5 Do we need to examine different component’s log for diagnosis?

no

3.6 Is it a multi-components failure?

yes. This failure is triggered by 3 problems in Cassandra:

two bugs during collation, and one where SuperColumn.minTimestamp prematurely terminated the iterative search for subcolumns.

3.7 How hard is the diagnosis?

Very hard. Depends on domain knowledge.

 

4. Root cause

There are 3 root causes to this failure.

1) To collate data from Memtables and SSTables, the developers used a TreeMap-backed column container (TreeMapBackedSortedColumns). Its addColumn logic was not written correctly for super columns.

2) While collating data from Memtables and SSTables, the resolving path stops as soon as it finds one subcolumn in a given super column.

3) The SuperColumn.minTimestamp value prematurely terminated the iterative search for subcolumns.

Summary: all three root causes affect the collation of super column data.
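
Root cause 1 can be reproduced in miniature with a plain `java.util.TreeMap`. The sketch below is a simplified model, not the Cassandra class: `MergeSketch`, the `putOldBack` flag and the use of bare timestamps as subcolumn contents are assumptions for illustration, and the SSTable copy is assumed to be added after the memtable copy. It relies only on the documented behaviour that `put` stores the new value and returns the one it displaced.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model (hypothetical, not the Cassandra class) of a TreeMap-backed super column container. */
public class MergeSketch {
    /** super column name -> (subcolumn name -> write timestamp). */
    final TreeMap<String, TreeMap<String, Long>> columns =
            new TreeMap<String, TreeMap<String, Long>>();

    void addColumn(String name, TreeMap<String, Long> incoming, boolean putOldBack) {
        // put() stores the incoming entry and returns the entry it displaced.
        TreeMap<String, Long> old = columns.put(name, incoming);
        if (old == null) return;
        // Merge the incoming subcolumns into the accumulated entry, newest timestamp wins.
        for (Map.Entry<String, Long> e : incoming.entrySet()) {
            Long existing = old.get(e.getKey());
            if (existing == null || e.getValue() > existing) old.put(e.getKey(), e.getValue());
        }
        // Without this step the map still holds the unmerged incoming entry.
        if (putOldBack) columns.put(name, old);
    }

    /** Adds a newer memtable copy, then an older SSTable copy, and reports the timestamp kept for "a". */
    static long timestampOfA(boolean putOldBack) {
        TreeMap<String, Long> fromMemtable = new TreeMap<String, Long>();
        fromMemtable.put("a", 2L);               // updated value
        TreeMap<String, Long> fromSSTable = new TreeMap<String, Long>();
        fromSSTable.put("a", 1L);                // value from before the flush
        fromSSTable.put("b", 1L);
        MergeSketch container = new MergeSketch();
        container.addColumn("SC1", fromMemtable, putOldBack);
        container.addColumn("SC1", fromSSTable, putOldBack);
        return container.columns.get("SC1").get("a");
    }

    public static void main(String[] args) {
        System.out.println(timestampOfA(false)); // stale timestamp survives
        System.out.println(timestampOfA(true));  // merged entry survives
    }
}
```

When the merged entry is not put back, the merge happens on an object the map no longer references, so the reader sees whichever copy was added last.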

4.1 Category:

semantic

4.2 Are there multiple faults?

yes

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

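The fix was simple:

1) TreeMap fix: fix the problem where a column in TreeMapBackedSortedColumns cannot be resolved. After the new super column is merged into the old one, the old (merged) column is put back into the map.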

   /*
    * If we find an old column that has the same name
    * then ask it to resolve itself else add the new column
    */
   public void addColumn(IColumn column, Allocator allocator)
   {
       ByteBuffer name = column.name();
       IColumn oldColumn = put(name, column);
       if (oldColumn != null)
       {
           if (oldColumn instanceof SuperColumn)
           {
               assert column instanceof SuperColumn;

+              // since oldColumn is where we've been accumulating results, it's usually going to be faster to
+              // add the new one to the old, then place old back in the Map, rather than copy the old contents
+              // into the new Map entry.
               ((SuperColumn) oldColumn).putColumn((SuperColumn)column, allocator);

+              put(name,  oldColumn);
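           }

2) Path resolution fix: the name-based path stopped as soon as it found one subcolumn in a given super column. It is now taken only for standard column families or when a specific super column name is given; all other reads collect all data.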

    public ColumnFamily getTopLevelColumns()
    {
-        return filter.filter instanceof NamesQueryFilter && cfs.metadata.getDefaultValidator() != CounterColumnType.instance
+        return filter.filter instanceof NamesQueryFilter
+               && (cfs.metadata.cfType == ColumnFamilyType.Standard || filter.path.superColumnName != null)
+               && cfs.metadata.getDefaultValidator() != CounterColumnType.instance
               ? collectTimeOrderedData()
               : collectAllData();
    }

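3) SuperColumn.minTimestamp fix: when already-resolved names are removed from the name filter, the column's timestamp() is compared with the SSTable timestamp instead of its minTimestamp(), so the search for subcolumns is not terminated prematurely.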

        {
            ByteBuffer filterColumn = iterator.next();
            IColumn column = container.getColumn(filterColumn);
-            if (column != null && column.minTimestamp() > sstableTimestamp)
+            if (column != null && column.timestamp() > sstableTimestamp)
                iterator.remove();
        }
    }
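
The condition changed by fix 2 can be restated as a standalone predicate to make the before and after behaviour easier to compare. This is a sketch only: `CollationPathSketch` and its boolean parameters are hypothetical stand-ins for the checks on `filter.filter`, `cfs.metadata.cfType`, `filter.path.superColumnName` and the default validator shown in the patch.

```java
/** Restates the patched condition of fix 2 as a predicate (hypothetical class and parameter names). */
public class CollationPathSketch {
    /** Condition before the patch: every non-counter names query took the time-ordered path. */
    static boolean oldCondition(boolean isNamesQuery, boolean isCounterCf) {
        return isNamesQuery && !isCounterCf;
    }

    /** Condition after the patch: super column families qualify only when a super column name is given. */
    static boolean newCondition(boolean isNamesQuery, boolean isStandardCf,
                                boolean superColumnNameGiven, boolean isCounterCf) {
        return isNamesQuery && (isStandardCf || superColumnNameGiven) && !isCounterCf;
    }

    public static void main(String[] args) {
        // Names query on a super column family without a super column name.
        System.out.println(oldCondition(true, false));               // time-ordered path was used
        System.out.println(newCondition(true, false, false, false)); // now falls back to collecting all data
    }
}
```

A true result corresponds to collectTimeOrderedData() and a false result to collectAllData() in the patched method.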

 

6. Scope of the failure

single super column / single file (containing super column file structure)