CASSANDRA-4782

1. Symptom

Commitlog not replayed after restart. Either after a reboot of a server + restart of cassandra (1.1.0 to 1.1.4) or After doing an upgrade from cassandra 1.1.X to cassandra 1.1.5

Category (in the spreadsheet):

data loss

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

Yes. "skipping replay of fully-flushed {}"

 

1.2.1 Were there multiple exceptions?

no

1.3 Was there a long propagation of the failure?

no. Only a restart is required.

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

entire fs

 

Catastrophic? (spreadsheet column)

no

2. How to reproduce this failure

2.0 Version

1.1.6

2.1 Configuration

No special configuration required

 

# of Nodes?

1

2.2 Reproduction procedure

There are two ways to reproduce this problem. Both ways will change the “base/reference” time which is needed to calculate the timestamp.

For Cassandra version 1.1.0-1.1.4: reboot server and restart cassandra

For Cassandra 1.1.x upgrading to 1.1.5: upgrading process will trigger this problem

Events:

restart/upgrade (restart)

 

Num triggering events

1

 

2.2.1 Timing order (Order important column)

NA

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

1

3. Diagnosis procedure

Error msg?

yes ("skipping replay of fully-flushed {}")

3.1 Detailed Symptom (where you start)

Saw "skipping replay of fully-flushed {}" message even though we know there should be uncommitted data in the commit log.

3.2 Backward inference

We look back at the code. We hit the skipping replay branch if the commit log timestamp is greater than the required replay point (thus theoretically do not need to be replay). Timestamp(System.nanoTime()) is comprised of base/reference +  offset. System.nanoTime() is using the boot time as the base/reference. Thus with a reboot, there can be cases where System.nanoTime() can return a lower number than before reboot).

Note: According to Javadoc,  the reference is a relative point in time

4. Root cause

Because System.nanoTime() is using the boot time as the base/reference, thus with a reboot, there can be cases where System.nanoTime() can return a lower number than before reboot).

4.1 Category:

Semantic

4.2 Are there multiple fault?

no

4.2 Can we automatically test it?

yes

5. Fix

5.1 How?

Added a code snippet to detect if its running old version of Cassandra. If it is, then set the replayPosition to the beginning.

if (!desc.metadataIncludesModernReplayPosition)
{          

// replay position may be "from the future" thanks to older versions generating them with nanotime.
// make sure we don't omit replaying something that we should.  see CASSANDRA-4782
replayPosition = ReplayPosition.NONE

}

6.Any interesting findings?

Surprising how a such primitive function can cause such disastrous failure.