CASSANDRA-4022

1. Symptom

Compaction of Cassandra hints can get stuck in a loop: the compaction thread repeatedly re-compacts the same data and effectively hangs (but not directly user-visible).

 

Category (in the spreadsheet):

Performance

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

No, there is no exception thrown

 

1.2.1 Were there multiple exceptions?

no

 

1.3 Was there a long propagation of the failure?

No, the problem does not have long propagation; the failure manifests immediately.

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

Single client

 

Catastrophic? (spreadsheet column)

no.

2. How to reproduce this failure

2.0 Version

1.2.0 beta 1

2.1 Configuration

Standard default configuration

 

# of Nodes?

1

2.2 Reproduction procedure

1. Start the node.

2. Force the compaction check in ACS to kick off.

3. Observe the compaction of the hints looping: only the last SSTable is compacted, over and over.

 

Num triggering events

1

2.2.1 Timing order (Order important column)

Yes, need to trigger compaction of hints

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

no

2.4 How many machines needed?

1

3. Diagnosis procedure

Error msg?

Yes

3.1 Detailed Symptom (where you start)

In the logs, we first see the same hints SSTable being compacted repeatedly:

INFO 17:41:35,682 Compacting [SSTableReader(path='/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-339-Data.db')]

INFO 17:41:36,430 Compacted to [/var/lib/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-hd-340-Data.db,].  4,637,160 to 4,637,160 (~100% of original) bytes for 1 keys at 5.912220MB/s.  Time: 748ms.

3.2 Backward inference

Looking at the logs, we can see that the node is not handing any hints off, so everything in these SSTables must be tombstones. The feature introduced in CASSANDRA-3442 was believed to have caused the issue: it performs a single-SSTable compaction whenever that SSTable's droppable-tombstone ratio is above a threshold. However, there is a special case in which compaction cannot drop a tombstone: when the key the tombstone belongs to also appears in another SSTable. In that case the compaction rewrites the SSTable essentially unchanged (note the "4,637,160 to 4,637,160 (~100% of original)" line above), the ratio stays above the threshold, and the check fires again, producing the loop.
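The loop described above can be modeled with a minimal sketch. All names here are illustrative, not Cassandra's real API: the point is that the pre-fix check looks only at the droppable-tombstone ratio, while overlapping keys in other SSTables prevent the compaction from ever lowering that ratio.

```java
// Hypothetical simplified model of the buggy behavior (illustrative names only).
public class TombstoneLoopDemo {
    static final double THRESHOLD = 0.2;   // assumed tombstone-ratio threshold

    // Buggy check: the ratio alone decides whether to re-compact.
    static boolean shouldCompact(double droppableRatio) {
        return droppableRatio > THRESHOLD;
    }

    public static void main(String[] args) {
        double ratio = 1.0;   // hints SSTable where every row is a tombstone
        int rounds = 0;
        // Because the keys also appear in another SSTable, compaction cannot
        // drop anything: the output has the same ratio as the input, so the
        // check keeps firing. We cap the loop at 5 rounds for demonstration;
        // the real bug loops indefinitely (339 -> 340 -> 341 -> ...).
        while (shouldCompact(ratio) && rounds < 5) {
            rounds++;         // compact, write out the same bytes, repeat
        }
        System.out.println("rounds: " + rounds);
    }
}
```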

3.3 Are the printed log sufficient for diagnosis?

no

 

4. Root cause

4.1 Category:

Semantic

4.2 Are there multiple fault?

no

4.2 Can we automatically test it?

yes

5. Fix

5.1 How?

The fix changed the condition under which an SSTable is selected for tombstone-driven compaction. The old method performed a single-SSTable compaction whenever the SSTable's droppable-tombstone ratio was above the threshold, ignoring the case where compaction cannot actually drop the tombstones. Under the new method, the compaction is only triggered when the keys the tombstones belong to do not appear in other, non-compacting SSTables, so the compaction is guaranteed to make progress.
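The repaired condition can be sketched as follows. This is a simplified illustration, not Cassandra's actual implementation: the key ranges, the `Range` type, and the method names are assumptions standing in for the real overlap check between the candidate SSTable and the non-compacting SSTables.

```java
import java.util.List;

// Hypothetical sketch of the fixed selection logic (illustrative names only):
// only schedule a single-SSTable tombstone compaction when the candidate does
// not overlap any SSTable outside the compaction, so its tombstones are
// guaranteed to actually be droppable and the compaction makes progress.
public class TombstoneCompactionCheck {
    static final double THRESHOLD = 0.2;   // assumed tombstone-ratio threshold

    // Simplified stand-in for an SSTable's key range.
    record Range(long first, long last) {
        boolean overlaps(Range o) { return first <= o.last && o.first <= last; }
    }

    static boolean worthDroppingTombstones(double droppableRatio,
                                           Range candidate,
                                           List<Range> otherSSTables) {
        if (droppableRatio <= THRESHOLD)
            return false;                  // not enough droppable tombstones
        for (Range other : otherSSTables)
            if (candidate.overlaps(other))
                return false;              // keys may live elsewhere: purging would fail
        return true;                       // safe to compact; tombstones will drop
    }
}
```

With this check, the looping scenario above (ratio 1.0 but keys overlapping another SSTable) no longer triggers a compaction, while an isolated tombstone-heavy SSTable still does.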