CASSANDRA-4314

1. Symptom

Some nodes crash with an OutOfMemoryError (OOM). The user was only updating (writing to) columns that were indexed.

 

Category (in the spreadsheet):

resource leak

 

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

Yes. OOM.

 

1.2.1 Were there multiple exceptions?

no

 

1.3 Was there a long propagation of the failure?

yes

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

Clients who are using those nodes.

 

Catastrophic? (spreadsheet column)

no

2. How to reproduce this failure

2.0 Version

1.0.10

2.1 Configuration

basic configuration

 

# of Nodes?

1

2.2 Reproduction procedure

start cassandra

Events column in the spreadsheet.

1. Create an indexed column (feature start)

2. Issue many updates to that column (file write)

Eventually the node runs out of memory (OOM).

Num triggering events

2

 

2.2.1 Timing order (Order important column)

In this order

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

Yes.  INFO [FlushWriter:1] 2012-06-07 10:52:09,078 Memtable.java (line 246) Writing Memtable-LocationInfo@91455740(29/36 serialized/live bytes, 1 ops)

2.4 How many machines needed?

1

3. Diagnosis procedure

Error msg?

OOM. A warning message is even logged before the final OOM:

WARN [ScheduledTasks:1] 2012-06-07 12:09:40,671 GCInspector.java (line 145) Heap is 0.9158073167593992 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically

3.1 Detailed Symptom (where you start)

The database (now at 1.0.10) is in a state in which it runs out of memory with hardly any activity at all: a single key slice query, nothing more. The client then times out.

3.2 Backward inference

An out-of-memory problem was found. The pattern shows up only during compaction of tombstones. The user was also doing many overwrites of indexed columns, which generate “deletes” in the index column family.

 

4. Root cause

Each write to an indexed column is translated internally into a delete followed by a create (the storage is log-structured). Compaction is supposed to remove the deleted data (called tombstones). However, in the buggy version, compaction removes a tombstone only if it is older than gc_grace_seconds, even on a purely local table, and this value is 10 days by default. Compaction therefore has no effect on these tombstones, and they keep accumulating.
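The purge rule described above can be illustrated with a small, self-contained sketch. This is a simplified model and not Cassandra's actual compaction code: the class TombstoneGcSketch and the methods isPurgeable and compact are hypothetical names, and each tombstone is reduced to its deletion time in seconds. It only shows why a 10-day grace period keeps every recent tombstone, while a grace period of 0 lets compaction drop them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of tombstone purging during compaction.
// None of these names come from the Cassandra code base.
public class TombstoneGcSketch {
    // Default gc_grace_seconds: 10 days, expressed in seconds.
    static final int DEFAULT_GC_GRACE_SECONDS = 10 * 24 * 60 * 60;

    // A tombstone may be dropped only once it is older than the grace period.
    static boolean isPurgeable(int deletionTimeSeconds, int nowSeconds, int gcGraceSeconds) {
        return deletionTimeSeconds < nowSeconds - gcGraceSeconds;
    }

    // Simulates one compaction pass and returns the tombstones that survive it.
    static List<Integer> compact(List<Integer> deletionTimes, int nowSeconds, int gcGraceSeconds) {
        List<Integer> survivors = new ArrayList<>();
        for (int deletionTime : deletionTimes) {
            if (!isPurgeable(deletionTime, nowSeconds, gcGraceSeconds)) {
                survivors.add(deletionTime);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // Assume one indexed-column overwrite per second for an hour,
        // each leaving one tombstone in the index.
        List<Integer> tombstones = new ArrayList<>();
        for (int second = 0; second < 3600; second++) {
            tombstones.add(second);
        }
        int now = 3600;

        // Inherited 10-day grace period: every tombstone survives compaction.
        System.out.println(compact(tombstones, now, DEFAULT_GC_GRACE_SECONDS).size());
        // Grace period of 0: every tombstone is dropped.
        System.out.println(compact(tombstones, now, 0).size());
    }
}
```

Under these assumptions, the first call keeps all 3600 tombstones and the second keeps none, which matches the behavior described for the inherited grace period versus a grace period of 0.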

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Overrode the garbage collection grace period to 0 for purely local tables instead of inheriting the value from the parent. This way, tombstones can be removed as soon as compaction runs.

--- a/src/java/org/apache/cassandra/config/CFMetaData.java
+++ b/src/java/org/apache/cassandra/config/CFMetaData.java
@@ -251,7 +251,7 @@ public final class CFMetaData
                             .keyValidator(info.getValidator())
                             .keyCacheSize(0.0)
                             .readRepairChance(0.0)
-                             .gcGraceSeconds(parent.gcGraceSeconds)
+                             .gcGraceSeconds(0)
                             .minCompactionThreshold(parent.minCompactionThreshold)
                             .maxCompactionThreshold(parent.maxCompactionThreshold);
    }