CASSANDRA-6181

1. Symptom

Replaying the commit log during cluster start-up caused a java.lang.StackOverflowError and crashed the node.

 

Category (in the spreadsheet):

early termination

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes (java.lang.StackOverflowError)

 

1.2.1 Were there multiple exceptions?

no

1.3 Was there a long propagation of the failure?

no

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

entire fs

 

Catastrophic? (spreadsheet column)

no 

2. How to reproduce this failure

2.0 Version

1.2.8

2.1 Configuration

Standard configuration

 

# of Nodes?

1

2.2 Reproduction procedure

1. Perform a lot of updates or deletes, in order to generate a lot of tombstones (file write)

2. Restart the server. (restart)

 

2.2.1 Timing order (Order important column)

NA

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

1

3. Diagnosis procedure

Error msg?

yes (java.lang.StackOverflowError, followed by a node crash)

3.1 Detailed Symptom (where you start)

The node threw java.lang.StackOverflowError and crashed.

3.2 Backward inference

Inspecting the function that iterates over all of the range tombstones showed that it is recursive. The problem appears when the number of tombstones is large enough that the recursion depth exhausts the memory allocated for the stack.
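
The failure mode can be illustrated with a minimal sketch. This is not Cassandra's code: the class name, the method names, and the use of a plain long[] of deletion timestamps as a stand-in for range tombstones are all assumptions made for illustration. It only shows that a walk which recurses once per element needs stack depth proportional to the input size.

```java
// Hypothetical illustration only; none of these names come from Cassandra.
class RecursiveWalkSketch {

    // Recursively finds the largest timestamp from index i onward.
    // Each element costs one stack frame, so depth grows with input size.
    static long maxTimestampRecursive(long[] timestamps, int i) {
        if (i == timestamps.length) {
            return Long.MIN_VALUE; // base case: nothing left to visit
        }
        return Math.max(timestamps[i], maxTimestampRecursive(timestamps, i + 1));
    }

    // Returns true if the recursive walk over n elements overflows the stack.
    static boolean overflows(int n) {
        long[] timestamps = new long[n];
        try {
            maxTimestampRecursive(timestamps, 0);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("100 elements overflow: " + overflows(100));
        System.out.println("2,000,000 elements overflow: " + overflows(2_000_000));
    }
}
```

With a default-sized JVM thread stack, the small input should complete and the large one should overflow. The exact threshold depends on the stack size (for example, the -Xss setting) and on the frame size, so the element counts above are illustrative rather than measured.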

 

4. Root cause

The stack runs out of memory because a recursive algorithm is used to process a large number of range tombstones.

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Rewrote the algorithm in a non-recursive (iterative) form.
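
As a rough sketch of the general technique, and not the actual patch, a walk that would otherwise recurse once per element can be written as a loop, so the stack depth stays constant regardless of how many elements are processed. The class name, the method name, and the long[] of timestamps standing in for range tombstones are hypothetical.

```java
// Hypothetical illustration only; this is not the CASSANDRA-6181 patch.
class IterativeWalkSketch {

    // Finds the largest timestamp with a loop instead of recursion.
    // Stack usage is constant; only the loop variable advances.
    static long maxTimestampIterative(long[] timestamps) {
        long max = Long.MIN_VALUE;
        for (long t : timestamps) {
            max = Math.max(max, t);
        }
        return max;
    }

    public static void main(String[] args) {
        // An input of this size is far deeper than a per-element
        // recursion could handle on a default JVM thread stack.
        long[] timestamps = new long[2_000_000];
        timestamps[timestamps.length - 1] = 42L;
        System.out.println(maxTimestampIterative(timestamps)); // prints 42
    }
}
```

The trade-off is the usual one: the loop carries its state in local variables (or in an explicit heap-allocated stack for tree-shaped traversals), so the limit becomes heap size rather than thread stack size.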