CASSANDRA-2435

1. Symptom

Auto bootstrap happened on already bootstrapped nodes. This bug will be triggered when the following events happen:

- data move (when address space is changed)

- restart the non-seeded machines

This will result the restarted machines to do the auto-bootstrap (which they shouldn’t). This will change the address space organization, so the data that were hosted by these restarted machines won’t be reachable since the address space changed. If some data is not replicated on the unaffected machines, then these data effectively won’t be visible to the users (though the data is still on those restarted machines).

 

Category (in the spreadsheet):

Data loss

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

Yes (when they read the data)

 

1.2.1 Were there multiple exceptions?

No

1.3 Was there a long propagation of the failure?

No. Only a restart is required

 

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

All clients reading those data hosted on the affected nodes.

 

Catastrophic? (spreadsheet column)

No (not really a dataloss, but the user may be tricked to believe it is)

2. How to reproduce this failure

2.0 Version

0.72

2.1 Configuration

1 seed node, 1 non-seed node

# of Nodes?

2 (need 2 nodes to move)

2.2 Reproduction procedure

Move the a cluster with seed node and non-seed node. Then restart. Non-seed node will initiate bootstrap procedure again.

Events column in the spreadsheet.

1. move

2. restart

 

Num triggering events

2

 

2.2.1 Timing order (Order important column)

yes

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes

2.4 How many machines needed?

At least two

3. Diagnosis procedure

Error msg?

yes (when you read data)

3.1 Detailed Symptom (where you start)

When auto-bootstrap is set to true and a cluster move is performed, auto bootstrap will initiate on non-seed nodes.

3.2 Backward inference

Move consists of unbootstrap and bootstrap to new location. There was a refactoring in 0.7 development where setBootstrapped(true) was removed from finishBootstrapping. So when restarted, Cassandra does not know the non-seed node has already bootstrapped thus goes into autobootstrap code path.

 

4. Root cause

At the end of the move operation, developers forgot to set the flag indicating it has already bootstrapped to true. Therefore after it restarted, the logic will check this flag, and it sees it has not been bootstrapped (because the flag is false), and proceeds to perform the auto-bootstrap.

The fix is simply set the flag to true at the end of move operation.

4.1 Category:

semantic

4.2 Are there multiple fault?

no

4.2 Can we automatically test it?

No

5. Fix

5.1 How?

The fix is simply set the flag to true at the end of move operation.

6.Any interesting findings?

Refactoring and forgetting something small can cause a catastrophic failure.