Checkpoint exception causes fatal damage to fsimage
Yes. The triggering condition was an NN restart during checkpointing.
Yes. 1. NN restart; 2. fsimage fatal damage.
1. Force a checkpoint in 2nd NN (feature start)
2. NN rollFSImage (feature start)
3. shutdown NN (disconnect)
4. start NN (add node)
In this order
No. This is not a deterministic bug. It is triggered only when the NN shutdown occurs right after latestNameCheckpointTime & latestEditsCheckpointTime.
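The timing dependence can be illustrated with a toy model. This is only a sketch, not Hadoop's code: the class name, the fields, and the two-step roll are hypothetical, and it assumes the image checkpoint time is written before the edits checkpoint time. Under that assumption, only a shutdown that lands between the two writes leaves the times mismatched on restart.

```java
// Toy model only: the class, fields, and the two-step roll are hypothetical
// and do not reproduce Hadoop's actual FSImage code.
public class CheckpointTimingSketch {
    long imageCheckpointTime = 100; // stands in for latestNameCheckpointTime
    long editsCheckpointTime = 100; // stands in for latestEditsCheckpointTime

    /**
     * Rolls to a new checkpoint time in two separate steps.
     * crashAfterStep: 0 = shutdown before any write,
     *                 1 = shutdown after the image time is written,
     *                 2 = both writes complete.
     */
    void roll(long newTime, int crashAfterStep) {
        if (crashAfterStep < 1) return;
        imageCheckpointTime = newTime;   // assumed to happen first
        if (crashAfterStep < 2) return;  // simulated NN shutdown in the window
        editsCheckpointTime = newTime;
    }

    /** The consistency condition checked when the NN starts again. */
    boolean consistentOnRestart() {
        return imageCheckpointTime == editsCheckpointTime;
    }

    /** Runs one roll with the given shutdown point and reports the result. */
    public static boolean restartIsConsistent(int crashAfterStep) {
        CheckpointTimingSketch nn = new CheckpointTimingSketch();
        nn.roll(200, crashAfterStep);
        return nn.consistentOnRestart();
    }

    public static void main(String[] args) {
        for (int step = 0; step <= 2; step++) {
            System.out.println("shutdown after step " + step
                + " -> consistent on restart: " + restartIsConsistent(step));
        }
    }
}
```

In this model, a shutdown before the first write or after the second one is harmless, which matches the observation that the failure depends on the exact moment the NN goes down.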
2. NN + 2NN.
Massive data loss (because the FSImage was broken). loadFSImage fails:
throw new IOException("Inconsistent storage detected, " +
"image and edits checkpoint times do not match. " +
"image checkpoint time = " + latestNameCheckpointTime +
"edits checkpoint time = " + latestEditsCheckpointTime);
The failure was caused by incorrect logic in handling an error case:
// Load latest edits
if (latestNameCheckpointTime > latestEditsCheckpointTime)
// the image is already current, discard edits <-- This is an error case, caused by the NN failing at a particular time (see the 4 events above), and it was handled incorrectly!
needToSave |= true;
When the NN fails at a particular time, causing latestNameCheckpointTime to be larger than latestEditsCheckpointTime, loadFSImage did not handle the case correctly, which resulted in a damaged FSImage.
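One way to make this case harder to mishandle is to classify the two checkpoint times explicitly before loading anything. The following is a minimal sketch under stated assumptions: the class, enum, and method names are hypothetical, and it is not the fix that was applied in Hadoop. It only shows the image-newer-than-edits case being surfaced as its own state instead of being folded into the "image is already current" path.

```java
// Hypothetical sketch: names are illustrative and not part of Hadoop.
public class CheckpointTimeCheckSketch {

    /** The three possible relations between the two checkpoint times. */
    public enum State { CONSISTENT, IMAGE_NEWER_THAN_EDITS, EDITS_NEWER_THAN_IMAGE }

    /** Classifies the checkpoint times read from the storage directories. */
    public static State classify(long latestNameCheckpointTime,
                                 long latestEditsCheckpointTime) {
        if (latestNameCheckpointTime == latestEditsCheckpointTime) {
            return State.CONSISTENT;
        }
        // The case from the report: the NN went down mid-checkpoint.
        // It is reported separately rather than treated as a normal load.
        return latestNameCheckpointTime > latestEditsCheckpointTime
            ? State.IMAGE_NEWER_THAN_EDITS
            : State.EDITS_NEWER_THAN_IMAGE;
    }

    /** True when the caller should run a dedicated recovery path. */
    public static boolean requiresExplicitHandling(State state) {
        return state != State.CONSISTENT;
    }

    public static void main(String[] args) {
        State s = classify(200L, 100L);
        System.out.println(s + ", explicit handling needed: "
            + requiresExplicitHandling(s));
    }
}
```

Separating classification from the action taken keeps the error case visible to tests: each state can be exercised directly without having to reproduce the shutdown timing.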
Incorrect error handling (handled, statement coverage).
Yes: 1. NN shutdown, 2. bug.