HDFS-4423 report

1. Symptom

Checkpoint exception causes fatal damage to fsimage

1.1 Severity


1.2 Was there exception thrown?

Yes. The trigger was an NN restart during checkpointing.

1.2.1 Were there multiple exceptions?

Yes. 1. NN restart; 2. fsimage fatal damage.

2. How to reproduce this failure

2.0 Version


2.1 Configuration


2.2 Reproduction procedure

1. Force a checkpoint in 2nd NN (feature start)

2. NN rollFSImage (feature start)

3. shutdown NN (disconnect)

4. start NN (add node)

2.2.1 Timing order

The events must occur in the order listed above.

2.2.2 Events order externally controllable?

No. This is not a deterministic bug. It is triggered only when the NN shutdown occurs in the narrow window that leaves latestNameCheckpointTime and latestEditsCheckpointTime out of step with each other.
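The timing window can be illustrated with a small model. This is only a sketch and not the HDFS implementation: the class CheckpointWindowSketch, the method roll, and the assumed write order (image time first, then edits time) are hypothetical and chosen to match the symptom in this report, where the image checkpoint time ends up ahead of the edits checkpoint time.

```java
// Simplified model, NOT HDFS code: two checkpoint timestamps are persisted
// one after the other, with an optional simulated crash between the writes.
final class CheckpointWindowSketch {
    // Hypothetical stand-ins for the two values that loadFSImage compares.
    long latestNameCheckpointTime;
    long latestEditsCheckpointTime;

    /**
     * Rolls both timestamps to newTime. If crashBetweenWrites is true, the
     * process "dies" after the first write (assumed order: image, then edits).
     */
    void roll(long newTime, boolean crashBetweenWrites) {
        latestNameCheckpointTime = newTime;
        if (crashBetweenWrites) {
            return; // simulated NN shutdown inside the window
        }
        latestEditsCheckpointTime = newTime;
    }

    /** True when the on-disk state matches the error case in this report. */
    boolean imageNewerThanEdits() {
        return latestNameCheckpointTime > latestEditsCheckpointTime;
    }

    public static void main(String[] args) {
        CheckpointWindowSketch clean = new CheckpointWindowSketch();
        clean.roll(100L, false);
        System.out.println("clean roll, image newer than edits: "
                + clean.imageNewerThanEdits());

        CheckpointWindowSketch crashed = new CheckpointWindowSketch();
        crashed.roll(100L, true);
        System.out.println("crashed roll, image newer than edits: "
                + crashed.imageNewerThanEdits());
    }
}
```

Because the shutdown has to land between the two writes, the failure depends on timing that cannot easily be forced from outside, which is consistent with the bug being non-deterministic.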

2.3 Can the logs tell how to reproduce the failure?


2.4 How many machines needed?

2. NN + 2NN.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Massive data loss occurs because the FSImage is broken. loadFSImage fails with:

    throw new IOException("Inconsistent storage detected, " +
                      "image and edits checkpoint times do not match. " +
                      "image checkpoint time = " + latestNameCheckpointTime +
                      "edits checkpoint time = " + latestEditsCheckpointTime);

3.2 Backward inference

The failure traces back to the previous NN startup, when the logic that handles an error case in loadFSImage was wrong:

    // Load latest edits
    if (latestNameCheckpointTime > latestEditsCheckpointTime)
      // the image is already current, discard edits
      // NOTE (report author): this is actually an error case, caused by the NN
      // failing at a particular timing (see the 4 events in 2.2), and it is
      // handled incorrectly!
      needToSave |= true;

4. Root cause

When the NN fails at a particular point in time, latestNameCheckpointTime ends up larger than latestEditsCheckpointTime. loadFSImage did not handle this case correctly, which left the FSImage damaged.
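One way to make this condition explicit, rather than silently treating it as "the image is already current", is to classify the outcome of the comparison before acting on it. The following is a minimal sketch only and is not presented as the fix applied for HDFS-4423: the class CheckpointTimeCheckSketch, the enum StorageState, and the method classify are hypothetical names introduced for illustration.

```java
// Illustrative only, NOT HDFS code: names the three possible outcomes of
// comparing the two checkpoint times so that each one is handled on purpose.
final class CheckpointTimeCheckSketch {
    // Hypothetical labels for the relationship between the two timestamps.
    enum StorageState { CONSISTENT, IMAGE_NEWER_THAN_EDITS, EDITS_NEWER_THAN_IMAGE }

    static StorageState classify(long latestNameCheckpointTime,
                                 long latestEditsCheckpointTime) {
        if (latestNameCheckpointTime == latestEditsCheckpointTime) {
            return StorageState.CONSISTENT;
        }
        if (latestNameCheckpointTime > latestEditsCheckpointTime) {
            // The case this report identifies as an error condition.
            return StorageState.IMAGE_NEWER_THAN_EDITS;
        }
        return StorageState.EDITS_NEWER_THAN_IMAGE;
    }

    public static void main(String[] args) {
        System.out.println(classify(100L, 100L));
        System.out.println(classify(200L, 100L));
        System.out.println(classify(100L, 200L));
    }
}
```

A caller could log or refuse to proceed on IMAGE_NEWER_THAN_EDITS instead of discarding the edits, but what the correct recovery action is depends on the storage layout and is outside the scope of this sketch.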

4.1 Category:

Incorrect error handling (handled, statement coverage).

4.2 Are there multiple faults?

Yes: 1. NN shutdown, 2. bug.