HDFS-1921

1. Symptom

The NameNode (NN) is unable to start or restart; as a consequence, the failover feature is effectively disabled.

1.1 Severity

Blocker

1.2 Was there exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

Yes. An IOException during saveNamespace (mishandled), followed later by an exception when the NN fails to start.

1.3 Scope of the failure

Entire data center, because no NN will start after this failure.

2. How to reproduce this failure

2.0 Version

0.22.0

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Trigger the namespace saving (e.g., when a checkpoint starts).

2. Have the image save encounter an IOException from the underlying file system (e.g., a disk error or data corruption); see the fault-injection sketch below.
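
To make step 2 deterministic in a test, the write path of the image file has to fail part-way through. A minimal, hypothetical fault-injection sketch (not part of HDFS; the class name and threshold are illustrative) that simulates such an underlying file-system error:

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Wraps an OutputStream and fails after a fixed number of bytes,
    // simulating a disk error while the image file is being written.
    public class FaultyOutputStream extends FilterOutputStream {
      private final long failAfterBytes;
      private long written;

      public FaultyOutputStream(OutputStream out, long failAfterBytes) {
        super(out);
        this.failAfterBytes = failAfterBytes;
      }

      @Override
      public void write(int b) throws IOException {
        if (++written > failAfterBytes) {
          throw new IOException("Injected fault: simulated disk error at byte " + written);
        }
        out.write(b);
      }
    }

Wrapping the stream used for the image file with such a class makes the IOException fire reliably in the middle of step 2 of the sequence described in 2.2.2.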

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

No. Event 2 has to occur within a particular window:

 FSImage.saveNamespace(...) does the following for every storage directory:

  1. rename current to lastcheckpoint.tmp,
  2. save the image and recreate the edits log,
  3. rename lastcheckpoint.tmp to previous.checkpoint.

The IOException needs to occur between steps 2 and 3 for this failure to trigger; see the sketch below.
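
A minimal sketch of this sequence, assuming a simplified directory layout (illustrative only, not the actual FSImage code), shows why the window between steps 2 and 3 matters: if the image write fails there, the storage directory is left with lastcheckpoint.tmp and no consistent current directory, which is exactly the state the NN later cannot start from.

    import java.io.File;
    import java.io.IOException;

    // Illustrative sketch of the checkpoint rename protocol
    // (directory layout and helper names are assumptions).
    public class SaveNamespaceSketch {

      void saveNamespace(File storageRoot) throws IOException {
        File current  = new File(storageRoot, "current");
        File lastCkpt = new File(storageRoot, "lastcheckpoint.tmp");
        File previous = new File(storageRoot, "previous.checkpoint");

        // Step 1: current -> lastcheckpoint.tmp
        if (!current.renameTo(lastCkpt)) {
          throw new IOException("rename current -> lastcheckpoint.tmp failed");
        }

        // Step 2: write a fresh image and empty edits into a new "current".
        // If an IOException escapes HERE, the directory holds only
        // lastcheckpoint.tmp (or a half-written current) and step 3 never runs.
        writeImageAndEdits(current);

        // Step 3: lastcheckpoint.tmp -> previous.checkpoint
        if (!lastCkpt.renameTo(previous)) {
          throw new IOException("rename lastcheckpoint.tmp -> previous.checkpoint failed");
        }
      }

      void writeImageAndEdits(File currentDir) throws IOException {
        // placeholder for the actual image/edits serialization
        if (!currentDir.mkdirs()) {
          throw new IOException("could not recreate " + currentDir);
        }
      }
    }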

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

2 (the NN plus the second NN that would take over on failover)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The NN fails to start because it cannot load the edits log.

3.2 Backward inference

From the log, you can also see that an exception had previously been thrown while the NN was saving the image in saveNamespace.

3.3 Are the printed log sufficient for diagnosis?

Yes

4. Root cause

The developers simply did not anticipate that an error could occur while saving the image:

-      FSImageSaver saver = new FSImageSaver(sd, errorSDs);
-      Thread saveThread = new Thread(saver, saver.toString());
-      saveThreads.add(saveThread);
-      saveThread.start();
+      if (errorSDs.contains(sd)) {
+        continue;
+      }
+      try {
+        FSImageSaver saver = new FSImageSaver(sd, errorSDs);
+        Thread saveThread = new Thread(saver, saver.toString());
+        saveThreads.add(saveThread);
+        saveThread.start();
+      } catch (Exception e) {
+        LOG.error("Failed save to edits directory " + sd.getRoot(), e);
+        errorSDs.add(sd);
+        continue;
+      }
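
The essence of the fix is per-directory fault isolation: a failure in one storage directory is recorded in errorSDs and the remaining directories are still saved, instead of letting the exception abort the whole rename sequence between steps 2 and 3. A generic, hypothetical sketch of that pattern (illustrative names, not the HDFS implementation):

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the fault-isolation pattern the patch introduces.
    public class PerDirectorySave {

      void saveAll(List<File> storageDirs) throws IOException {
        List<File> errorDirs = new ArrayList<File>();

        for (File dir : storageDirs) {
          if (errorDirs.contains(dir)) {
            continue;                 // skip directories already known to be bad
          }
          try {
            saveImage(dir);           // may throw on a failing disk
          } catch (IOException e) {
            System.err.println("Failed to save image to " + dir + ": " + e);
            errorDirs.add(dir);       // record the failure and keep going
          }
        }

        // Give up only if every storage directory failed.
        if (errorDirs.size() == storageDirs.size()) {
          throw new IOException("Failed to save the namespace to any storage directory");
        }
      }

      void saveImage(File dir) throws IOException {
        // placeholder for the actual image serialization
      }
    }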

In the buggy version, this IOException was eventually caught by an upper-level function:

 public void run() {
   // Check the size of the edit log once every 5 minutes.
   long periodMSec = 5 * 60;   // 5 minutes
   if (checkpointPeriod < periodMSec) {
     periodMSec = checkpointPeriod;
   }
   periodMSec *= 1000;

   long lastCheckpointTime = 0;
   if (!backupNode.shouldCheckpointAtStartup()) {
     lastCheckpointTime = now();
   }

   while (shouldRun) {
     try {
       long now = now();
       boolean shouldCheckpoint = false;
       if (now >= lastCheckpointTime + periodMSec) {
         shouldCheckpoint = true;
       } else {
         long txns = countUncheckpointedTxns();
         if (txns >= checkpointTxnCount)
           shouldCheckpoint = true;
       }
       if (shouldCheckpoint) {
         doCheckpoint();
         lastCheckpointTime = now;
       }
     } catch (IOException e) {
       LOG.error("Exception in doCheckpoint: ", e);
     } catch (Throwable e) {
       LOG.error("Throwable Exception in doCheckpoint: ", e);
       shutdown();
       break;
     }

4.1 Category:

Incorrect error handling (handled, missed error subtype).

As you can see from the patch above, the developers did not anticipate that an error could be thrown during step 2 of the sequence above.

The handler of this IOException, however, is effectively empty: it only logs the exception and continues, leaving the storage directory in its inconsistent state.